Prediction Weighted Maximum Frequency Selection
Hongmei Liu, J. Sunil Rao

TL;DR
This paper introduces a novel variable selection method combining adaptive LASSO and Elastic-Net estimators with a prediction-weighted maximum frequency approach, achieving consistent model selection and improved finite sample performance.
Contribution
It develops a new strategy for variable selection that enhances finite sample performance and consistency using prediction-weighted maximum frequency models with adaptive estimators.
Findings
Achieves consistent model selection with divergence of p_n.
Demonstrates excellent finite sample performance in simulations.
Extends methodology to generalized linear models.
Abstract
Shrinkage estimators that possess the ability to produce sparse solutions have become increasingly important to the analysis of today's complex datasets. Examples include the LASSO, the Elastic-Net and their adaptive counterparts. Estimation of penalty parameters still presents difficulties however. While variable selection consistent procedures have been developed, their finite sample performance can often be less than satisfactory. We develop a new strategy for variable selection using the adaptive LASSO and adaptive Elastic-Net estimators with diverging. The basic idea first involves using the trace paths of their LARS solutions to bootstrap estimates of maximum frequency (MF) models conditioned on dimension. Conditioning on dimension effectively mitigates overfitting, however to deal with underfitting, these MFs are then prediction-weighted, and it is shown that not only can…
| Criteria | Ten-fold CV error | Test error | Number of genes |
|---|---|---|---|
| WMF | 0/38 | 5/34 | 5 |
| CV | 0/38 | 4/34 | 13 |
| Cp | 0/38 | 4/34 | 18 |
| BIC | 0/38 | 4/34 | 18 |
| EBIC | 1/38 | 6/34 | 5 |
| GIC | 1/18 | 6/34 | 5 |
| Criteria | Ten-fold CV error | Test error | Number of genes |
|---|---|---|---|
| WMF | 0/38 | 4/34 | 10 |
| CV | 1/38 | 6/34 | 42 |
| Cp | 1/38 | 6/34 | 36 |
| BIC | 1/38 | 7/34 | 34 |
| EBIC | 1/38 | 7/34 | 34 |
| GIC | 1/38 | 7/34 | 21 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Direction-of-Arrival Estimation Techniques · Soil Geostatistics and Mapping
**Prediction Weighted Maximum Frequency Selection
**Hongmei Liu and J. Sunil Rao *Division of Biostatistics, University of Miami, Miami, FL 33136
Shrinkage estimators that possess the ability to produce sparse solutions have become increasingly important to the analysis of today’s complex datasets. Examples include the LASSO, the Elastic-Net and their adaptive counterparts. Estimation of penalty parameters still presents difficulties however. While variable selection consistent procedures have been developed, their finite sample performance can often be less than satisfactory. We develop a new strategy for variable selection using the adaptive LASSO and adaptive Elastic-Net estimators with diverging. The basic idea first involves using the trace paths of their LARS solutions to bootstrap estimates of maximum frequency (MF) models conditioned on dimension. Conditioning on dimension effectively mitigates overfitting, however to deal with underfitting, these MFs are then prediction-weighted, and it is shown that not only can consistent model selection be achieved, but that attractive convergence rates can as well, leading to excellent finite sample performance. Detailed numerical studies are carried out on both simulated and real datasets. Extensions to the class of generalized linear models are also detailed. Key Words: Adaptive LASSO, adaptive Elastic-Net, model selection, bootstrap.
1 Introduction
Consider the standard linear regression model
[TABLE]
where is a vector of responses, is an design matrix of predictors, is a vector of unknown regression parameters, is a vector of independent and identically distributed (i.i.d.) random errors. We allow to increase with .
Because some elements of might be [math], a family of penalized least squares estimators were developed for variable selection and estimation,
[TABLE]
where is the -norm, are regularization parameters, and is positive valued for . [1] pointed out that through variable selection one can focus on a small number of important predictors for enhanced scientific discovery and potentially improve prediction performance by removing noise variables.
Penalized -regression is a special case of (1.2) with , which includes the best subset selection for ; the LASSO [2] for and the ridge regression [3] for . Best subset selection is known to be computationally infeasible for high dimensional data and inherently discrete in variable selection [4]. Ridge regression does not possess a variable selection property. The LASSO however, can do simultaneous estimation and variable selection because its penalty is singular at the origin and can shrink some coefficients to exact 0 with a sufficiently large [5]. Other penalized least squares estimators that can do simultaneous estimation and variable selection include the SCAD [5] and adaptive LASSO [6] both enjoying the oracle properties [5]; the Elastic-Net [7] capable of detecting grouped effects; the adaptive Elastic-Net [8] combining advantages of the adaptive LASSO and Elastic-Net; and etc.
Selection of is essential in above penalized least squares estimation procedures. Although methods such as the SCAD, adaptive LASSO and adaptive Elastic-Net enjoy the oracle properties asymptotically, their optimal properties rely on particular specifications of the , whose magnitude controls the complexity of a selected model and trade-off between bias and variance in estimators [9]. The multi-fold cross-validation (CV) and generalized cross-validation (GCV) are frequently applied for the tuning parameters selection [2, 5, 6, 7]. But they overfit the model asymptotically [10]. For consistent variable selection, [11] suggested to use the BIC in adaptive LASSO and a modified BIC when is diverging [12]; [13] introduced an extended BIC (EBIC) for linear models and then generalized it to generalized linear models [14]; [15] put forward a generalized information criterion (GIC) with diverging; [16] provided a consistent cross-validation procedure (CCV) for the LASSO; [17] proposed the stability selection (SS) for their randomized LASSO. Although variable selection consistency was established for these procedures, their finite sample performance can often be less than optimal (Section 6 ahead demonstrates this in simulation studies).
We propose a new method for tuning parameters selection, focusing in particular on the adaptive LASSO and adaptive Elastic-Net estimators. A simple example helps to illustrate the basic idea. Consider the adaptive LASSO in following example.
Example 1. Data are drawn from model (1.1) with , row vectors of the design matrix with and for . So the true model size here is 3.
Figure 1 (top) shows the full adaptive LASSO solution path from the LARS algorithm [18]. In the figure, each step indicates a dimension change in the estimator. These steps are called transition points in [19]. They showed that if using information criteria such as the AIC or BIC to identify the optimal in adaptive LASSO, it lies in one of the transition points. This result helps to justify uses of the LARS algorithm and our subsequent focus on the transition points. Then the question remains about how to choose from these transition points.
Figure 1 (middle and bottom) gives a brief look at our proposed method. The middle panel shows the estimated maximum frequency (MF) of a candidate model given the dimension. The MF estimation is done by a bootstrapping algorithm using the transition points. The strategy of conditioning on dimension has two important consequences: i) for overfit dimensions, the MFs are dramatically smaller than the true dimension MF (other than the full model of course), and ii) underfit dimensions can also produce large MF values. Point i) is important because for variable selection, overfitting is usually much more difficult to deal with. So one must now deal with the underfitting issue. We do this by introducing a prediction-based weight to the MFs (labeled as WMF). The results are shown in bottom panel of Figure 1. As is evident, now the true dimension, which maps to the true model, stands out beautifully from all others.
The rest of the paper is organized as follows. In Section 2, we briefly review the adaptive LASSO and adaptive Elastic-Net estimators and introduce the bootstrap algorithm for each. In Section 3, the MF procedure itself is described and its underlying properties are carefully examined using a simple orthogonal design. In Section 4, asymptotic properties of the MF procedure are established in general settings. The WMF procedure and its variable selection consistency are presented in Section 5. Comprehensive simulation studies are shown in Section 6. Section 7 describes extensions of the MWF procedure to generalized linear models (GLMs). Applications of the WMF procedure to ultra-high dimensional data are discussed in Section 8.
2 Bootstrapping the adaptive LASSO and adaptive Elastic-Net estimators
Denote the true value of with model size , and a consistent estimate of . The adaptive LASSO [6] estimator is
[TABLE]
where , . It was suggested to use the ordinary least-squares (OLS) estimator or the best ridge estimator (if collinearity exists) for . Under certain regularity conditions, was shown to enjoy the oracle properties.
The Elastic-Net estimator [7] is
[TABLE]
It overcomes several limitations pertaining to the LASSO: 1) the added penalty is strictly convex to allow grouping effects; 2) In a case, it can potentially estimate all predictors, while the LASSO can only find at most predictors.
[8] proposed the adaptive Elastic-Net to combine strengths of the Elastic-Net and adaptive LASSO. The adaptive Elastic-Net estimator is
[TABLE]
where , and is the Elastic-Net estimator in (2.2). Note that takes the same value for the penalty function in (2.2) and (2.3), because the penalty contributes to the same kind of grouping effects. On the other hand, and are allowed to be different as they control the sparsity in estimators. Under some regularity conditions, was shown to enjoy the oracle properties.
We now detail bootstrapping for these two estimators. There are typically two ways of generating bootstrap observations for model (1.1) [20].
-
Bootstrapping pairs [21]. Let be the empirical distribution putting mass on each data pair . Generate i.i.d. paired bootstrap data from . The bootstrap analog of , denoted as , is to replace with in (2.1) where and . So is the bootstrap analog of , denoted as . Under the weak condition that , almost surely [20].
-
Bootstrapping residuals [22]. Calculate the th residual
[TABLE]
where is a ridge estimate of . Generate i.i.d. bootstrap residuals from the empirical distribution that puts mass on each centered residual, , where is the average of . Then the i.i.d. residual bootstrap data is where . The bootstrap analog of , denoted as , is to substitute with in (2.1). So is the bootstrap analog of , denoted as .
In next section, we introduce the MF procedure which takes use of above bootstrap estimators.
3 The MF procedure
Denote a -dimensional candidate model from the th bootstrap data as .
Remark 1**.**
In the 4th step, the full model is excluded because it will destroy the maximum frequency rule by having the highest frequency, , all the time. If there is a tie at the maximum of , we select the one at the highest dimension. This strategy guarantees asymptotic variable selection consistency of the MF procedure, which will be discussed in Section 4.
The MF procedure for adaptive Elastic-Net is in parallel. But in the 2nd step, the LARS-EN algorithm [7] is used instead to fit each bootstrap data.
We discussed in introduction to this paper consequences of the MF procedure by conditioning on dimension. Here we use a simple orthogonal design with i.i.d. normal random errors to study underlying properties driving that performance. In this case, we have and the adaptive Elastic-Net reduces automatically to the adaptive LASSO [8]. Denote the th column of . Then the adaptive LASSO estimator is
[TABLE]
where equals to if otherwise 0. We can expand the by
[TABLE]
where . The following Lemma gives an order relationship for ’s.
Lemma 1**.**
Suppose , then we have
[TABLE]
for .
In combine with the fact that asymptotically for , it is easy to deduce from (3.1) that given a adaptive LASSO tends to select those variables, corresponding to the first largest ’s, with the highest probability.
Without loss of generality, suppose is decreasingly ordered. Denote a -dimensional model containing the first elements of , and denote any other -dimensional models. Let be an adaptive LASSO model estimate given the model size is , indicates the conditional probability of given the model size. Then preceding deductions from (3.1) can be formularized as
[TABLE]
where and are two -dimensional models s.t. .
Above properties of the adaptive LASSO coincides to some extent with the results of Theorem 2 in [23]. By (3.3), zero predictors will be equally likely selected at an overfit dimension. As a result (see Algorithm 1 for definition of ), , drops down dramatically, which is why we see a huge gap between the true dimension and overfit dimensions in Figure 1 (middle). On the other hand, at some underfit dimensions can be as competitive as . We propose a WMF procedure to tackle this underfitting issue in Section 5.
In next section, we show asymptotic variable selection properties for and in general settings, from which variable selection consistency of the MF procedure can be deduced.
4 Asymptotic properties of the MF procedure
Let be the true model. We assume following regularity conditions for subsequent theoretical studies:
(A1) Denote and the minimum and maximum eigenvalues of a positive definite matrix . We assume
[TABLE]
where and are two positive constants.
(A2) and . The last inequation is to ensure in (A3)–(A4). Moreover,
[TABLE]
(A3) In adaptive LASSO,
[TABLE]
and
[TABLE]
(A4) In adaptive Elastic-Net,
[TABLE]
and
[TABLE]
[TABLE]
(A5) The errors are i.i.d. with mean 0 and variance .
Denote an adaptive LASSO estimate of using paired bootstrap data. Let and where . Then indicates the conditional probability of given and .
Theorem 1**.**
Suppose conditions (A1)–(A3) and (A5) hold, then
[TABLE]
Moreover, let be another tuning parameter such that the adaptive LASSO estimator under is of dimension , , then
[TABLE]
where is any -dimensional model.
Proofs of Theorem 1 are included in Appendix A.
In adaptive LASSO, given a is equivalent to given a dimension, but the converse is not true. One dimension can be mapped to numerous models, as a result to numerous tuning parameters. Fortunately however, the LARS algorithm enables us to map a dimension to an optimal . Recall the adaptive LASSO solution path from the LARS in top panel of Figure 1. Transition points (e.g. steps) from 0 to 10 corresponds to a sequence of ’s:
[TABLE]
Note that for where is the adaptive LASSO estimator under . By Theorem 5 in [19],
[TABLE]
where is the number of non-zero elements in and is a positive sequence depending on . It is worth mentioning that is optimum in by producing the minimum sum of squared errors (SSE) and the smallest model size concurrently.
Also note that the number of steps can exceed the full model size — different steps may have a same model size. Denote the last step having a model size , and is another step having the same model size. The theorem also showed that
[TABLE]
Theorefore, is the overall optimum in . So the LARS algorithm enables us to create a one-to-one map between a dimension and the optimum ,
[TABLE]
It is easy to see that will satisfy condition (A3). Hence, we have the following corollary from Theorem 1.
Corollary 1**.**
Suppose conditions (A1)–(A2) and (A5) hold, then
[TABLE]
where is any -dimensional model.
This result can also be established for adaptive Elastic-Net. Denote an adaptive Elastic-Net estimate of using paired bootstrap data.
Corollary 2**.**
Suppose conditions (A1)–(A2) and (A5) hold, then
[TABLE]
where is any -dimensional model.
Proof.
It can be proved by using the techniques for deriving Theorem 1, Corollary 1 and Theorem 2. We bypass here. ∎
We now study the estimation properties for using residual bootstrap data. Denote an adaptive Elastic-Net estimator of using residual bootstrap data.
Theorem 2**.**
Suppose conditions (A1)–(A2) and (A4)–(A5) hold, then
[TABLE]
Moreover, let be another tuning parameter such that the adaptive Elastic-Net estimator under is of dimension , , then
[TABLE]
where is any -dimensional model.
Proofs of Theorem 2 are included in Appendix A. The LARS-EN algorithm for adaptive Elastic-Net estimations is an extension of the LARS algorithm, which shares the same properties of the LARS for deriving Corollaries 1–2. Hence we obtain the following corollary from Theorem 2.
Corollary 3**.**
Suppose conditions (A1)–(A2) and (A5) hold, then
[TABLE]
where is any -dimensional model.
This result can also be established for adaptive LASSO. Denote an adaptive LASSO estimate of using residual bootstrap data.
Corollary 4**.**
Suppose conditions (A1)–(A2) and (A5) hold, then
[TABLE]
where is any -dimensional model.
Proof.
Note that the adaptive LASSO estimator is a special case of the adaptive Elastic-Net estimator with . Theorem 2 holds automatically for , from which Corollary 4 can be deduced. ∎
Variable selection consistency of the MF procedure can then be deduced from Corollaries 1–4.
Corollary 5**.**
Suppose conditions (A1)–(A2) and (A5) hold. Then the MF procedure is variable selection consistent, e.g.
[TABLE]
where is the model selected from the MF procedure.
Proof.
By definition, is an adaptive LASSO estimate of using paired or residual bootstrap data. It is easy to see that
[TABLE]
Combining with Corollaries 1 or 4,
[TABLE]
Thus the MF procedure for adaptive LASSO can consistently identify the true dimension and true model via selecting the maximum of , with the highest dimension (if there is a tie). Similarly, Corollary 2 and 3 imply variable selection consistency of the MF procedure for adaptive Elastic-Net. ∎
However, the MF procedure has potential issues in application. In Figure 1 (middle) excluding the full model case, the maximum occurs at dimension 1 instead of 3 although their MFs are both close to 1. In next section, we propose a WMF procedure to tackle this underfitting issue in application.
5 The WMF procedure
5.1 Method and Asymptotic properties
The underfitting issue in MF procedure can be deduced from Corollaries 1–4. Take for an example. Although it was shown that , the conditional probability at some underfit dimensions can also reach one, e.g. . Note that the tuning parameter leading to an underfit -dimensional estimator, denoted as , fulfills . Hence, the convergence rate of at some underfit dimensions can exceed the one at the true dimension. Therefore, the MF procedure would select an underfit model even with a sufficiently large .
In order to fix things, we introduce a weight to the MF procedure. An effective weight should be able to down-weight the underfitting MFs asymptotically, i.e. the weight is able to identify underfit dimensions and its effects does not vanish as , without significantly up-weighting the overfitting MFs.
[24] showed that the overall unconditional (on ) expected squared prediction error for the OLS estimator of under model is
[TABLE]
where indicates the size of , ,
, is a sub-matrix of whose columns are indexed by the components of and is an identity matrix.
When is a true or overfit model, it has and thus
[TABLE]
However, if is an underfit model, then for any fixed . He further assumed that
[TABLE]
which is argued in the paper to be a minimal type of asymptotic model identifiability condition. Under assumption (5.3) and by (5.1)–(5.2),
[TABLE]
where is an underfit model and is a true or overfit model. By (5.4) a formula inversely proportional to will be an ideal choice for the weight.
[25] proposed such a formula for estimating the posterior probability of the model size given the data
[TABLE]
where is an estimate of using a -dimensional model and , is a constant. We use the multi-fold CV for and define
[TABLE]
Figure 1 (bottom) shows the effect of weights in Example 1, which heavily punish underfitting MFs and have little effect on true and overfitting MFs. The WMF procedure then selects the dimension and model s.t.
[TABLE]
Recall that is a bootstrap version estimate of the posterior probability of model given the data and dimension, i.e. , along with (5.6) it has
[TABLE]
Note that BIC is a Laplace approximation to under a flat prior assumption and is variable selection consistent for adaptive LASSO [11, 12], but no convergence rate has been studied. Simulation studies in Section 6 show that BIC has a much slower empirical convergence rate than the WMF procedure.
Next we show properties of the multi-fold CV using adaptive LASSO or adaptive Elastic-Net estimators. Then variable selection consistency of the WMF procedure can be established. Let be a fixed integer and suppose . In multi-fold CV, one randomly divides a sample of observations into mutually exclusive subgroups with each subgroup containing observations, and selects the model by minimizing the following sum of squared errors
[TABLE]
where is an adaptive LASSO or adaptive Elastic-Net estimator under model using samples not in . Let and be the true or overfit models and be an underfit model. We assume following condition for asymptotic studies of the multi-fold CV procedure.
(A6) where is a positive definite matrix.
Theorem 3**.**
*Suppose conditions (A1)–(A2) and (A5)–(A6) hold, then
- the multi-fold CV for adaptive LASSO or adaptive Elastic-Net satisfies*
[TABLE]
2. model selected from the WMF procedure fulfills
[TABLE]
Proofs of Theorem 3 are included in Appendix A. Denote an underfit dimension. The ratio of is exponentially proportional to the bias term, , which is larger than 0 and does not fade as . This guarantees a good finite sample performance of the WMF procedure and a fast vanishing rate of its underfitting issues, which will be confirmed in simulation studies in Section 6.
5.2 Computation
In adaptive Elastic-Net, takes the same value in Elastic-Net for calculating the weights ’s, where the tuning parameters are chosen by minimizing the two-dimensional BIC [7]. Then computational efforts remain the same for adaptive LASSO and adaptive Elastic-Net, which are to compute a full solution path against ’s or ’s. Computational complexity of creating an entire adaptive LASSO solution path is of order [6]. It is of order for adaptive Elastic-Net[7]. Since the optimal value often occurs at an early stage, we could stop the algorithms after steps. In this case, the computational cost reduces to for adaptive LASSO and for adaptive Elastic-net.
Computational cost of a WMF procedure is then times the cost of computing an adaptive LASSO or adaptive Elastic-Net solution path.
6 Empirical studies
We now investigate empirical performances of the WMF procedure and show it outperforms the BIC, EBIC, GIC, SS, Cp, and 1se-CV (which is often recommended for variable selection) in a wide range of situations for both adaptive LASSO and adaptive Elastic-Net. The Cp did very poor in all scenarios, thus is excluded in the presentation.
In all simulations, data were generated from
[TABLE]
where and . Let for some constant , , . Results were averaged over 100 times of replications.
6.1 Simulations of the adaptive LASSO WMF procedure
Three scenarios were designed for the adaptive LASSO WMF procedure. In each scenario, and .
Scenario 1: Fixed low dimension and moderate proportion of true covariates. More specifically, set and . Then the proportion of true covariates is 0.3, and the signal to noise ratios (SNR) are respectively 2.03, 2 and 1.98 for various .
Scenario 2: Low dimension, moderate proportion of true covariates and weak signals for some true covariates. Specifically, set , then equals to 10, 17, 22 accordingly. Let grow with as follows. Initially and . Afterwards, increases by 1 for every 40-unit increment in and the new element equals to 1. As a result, the proportions of true covariates are respectively 0.3, 0.47, and 0.59, and the SNRs are 2, 2.85 and 3.69.
Scenario 3: High dimension, sparse proportion of true covariates and relatively large signals for all true covariates. In detail, set , then equals to 32, 72, 106 accordingly. Let grow in the same manner as in scenario 2, but the new elements equal to 2. Accordingly, the proportions of true covariates are 0.09, 0.11 and 0.12, and the SNRs are 2, 5.07, and 8.5.
Paired bootstrapping was used in the adaptive LASSO WMF procedure. Simulation results are summarized in Figures 2–4. In all scenarios, the proposed method has the highest degree of accuracy in identifying the true model and also enjoys a much faster convergence rate than other compared methods. The WMF procedure has an underfitting issue which vanishes quickly as increases. Other methods (except for the SS) however suffer from an overfitting issue. The sparser the model is, the more serious the issue tends to be. Performance of the SS relies on particular specifications of several unknown parameters. Although we have followed instructions in [17] for setting those parameters throughout the simulations, its performance remains erratic and unsatisfactory.
Simulations for using residual bootstrapping in the adaptive LASSO WMF procedure were also conducted. The results are presented in Appendix B, which are similar to those in above paired bootstrapping simulations.
6.2 Simulations of the adaptive Elastic-Net WMF procedure
We also designed three scenarios for the adaptive Elastic-Net WMF procedure, each of which mimics a typical structure in applications. Since the adaptive Elastic-Net fits data with grouping effects, in following simulations true covariates will be added in blocks with size 3. The SS is excluded due to its poor performance.
Scenario 4: Low dimension, moderate proportion of true covariates, weak signals for some true covariates and moderate correlations between covariates. More specifically, let , , and . Initially we have one block of true covariates, then . Elements of in the block equal to 2, the rest are 0. Afterwards, we add 1 block of true covariates for every 200-unit increment in and the new elements equal to 1. Respectively, the proportions of true covariates are 0.3, 0.35 and 0.41, and the SNRs are 2.45, 3.72 and 3.67.
Scenario 5: High dimension, sparse proportion of true covariates, relatively large signals for all true covariates and moderate correlations between covariates. In detail, let , , and . Initially set . Then true covariates follow the same adding scheme as in scenario 4. All non-zero elements in equal to 2. Respectively, the proportions of true covariates are 0.19, 0.13 and 0.11, and the SNRs are 1.79, 3.09 and 3.51.
Scenario 6: High dimension, sparse proportion of true covariates, relatively large signals for all true covariates and high correlations between grouped covariates. Specifically, let and . True covariates follow the same adding scheme as in Scenario 5, all non-zero elements in equal to 2. Moreover, true covariates within each block have correlations almost 1, while true covariates between the blocks have correlation 0. All noise covariates are i.i.d from . Respectively, the proportions of true covariates are 0.19, 0.13 and 0.11, and the SNRs are 2.84, 4.35 and 5.75.
Residual bootstrapping was used in the adaptive Elastic-Net WMF procedure. Simulation results are summarized in Figures 5–7. In scenarios 4 and 5, the proposed method has the best performance over other compared methods: on average the highest degree of accuracy in indentifying the true model; a faster convergence rate; the underfitting issue vanishes quickly. On the other hand, other methods suffer from an overfitting issue. The sparser the model is, the more serious the issue tends to be. In scenario 6, all methods do equally well because the adaptive Elastic-Net well fit the data with highly grouped effects.
Simulation results for using paired bootstrapping in the adaptive Elastic-Net WMF procedure are presented in Appendix B, which are similar to those in above residual bootstrapping simulations.
6.3 Classification analysis of the leukaemia data
We now demonstrate the WMF procedure in a real data application. The leukaemia data [26] contains genes and samples. We have 38 out of the 72 samples from the training dataset with 27 ALL’s (acute lymphoblastic leukaemia) and 11 AML’s (acute myeloid leukaemia). The remaining 34 samples are from the test dataset with 20 ALL’s and 14 AML’s. The goal of this analysis is to identify a subset of genes that can accurately predict the type of leukaemia for future data. Similar to [7], we coded the type of leukaemia as a binary response variable, denoted as , and defined the classification function as , where is the indicator function.
To improve computational efficiency, we selected 1000 candidate genes as the predictors using the sure independence screening (SIS) procedure [27]. The adaptive LASSO and adaptive Elastic-Net were then applied to explore the data. The screening and variable selection were carried out on the training dataset, while classification errors were examined on the test dataset. Both the LARS and LARS-EN algorithms were stopped after 200 steps of estimation to further reduce the computational costs. Note that since the optimal steps selected by various types of methods are much smaller than the stopping step, this strategy will not affect the variable selection.
Classification results are summarized in Tables 1–2. For adaptive LASSO, although the Cp, CV and BIC have obtained the minimal classification errors for both training and test datasets, the WMF has classification errors close to the minimum using the least number of genes. For adaptive Elastic-Net, the WMF has the minimal classification errors for both training and test datasets using the least number of genes. Thus we conclude that the WMF procedure is able to find the set of “important” genes that can largely improve the prediction accuracy.
7 Extensions
Here we investigate extensions of the WMF procedure to GLMs, which has the following generic density fuction [28]
[TABLE]
[6] had extended the adaptive LASSO to GLMs. Its estimator, , is obtained by maximizing the penalized log-likelihood,
[TABLE]
where , and is the maximum likelihood estimator. Under certain regularity conditions, was shown to enjoy the oracle properties .
The generalization of Multi-fold CV to GLMs is straightforward [24]. Define,
[TABLE]
where is a loss function, is the prediction of under model using samples not in .
Then we can extend the WMF procedure to GLMs for adaptive LASSO. In this case, we draw paired bootstrap samples in step 1 of Algorithm 1. Note that the LARS algorithm does not fit for GLMs, but we can use the coordinate descent algorithm [29] instead, which generates a solution path similar to the LARS. Hence in step 2, we use the coordinate descent algorithm to fit each bootstrap data. The rest remain the same. Asymptotic properties of the adaptive LASSO WMF procedure for GLMs can also be established by using some similar techniques for showing Theorem 1 in this paper and Theorem 4 in [6].
We demonstrate this extension through one simple example, where binary responses were generated from the logistic regression model
[TABLE]
where , , and . Simulation results were averaged over 100 times of replications and summarized in Figure 8. It shows that the WMF procedure is much more accurate in variable selection and also enjoys a faster convergence rate than other compared methods.
Extension of the adaptive Elastic-Net WMF procedure to GLMs is similar. Define the adaptive Elastic-Net estimator for GLMs as
[TABLE]
where , and is defined in (7.1) with for all ’s. The rest follow the same procedures for extension of the adaptive LASSO WMF procedure.
8 Ultra-high dimensional data
In this section, we discuss applications of the WMF procedure to ultra-high dimensional data in which . [27] proposed the sure independence screening (SIS) method for ultra-high dimensional data to reduce their dimensionality to a moderate scale, , s.t. . Afterwards a lower dimensional estimation method such as the SCAD can be applied to the reduced data. This process is called SIS+SCAD. Under some regularity conditions, they showed that the SIS has an exponentially small probability to omit true features and the SIS+SCAD retains the oracle properties if . By replacing the SCAD with adaptive Elastic-Net, the new procedure is refered to as SIS+AEnet [8], which holds the oracle properties if . Here we recommend to combine SIS with the WMF procedure when . We first use the SIS to reduce the dimensionality to , and then apply the WMF procedure to the reduced data. We call this procedure SIS+WMF.
Corollary 6**.**
Suppose conditions for Theorem 1 in [27] and Theorem 3 in this paper hold. Let . Then the SIS+WMF procedure is variable selection consistent.
Note that Corollary 6 is a direct conclusion of Theorem 1 in [27] and Theorem 3 in this paper.
9 Discussion
We proposed a prediction-weighted maximal frequency procedure to estimate the amount of regularization for adaptive LASSO and adaptive Elastic-Net. Asymptotic properties were studied with a diverging .
Central idea of the WMF procedure is the importance of conditioning on dimension, which mitigates overfitting. Underfitting can then be handled by using prediction-based weights estimated by multi-fold cross-validation. This simple recipe can also be applied to other regularization methods, say the SCAD and fused LASSO, making the WMF procedure a unified model selection criterion in regularization problems. However, asymptotic properties have yet to be studied, which will be a future topic.
Appendix A Proofs
Proof of Lemma 1.
Assume and . We have 4 cases for
[TABLE]
Let and . We have
[TABLE]
Consider case 1: and , .
Let be a positive constant. The point separates the domain of and into two parts: and . The cumulative probabilities of and in first part of the domain are respectively
[TABLE]
The probability can then be calculated from
[TABLE]
After some simple deductions, we get,
[TABLE]
If i.e. , from (A.1) we have
[TABLE]
However if i.e. ,
[TABLE]
Since , we have
[TABLE]
Note that two integrals in (A.2) have equal length of the integral intervals. Moreover the integral function is an monotonically decreasing function of for , and monotonically increasing for . Hence
[TABLE]
Combining (A.2) with (A.3), we get
[TABLE]
Other three cases can be proved in the same way. We avoid the repetitions here. ∎
Proof of Theorem 1.
By [8], enjoys the oracle properties under certain regularity conditions. And is a paired bootstrap analog of by replacing with in estimation. To simplify notations in the proof, we drop the subscript ‘’ in and .
By the KKT regularity conditions, is the unique solution of adaptive LASSO given if
[TABLE]
where is the th column of and
[TABLE]
Let and . We show that satisfies (A.4) with probability tending to 1, which is equivalent to prove
[TABLE]
where the first inequation implies .
Note that , where is an OLS or best ridge estimate of ,
[TABLE]
By Theorem 3.1 in [8],
[TABLE]
under assumption that . It is satisfied automatically for the OLS estimate.
Denote the th row of , and the element-wise product. We have
[TABLE]
Hence under conditions (A1) and (A5),
[TABLE]
Let , and . Under conditions (A1)–(A3) and (A5), the first inequation in (A.5) can be proved by
[TABLE]
where
[TABLE]
By (A.6), it has
[TABLE]
Similarly,
[TABLE]
By Theorem 3.1 in [8],
[TABLE]
For proof of the second inequation in (A.5), it suffices to show
[TABLE]
Since
[TABLE]
it follows that
[TABLE]
where , , is a constant.
For ,
[TABLE]
where indicates the size of . By (A.6), , , which indicates . Then under condition (A3), fulfills
[TABLE]
Also since
[TABLE]
we have for ,
[TABLE]
Hence (A.5) is proved. We have shown that and with probability tending to 1, where is the adaptive LASSO estimate using paired bootstrap data. Also it can be deduced from (A.7) that
. To sum up, we get .
We now prove , where is any -dimensional model, , and is a tuning parameter such that the adaptive LASSO estimator under is of dimension . Then , hence . If it also satisfies , we would have based on previous proof, which contradicts with the definition of . Therefore,
[TABLE]
To prove , by the KKT regularity conditions it suffices to show
[TABLE]
or equivalently
[TABLE]
Following previous proof, we get
[TABLE]
However,
[TABLE]
as . Similarly, . Then (A.8) holds. ∎
Lemma 2**.**
Suppose conditions (A1) and (A5) hold and in ridge estimates. Then,
[TABLE]
Proof.
Assume is a ridge estimate of ,
[TABLE]
By (A.6), E\|\boldsymbol{\hat{\beta}}-\boldsymbol{\beta}_{0}\|^{2}\leq O_{p}\big{(}{p_{n}\over n}\big{)}. Calculate centered residuals ,
[TABLE]
where each entry of , marked as , is the mean of . Denote an i.i.d bootstrap sample from the empirical distribution that puts mass on each entry of .
By definition, we have
[TABLE]
[TABLE]
and
[TABLE]
In above equation,
[TABLE]
Moreover, by the sum of squares inequality,
[TABLE]
Hence,
[TABLE]
Let
[TABLE]
where . We now prove asymptotically.
Note that
[TABLE]
with probability 1.
And by the sum of squares inequality,
[TABLE]
Then with probability 1. ∎
Proof of Theorem 2.
Let be a residual bootstrap sample, where and is the ridge estimator. Define
[TABLE]
where we dropped the subscript ‘ae’ in for simplicity.
Let
[TABLE]
we prove is the solution to (A.9) with probability tending to 1. By the KKT regularity conditions, this suffices to show
[TABLE]
or equivalently
[TABLE]
Note that where is the Elastic-Net estimator defined in (2.2). By Theorem 3.1 in [8],
[TABLE]
under condition (A4).
Let and . Then
[TABLE]
By (A.11) under condition (A4),
[TABLE]
Also by (A.11) , , which indicates . Hence
[TABLE]
Note that
[TABLE]
By (A.6),
[TABLE]
We now study . Let
[TABLE]
By using the same arguments for deriving (6.3) in [8], we can easily show
[TABLE]
On the other hand,
[TABLE]
by Lemma 2,
[TABLE]
By assembling (A.12)–(A.14), we get
[TABLE]
And
[TABLE]
Then under conditions (A1)–(A2) and (A4)–(A5),
[TABLE]
Hence (A.10) is proved. So far we have shown that with probability tending to 1, where is the adaptive Elastic-Net estimate using residual bootstrap data. To prove , we still need to show that .
Let and . By (A.6),
[TABLE]
Hence as where . Under condition (A4),
[TABLE]
which indicates as . To sum up, . Thus is proved.
We now prove , where is any -dimensional model, , and is a tuning parameter such that the adaptive Elastic-Net estimator under is of dimension . Then , hence . If it also satisfies , we would have based on previous proof, which contradicts with the definition of . Therefore,
[TABLE]
To prove , by the KKT regularity conditions it suffices to show
[TABLE]
or equivalently
[TABLE]
By following the same arguments for showing (A.10), we get
[TABLE]
∎
Lemma 3**.**
Suppose conditions (A1), (A5) and (A6) hold. Denote an overfit model including the true model, the adaptive Elastic-Net estimate from the multi-fold CV then satisfies
[TABLE]
where the adaptive LASSO estimate is a special case with .
Proof.
Here we provide a proof for the adaptive LASSO estimator. The adaptive Elastic-Net estimator can be proved by using the same arguments for deriving Theorem 3.1 in [8] and the strategies in below.
The adaptive LASSO estimator from the multi-fold CV is
[TABLE]
which satisfies
[TABLE]
Hence,
[TABLE]
The last equation holds because continuously decreases from to 0 as changes from the true model to full model. ∎
Proof of Theorem 3.
We integrate the proof for adaptive Elastic-Net and adaptive LASSO. Denote an overfit model including the true model. The is
[TABLE]
By Lemma 3, the second term in (A.15) satisfies
[TABLE]
Hence,
[TABLE]
The third term in (A.15) fulfills
[TABLE]
Hence,
[TABLE]
By substituting (A.16)–(A.17) to (A.15), we obtain
[TABLE]
Let and be two overfit models including the true model, then
[TABLE]
We now consider an underfit model . The is
[TABLE]
Let be an adaptive LASSO or adaptive Elastic-Net estimator under . The second term in (A.18) satisfies
[TABLE]
For the third term in (A.18),
[TABLE]
Hence,
[TABLE]
By substituting (A.19)–(A.20) to (A.18), we get
[TABLE]
If is an overfit model and is an underfit model, we have
[TABLE]
So the first part is proved. We then combine it with Corollaries 1–4. For any , ,
[TABLE]
And for any , ,
[TABLE]
Then model selection consistency of the WMF procedure can be deduced from (A.22)–(A.23). ∎
Appendix B Additional simulation results
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Jianqing Fan and Runze Li. Statistical challenges with high dimensionality: Feature selection in knowledge discovery. ar Xiv preprint math/0602133 , 2006.
- 2[2] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) , pages 267–288, 1996.
- 3[3] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics , 12(1):55–67, 1970.
- 4[4] Leo Breiman. Better subset regression using the nonnegative garrote. Technometrics , 37(4):373–384, 1995.
- 5[5] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association , 96(456):1348–1360, 2001.
- 6[6] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association , 101(476):1418–1429, 2006.
- 7[7] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 67(2):301–320, 2005.
- 8[8] Hui Zou and Hao Helen Zhang. On the adaptive elastic-net with a diverging number of parameters. Annals of statistics , 37(4):1733, 2009.
