Improving Lasso for model selection and prediction
Piotr Pokarowski, Wojciech Rejchel, Agnieszka Soltys, Michal Frej and, Jan Mielniczuk

TL;DR
This paper introduces a new, computationally efficient method to improve Lasso for model selection across various convex loss functions, demonstrating superior accuracy and theoretical consistency compared to existing approaches.
Contribution
The paper proposes a novel model selection algorithm based on ordering Lasso coefficients and using the Generalized Information Criterion, applicable to diverse models and free from unknown parameter reliance.
Findings
The method achieves consistent model selection under weak assumptions.
Numerical experiments show improved accuracy over concave regularizations.
The algorithm is implemented in the R package "DMRnet" and outperforms existing methods.
Abstract
It is known that the Thresholded Lasso (TL), SCAD or MCP correct intrinsic estimation bias of the Lasso. In this paper we propose an alternative method of improving the Lasso for predictive models with general convex loss functions which encompass normal linear models, logistic regression, quantile regression or support vector machines. For a given penalty we order the absolute values of the Lasso non-zero coefficients and then select the final model from a small nested family by the Generalized Information Criterion. We derive exponential upper bounds on the selection error of the method. These results confirm that, at least for normal linear models, our algorithm seems to be the benchmark for the theory of model selection as it is constructive, computationally efficient and leads to consistent model selection under weak assumptions. Constructivity of the algorithm means that, in…
| n | p | SNR | ||||
|---|---|---|---|---|---|---|
| N.1.5 | 100 | 3000 | .5 | 4 | 2.3 | |
| N.1.7 | 100 | 3000 | .7 | 4 | 2.6 | |
| N.1.9 | 100 | 3000 | .9 | 4 | 3 | |
| N.2.5 | 200 | 2000 | .5 | 7 | 2.4 | |
| N.2.7 | 200 | 2000 | .7 | 7 | 2.3 | |
| N.2.9 | 200 | 2000 | .9 | 7 | 2.2 |
| n | p | |||
|---|---|---|---|---|
| B.1.5 | 300 | 3000 | .5 | |
| B.1.7 | 300 | 3000 | .7 | |
| B.1.9 | 300 | 3000 | .9 | |
| B.2.5 | 500 | 2000 | .5 | |
| B.2.7 | 500 | 2000 | .7 | |
| B.2.9 | 500 | 2000 | .9 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Improving Lasso for Model Selection and Prediction
Piotr Pokarowski 111Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, [email protected]. The research was supported by the Polish National Science Center grant no. 2015/17/B/ST6/01878. ,Wojciech Rejchel 222Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100, Toruń, Poland, [email protected]. The research was supported by the Polish National Science Center grant no. 2015/17/B/ST6/01878 , Agnieszka Sołtys 333Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, [email protected],
Michał Frej444Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, [email protected], Jan Mielniczuk 555Institute of Computer Sciences, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland, [email protected]
Abstract
It is known that the Thresholded Lasso (TL), SCAD or MCP correct intrinsic estimation bias of the Lasso. In this paper we propose an alternative method of improving the Lasso for predictive models with general convex loss functions which encompass normal linear models, logistic regression, quantile regression or support vector machines. For a given penalty we order the absolute values of the Lasso non-zero coefficients and then select the final model from a small nested family by the Generalized Information Criterion. We derive exponential upper bounds on the selection error of the method. These results confirm that, at least for normal linear models, our algorithm seems to be the benchmark for the theory of model selection as it is constructive, computationally efficient and leads to consistent model selection under weak assumptions. Constructivity of the algorithm means that, in contrast to the TL, SCAD or MCP, consistent selection does not rely on the unknown parameters as the cone invertibility factor. Instead, our algorithm only needs the sample size, the number of predictors and an upper bound on the noise parameter. We show in numerical experiments on synthetic and real-world data sets that an implementation of our algorithm is more accurate than implementations of studied concave regularizations. Our procedure is contained in the R package ,,DMRnet” and available on the CRAN repository.
List of keywords: convex loss function, empirical process, generalized information criterion, high-dimensional regression, penalized estimation, selection consistency
1 Introduction
Sparse high-dimensional predictive models, where the number of true predictors is significantly smaller than the sample size and the number of all predictors greatly exceeds have been a focus of research in statistical machine learning in recent years. The Lasso algorithm, that is the minimum loss method regularized by sparsity inducing the penalty, is the main tool of fitting such models [27, 4]. However, it has been shown that the model selected by the Lasso is usually too large and that for asymptotically consistent model selection it requires the irrepresentable condition on an experimental matrix [22, 39, 25] which is too restrictive in general. The model’s dimension can be reduced without loss of the quality using the Thresholded Lasso (TL) algorithm, which selects variables with largest absolute values of the Lasso coefficients [34, 40] or by solving more computationally demanding minimization of a loss with a folded concave penalty (FCP) as SCAD [6], MCP [35] or capped -penalty [37, 25]. TL, FCP and similar methods lead to consistent selection under weaker assumptions such as the restricted isometry property [35, 37, 36, 25, 32, 7, 33]. In [23] one introduced an algorithm called Screening–Ordering–Selection (SOS) for linear model selection, which reduces successfully the model selected by the Lasso. SOS is based on the variant of TL proposed by [40] and leads to consistent model selection under assumptions similar to the restricted isometry property.
In the paper we consider two algorithms that improve the Lasso, i.e. they are model selection consistent under weaker assumptions than the Lasso. Moreover, the considered procedures are computationally simpler than FCP methods that use non-convex penalties. The first algorithm is the well-known Thresholded Lasso. Its model selection consistency in the normal linear model is proven in [34, Theorem 8] provided that conditions, which seem to be ,,minimal”, are satisfied. Our first contribution is an extension of this result (given in Theorem 1) to Generalized Linear Models (GLM). However, the TL algorithm is not constructive, because it is not known how to choose the threshold. Therefore, in the current paper we propose the second improvement of the Lasso which is the Screening–Selection (SS) algorithm. It is a two-step procedure: in the first step (screening) one computes the Lasso estimator with penalty and orders its nonzero coefficients according to their decreasing absolute values. In the second step (selection) one chooses the model which minimizes the Generalized Information Criterion (GIC) with penalty in a nested family induced by the ordering. Thus, the SS algorithm (Algorithm 1 below) can be viewed as the Lasso with adaptive thresholding based on GIC. We prove that this procedure is model selection consistent in a wide class of models containing linear models with the subgaussian noise (Theorem 3), GLM (Theorem 2) and models with convex (possibly nondifferentiable) loss functions (Theorem 4) as in quantile regression or support vector machines. The obtained results are exponential upper bounds on the selection error of SS in terms of , which parallel the known bounds for TL [34, Theorem 8] or FCP [7, Corollary 3 and 5]. For GLM our results are obtained on the basis of exponential inequalities for subgaussian random variables. In the case of predictive models with general convex loss functions we use methods from the empirical process theory.
The SS algorithm is a simplification and a generalization of the Screening–Ordering–Selection (SOS) algorithm from [23] that was proposed only for normal linear models. We show that the ordering step in SOS can be done using separability of the Lasso instead of using -statistics. Separability means that, under mild assumptions, coordinates of the Lasso corresponding to relevant predictors are larger in absolute values than those corresponding to irrelevant predictors. Moreover, we establish that the SS algorithm is model selection consistent beyond normal linear models. The new procedure can be applied to the various statistical predictive models including models with quantitative or qualitative response variables. We can use ”classical” models (the normal linear model, the logistic model) as well as modern problems involving piecewise-linear loss functions (as in quantile regression or support vector machines).
Our results state that, in contrast to TL and FCP methods, SS is constructive in the linear model with the subgaussian noise, because it does not rely on the unknown parameters as the true vector or the cone invertibility factors. Indeed, we will establish in Section 2 that the choice of the tuning parameter as
[TABLE]
leads to model selection consistency of the SS algorithm. Therefore, only depends on , and that is an upper bound on the noise parameter. The assumption that is known is common in the literature investigating theoretical properties of variable selection procedures [34, 4, 7].
Moreover, we will also prove that model selection consistency of the SS algorithm holds under weaker conditions than for competitive procedures. For instance, the algorithm of [32] requires that
[TABLE]
where is a scaled Kullback-Leibler distance between the true set and its submodels (defined in (15)) and is the minimal signal strength. However, for the SS procedure we need that
[TABLE]
which, among others, weakens significantly the beta-min condition. More detailed description of the SS algorithm and comparisons to other procedures is given in Section 2. This analysis enables us to claim that for subgaussian linear models the SS algorithm seems to be the benchmark for the theory of model selection as it is constructive, computationally efficient and leads to consistent model selection under weak assumptions. Similarly to TL or FCP, the SS algorithm becomes non-constructive, if we go beyond the linear model with the subgaussian noise. Notice that in (1), (2) and (3) we assume, similarly to [32], that predictors are scaled to have the -norm equal to However, in the rest of the paper we will use more convenient notation that the -norm of predictors is one. Therefore, the expression will not appear in analogs of (1), (2) and (3) in Section 2.
Although TL, FCP or SS algorithms use the Lasso estimators only for one value of the penalty, which is convenient for theoretical analysis, the practical Lasso implementations return coefficient estimators for all possible penalty values as in the R package LARS described in Efron et al., [5] or for a given net of them as in the R package glmnet described in Friedman et al., [10]. Similarly, using a net of penalty values, the FCP algorithm has been implemented for linear models in the R package SparseNet [21]) and for logistic models in the R package ncvreg [3]. Our contribution is also the SSnet algorithm (Algorithm 2 below), which is a practical version of the SS algorithm. SSnet uses glmnet to calculate the Lasso for a net of penalty values, then again it selects the final model from a small family by minimizing GIC. In numerical experiments we investigate properties of SSnet in model selection as well as prediction and compare them to competitive procedures. Using synthetic and real-world data sets we show that SSnet is more accurate than implementations of FCP. The variant of SSnet is contained in the R package ,,DMRnet” and available on the CRAN repository.
GIC is a popular procedure in choosing the final model in variable selection. In the literature there are many papers investigating its properties, for instance [18, 32] in linear models, [8, 15] in GLM, [16] in multivariate linear regression, [38] for SVM and [17] for general convex loss functions. GIC is often applied to pathwise algorithms under a common assumption that the true model is on this path. This condition excludes Lasso-based pathwise algorithms as it is only fulfilled under restrictive assumptions. One can overcome this problem using a three-step procedures, for instance the Lasso and non-convex penalized regression [32] or the Lasso and thresholding [17]. However, it makes the algorithm computationally more complex or one has to find a threshold that recovers the true model on the path, respectively. In contrast, the first step of the proposed procedure is related only to the Lasso. Indeed, we need only that the model with correctly separated predictors is on the path. This is guaranteed for the Lasso under mild assumptions. Therefore, it makes our procedure simpler and computationally more efficient.
The paper is organized as follows: in the next section we describe subgaussian GLM and algorithms that we work with. Moreover, we establish bounds for the selection error of the proposed procedures. These results are extended to models with general convex contrasts in Section 3. In Section 4 we investigate properties of estimators on simulated and real data sets. The paper is concluded in Section 5. All proofs and auxiliary results are relegated to the appendix.
2 Subgaussian Generalized Linear Models
In this section we start with definitions of considered models and estimation criteria. Next, we present model selection algorithms and state exponential upper bounds on their selection errors.
2.1 Models
The way we model data will encompass normal linear and logistic models as premier examples. Our assumptions are stated in their most general form which allows proving exponential bounds for probability of the selection error without obscuring their essentiality. In the paper we consider independent data , where for . We assume that for some true and a known differentiable cumulant function
[TABLE]
where denotes the derivative of Note that (4) is satisfied in particular by the Generalized Linear Models (GLM) with a canonical link function. and a nonlinear regression with an additive error. Let . For we define and similarly .
Let be a matrix of experiment and be an arbitrary subset of the full model , . As may be viewed as a sequence of zeros and ones on , denotes cardinality of . Let be a subvector of with elements having indices in , be a submatrix of with columns having indices in and denotes the rank of . Moreover, let be an orthogonal projection matrix onto the subspace spanned by columns of . Linear model pertaining to predictors being columns of will be frequently identified as . In particular, let denotes a true model that is and . Next, for and let be the norm. The only exception is the norm, for which we will use the special notation
In the further argumentation important roles are played by
[TABLE]
for They describe how much the true value differs from its projections on submodels of true set For the normal linear model they are related to the Kullback-Leibler divergence between two normal densities [23, Section 3]. Finally, we define the sum of balls
[TABLE]
We assume that which implies that (6) consists of sparse vectors.
We assume also that a total cumulant function
[TABLE]
is convex and, additionally, strongly convex at in a sense that there exists such that for all we have
[TABLE]
We note that this crucial property of the total cumulant is slightly weaker than an usual definition of strong convexity which would have a second derivative of at in place of . Let us remark that .
Moreover, we assume that centred responses have a subgaussian distribution with the same constant , that is for and we have
[TABLE]
Examples. Two most important representatives of GLM are the normal linear model and logistic regression. In the normal linear model
[TABLE]
where the noise variables are independent and normally distributed Therefore, assumptions (4), (8) and (9) are satisfied with and any respectively. In logistic regression the response variable is dichotomous and we assume that
[TABLE]
In this model assumptions (4) and (8) are satisfied with and
[TABLE]
respectively. Finally, as are bounded random variables, then (9) is satisfied with any .
2.2 Fitting Algorithms
For estimation of we consider a loss function
[TABLE]
where . It is easy to see that , and consequently for . Moreover, observe that and (8) is equivalent to strong convexity of at for all
[TABLE]
Let denotes a minimum loss estimator based on and . Denote . Note that for GLM estimator coincides with a maximum likelihood estimator for a model pertaining to .
In Algorithm 1 we present two selection procedures: the first one is the well-known Thresholded Lasso (TL) method which consists of retaining only these variables for which absolute values of their Lasso estimators exceed a certain threshold . The second one is novel and named the Screening–Selection (SS) procedure. It finds the minimal value of the Generalized Information Criterion (GIC) for the nested family which is constructed using ordering of the nonzero Lasso coefficients.
In the classical (low-dimensional) case model selection is often based on the Bayesian information criterion that is similar to the second step of Algorithm 1 with Model selection properties of this procedure are proven in [24]. In the current paper we validate that information criteria are also useful in the high-dimensional case. As it will be shown the important difference is that in the high-dimensional scenario parameter should be proportional to instead of (at least for linear models with the subgaussian noise).
2.3 A Selection Error Bound for TL
In order to make the exposition simpler we assume that columns of are normalized in such a way that for . Moreover, let .
First we generalize a characteristic of linear models which quantifies the degree of separation between the true model and other models, which was introduced in Ye and Zhang [34]. For consider a signed pseudo-cone
[TABLE]
For and let a Sign-Restricted Pseudo-Cone Invertibility Factor (SCIF) be defined as
[TABLE]
Notice that in the linear model the numerator of (13) is simply In the case one usually uses the minimal eigenvalue of the matrix to express the strength of correlations between predictors. Obviously, in the high-dimensional scenario this value is zero. Therefore, SCIF can be viewed as an useful analog of the minimal eigenvalue for the case
We let . In comparison to more popular restricted eigenvalues [2] or compatibility constants [29], variants of SCIF enable sharper estimation error bounds of the Lasso for [34, 14, 36].
The following lemma is a main tool in proving model selection consistency of the TL algorithm. For the linear model it was stated in [34, Theorem 3]. We generalize it to GLM.
Lemma 1**.**
If is convex and , then on we have
[TABLE]
.
Next, we bound the selection error of the TL algorithm in GLM.
Theorem 1**.**
If is convex, are subgaussian with and for numbers we have
[TABLE]
then
[TABLE]
Constant is used to remove multiplicative factor from the exponential bound at the expense of slightly diminishing the exponent in (14). Note that assumptions of Theorem 1 stipulate that the truncation level is contained in the interval , whose both endpoints are unknown, because and are unknown. Therefore, in practice cannot be chosen that makes the TL algorithm non-constructive. Analogous theorems for FCP for linear models and logistic regression can be found in Fan et al. [7, Corollary 3 and Corollary 5 ]. However, they require an additional assumption on the minimal eigenvalue of and the proof is more difficult. Moreover, the choice of the tuning parameters in these methods also requires unknown or .
2.4 A Selection Error Bound for SS
A scaled Kullback-Leibler (K-L) distance between and its submodels is given in Shen et al., [25, 26] as
[TABLE]
or just if we use notation (5). Different variants of the K–L distance have been often used in the consistency analysis of selection algorithms [23, Section 3.1], but defined in (15) seems to lead to optimal results [26, Theorem 1]. Now we introduce technical constants that allow to avoid ad-hoc coefficients in the main results and simplify asymptotic considerations. For given define , and . Note that and are functions of and obviously if , then . Moreover, for two real numbers we denote and
Theorem 2**.**
Assume (4), (8), (9) and that for we have
[TABLE]
Then
[TABLE]
Selection consistency, that is asymptotic correctness of now easily follows.
Corollary 1**.**
Assume that for Moreover, set and . Then . If additionally is asymptotically identifiable, i.e.
[TABLE]
Remark 1**.**
Theorem 2 determines conditions on GLM and the SS algorithm for which the bound (17) on the selection error of SS holds. Corollary 1 states the easy interpretable result: if the true model is asymptotically identifiable, then SS with minimal admissible is asymptotically consistent. Although the identifiability condition is not effectively verifiable, can be explicitly given for linear models as
[TABLE]
and for logistic models as
[TABLE]
since . Let us consider subgaussian linear models and assume that is known, which is a common condition in the literature investigating theoretical properties of variable selection procedures [34, 4, 7]. Then the parameter of SS is given constructively provided that . In contrast, TL or FCP are not constructive, because they require an additional parameter , that depends on unknown identifiability constants as SCIF.
In the literature concerning the Lasso and its modifications the smallest possible is taken as the default value, because it makes the algorithm asymptotically consistent for the largest class of models (the same approach is adopted for prediction and estimation). Such will be called the safest choice. It is interesting that for linear models GIC with given by (18) was originally derived from the minimax perspective by Foster and George [9]. They called such selection the risk inflation criterion (RIC), because it asymptotically minimises the maximum predictive risk with respect to the oracle for the orthogonal matrix of experiment .
Remark 2**.**
A generic combination of the penalized log-likelihood (as TL or FCP) with GIC is considered in [8]. In the first step the method computes a path of models indexed by and next GIC is used to choose the final model. They assume that the true model has to be on this path and use GIC with the penalty asymptotically larger than . Thus, their results are weaker and need more restrictive assumptions, which are given in [8, Section 6]. For instance, if the cumulant function defined in (4), has uniformly bounded second derivative, then we do not require its third derivative in contrast to [8]. Moreover, using Corollary 1 and assuming that the number is constant as in [8], the last step of our algorithm uses GIC with the safest choice instead of as in [8] for some with . It is worth to note that their results are obtained using the empirical processes theory, while the proof of Theorem 2 is based on elementary exponential inequalities for subgaussian variables given in subsection 2.6.
2.5 A Selection Error Bound for SS in Subgaussian Linear Models
In this part of the paper we show that SS is constructive for the linear model with the subgaussian noise. The main difference between Theorem 2 and the following Theorem 3 is that the lower bound on in Theorem 3 does not depend on the dimension of the true model and the parameter (because for subgaussian linear models).
Theorem 3**.**
Consider the linear model with the subgaussian noise. Assume that there exists such that
[TABLE]
Then
[TABLE]
The safest choice of in Theorem 3 does not depend on unknown expressions, which justifies the claim that our algorithm is constructive in the linear model with the subgaussian noise. Next, we compare the above result to Wang et al., [32] and Fan et al., [7].
Remark 3**.**
The algorithm in Wang et al., [32] has three steps: the Lasso, non-convex penalized linear regression and GIC with the parameter where with . Obviously, the SS algorithm is computationally faster, because it does not need the most time-consuming second step. Moreover, the first two steps of their algorithm form a variant of FCP, so their parameters are not given constructively as we explain in the discussion after Theorem 1. Finally, their assumptions are stronger than ours. Indeed, conditions in Wang et al., [32, Theorem 3.2, Theorem 3.5] lead to and where is defined in (15). From these two facts and Wang et al., [32, expression (3.4)] we obtain
[TABLE]
If we fix and as in Wang et al., [32, expression (3.4)], then in Theorem 3 we require only that
[TABLE]
In Wang et al., [32, Theorem 3.6] the second condition in (21) is weakened at the price of stronger conditions on the the design matrix. However, their assumptions are still more restrictive assumptions than ours.
Remark 4**.**
Fan et al., [7, Subsection 3.1] consider an algorithm based on non-convex penalization in linear models. Their procedure is model selection consistent, but is also not constructive. Indeed, in Fan et al., [7, Corollary 3] the parameter has to be in the interval, whose both endpoints depends on unknown that are analogs of in the current paper. In Fan et al., [7, Remark 5] it is shown that can be avoided, but their algorithm is still not constructive in contrast to the SS algorithm.
2.6 Exponential Bounds for Subgaussian Vectors
This part of the paper is devoted to exponential inequalities for subgaussian random vectors. They are interesting by themselves and can be used in different problems than we consider. In the current paper they are main probabilistic tools that are needed to prove Theorems 1-3. Specifically, in lemma 2 (iii) we generalize the Wallace inequality for distribution [31] to the subgaussian case using the inequality for the moment generating function in lemma 2 (ii). The last inequality is proved by the decoupling technique as in the proof of Theorem 2.1 in [13].
Lemma 2**.**
Let be a vector of zero-mean independent errors having subgaussian distribution with a constant , , and be an orthogonal projection such that . Then
- (i)
for
[TABLE]
- (ii)
[TABLE]
- (iii)
for
[TABLE]
3 Extension to General Convex Contrasts
In this part of the paper we investigate properties of the SS algorithm beyond GLM as well. The main assumption, that will be required, is convexity of the ,,contrast” function. We show that the SS algorithm is very flexible procedure that can be applied succesfully to the various spectrum of practical problems.
First, for and a contrast function we define a loss function
[TABLE]
Considering the normal linear model one usually uses the quadratic contrast
[TABLE]
as we have done in Section 2. However, it is well known that the quadratic contrast is very sensitive to the distribution of errors and does not work well, if this distribution is heavy-tailed and outliers appear. To overcome this difficulty we can use the absolute contrast
[TABLE]
Next, working with dichotomous we can apply logistic regression that belongs to GLM and has been considered in Section 2. In this case we have
[TABLE]
But there are also very popular and efficient algorithms called support vector machines (SVM) that use, for instance, the following quadratic hinge contrast
[TABLE]
Our main assumption is that the contrast function is convex with respect to All examples given above satisfy this property. Notice that they need not be differentiable nor decompose, as in (10) for GLM, into the sum of the nonrandom cumulant and the random linear term . The SS algorithm for convex contrasts is the same as in Algorithm 1.
We add few definitions and notations to those in the previous parts of the paper. We start with defining two balls: the first one is the -ball with radius The second one is the -ball with radius where is a (sparse) subset of such that Recall that is, as previously, a minimizer of Besides, let where is defined in (5). In further argumentation key roles are played by:
[TABLE]
and
[TABLE]
which are empirical processes over and -balls, correspondingly. We need also the compatibility factor that is borrowed from [4] and is an analog of SCIF defined in (13). Namely, for arbitrary a compatibility factor is
[TABLE]
where is a simplified version of (12), namely
[TABLE]
Convexity of the contrast function is the main assumption in this section. However, similarly to the previous section we need also the following strong convexity of at : there exists (, respectively) such that for each ( defined in (6), respectively) we have for
[TABLE]
Notice that in (25) we require the expected loss not the loss to be strongly convex. Therefore, the condition (25) can be satisfied easily even if the contrast function is not differentiable, for instance for absolute or quadratic hinge contrasts (see Remark 6). For GLM in section 2 the condition (11) is equivalent to (25) for that will be explained in Remark 7.
To prove exponential bounds for GLM in subsection 2.4 we use subgaussianity that allows us to obtain probabilistic inequalities in Lemma 2. In this section we need the analog of (22) of the form: there exists and constants such that for each and we have
[TABLE]
Besides, the inequality (23) is replaced by the following: there exists and constants such that for each and such that we have
[TABLE]
The detailed comparison between assumptions and results for models in this section and those for GLM in Theorem 2 is given in Remarks 7 and 8 after the main result of this section, which is now stated.
Theorem 4**.**
Fix and let be universal constants. Assume that (25), (26), (27) and
[TABLE]
Then
[TABLE]
Theorem 4 bounds exponentially the selection error of the SS algorithm. It extends Theorem 2 to the wide class of convex contrast functions. In particular, these contrasts can be nondifferentiable as in quantile regression or SVM. In Remarks 5 and 6 we discuss assumptions (26), (27) and (25) of Theorem 4, respectively. The detail comparison to Theorem 2 is given in Remarks 7 and 8. There we argue that Theorem 4 applied to GLM is only slightly worse than Theorem 2, which is devoted to GLM.
Remark 5**.**
The important assumptions of Theorem 4 are conditions (26) and (27). They can be proved using tools from the empirical process theory such that concentration inequalities [20], the Symmetrization Lemma [30, Lemma 2.3.1] and the Contraction Lemma [19, Theorem 4.12]. It is quite remarkable that to get (26) or (27) we need only one new condition. Namely, we need that the contrast function is Lipschitz in the following sense: there exists such that for all and
[TABLE]
Indeed, (26) with follows from the above-mentioned tools and can be established as in [4, Lemma 14.20] combined with [20, Theorem 9]. On the other hand, to get (27) with we need (31) to be satisfied for all and This fact can be obtained as in [4, Lemma 14.19] combined again with [20, Theorem 9]. Notice that logistic and absolute contrast functions satisfy (31) with and respectively. The property (31) is also satisfied for the quadratic hinge contrast, but in this case depends on .
Remark 6**.**
The condition (25) is often called the ”margin condition” in the literature. For quadratic and logistic contrasts we have considered it in the previous section. To prove it for SVM with the quadratic hinge contrast one can use methods based on the modulus of convexity [1, Lemma 7]. For linear models with the absolute contrast it can be established analogously to [17, Lemma 3], if densities of noise variables are lower-bounded in a neighbourhood of the origin.
Remark 7**.**
*Similarities to Theorem 2. *
We compare Theorem 2 to Theorem 4 applied to GLM. We can calculate that for quadratic and logistic contrasts we have
[TABLE]
and
[TABLE]
where is a total cumulant function (7). Therefore, the condition (25) for -balls is the same as (11). Besides, we have for that
[TABLE]
In Theorem 2 we supppose that are independent and subgaussian. These assumptions are used to establish (22) and (23), that are crucial in the proof of Theorem 2. Notice that (32) implies that for GLM we have and Therefore, assumptions (26) and (27) are analogs of (22) and (23), respectively. Moreover, the condition (28) and the result in (30) differ only in constants from their counteparts in Theorem 2.
Remark 8**.**
*Differences from Theorem 2. *
The SS algorithm consists of two steps. In the last paragraph we have clarified that the theoretical analysis of the second step (selection) in Theorem 2 is not significantly simpler than for general models with convex contrasts. However, we can find differences while investigating the first step (screening based on the lasso). Working with GLM we exploit differentiability of contrasts and the useful decomposition (10). Due to that the right-hand side of (16) is usually better than (29), because in Theorem 4 we have to assume (25) also with respect to -balls and appears in (29).
4 Experiments
While convenient for theoretical analysis TL, FCP or SS algorithms use the Lasso estimators only for one value of the penalty, the practical Lasso implementations return coefficient estimators for a given net of it as in the R package glmnet described in [10]. Similarly, using a net of penalty values, the Minimax Concave Penalty (MCP) algorithm, a popular realization of FCP, has been implemented for linear and logistic models in the R packages SparseNet [21] and ncvreg [3]), respectively. Thus, we propose a net modification of the SS algorithm, which we call SSnet and state in Algorithm 2. In the first step this procedure calculates the Lasso for a net of values of an input grid: . Then the final model is chosen using GIC in a similar way to the SS algorithm.
We remind that for linear models depends on the noise variance, which is unknown in practice. Estimation of is quite difficult, especially in the high-dimensional scenario. Computing the Lasso for the whole net of tuning parameters is a simple way to overcome this problem. Obviously, in the second step of our procedure we use GIC, so we require an estimator of But in this step we work with models, whose dimensionality is significantly reduced by the Lasso. Therefore, the classical estimator of , which uses residual sum of squares, can be applied.
We performed numerical experiments fitting sparse linear and logistic models to high-dimensional benchmark simulations and real data sets. We investigated properties in model selection and prediction of the SSnet algorithm and its competitor, which was a net version of MCP. The fair comparison of these two net algorithms is difficult, because their input grids depend on these algorithm themselves and their runs. Therefore, we decide to introduce an output grid, which is the same for both algorithms in Algorithm 2. We compare the algorithms using graphs, which describe the interplay between a prediction error (PE) and a model dimension (MD). In these graphs we show the averaged PE and MD over simulation runs (data sets) for distinct values from the output grid. In particular, we want to find the minimum PE and the optimal penalty. Basing only on input grids, it would be more difficult to interpret such averaging as well as to locate the PE minimum. Finally, using such plots makes our procedure more user-friendly. Namely, observing the relation between dimensionality of a model and its prediction accuracy a user can decide, which estimator is best suited for both: describing the phenomena and his/her expectations.
4.1 Simulated Data
For linear models we studied the performance of two algorithms: SSnet and MCP computed using the R package SparseNet [21] for the default 9 values of and 50 values of . Our algorithm used the R package glmnet [10] to compute the Lasso estimators for 50 lambdas on a log scale.
We generated samples , from the normal linear model. Two vectors of parameters were considered: as in [32] as well as , where equals 1 or -1 with equal probability, chosen separately for every run as in experiment 2 in [33]. The rows of were iid -dimensional vectors . We considered auto-regressive structure of covariance matrix that is \Xi=\big{(}\rho^{|i-j|}\big{)}_{i,j=1}^{p} for . The columns of were centred and normalized so that and . The plan of experiments is presented in Table 1 with SNR meaning a Signal to Noise Ratio.
For every experiment the results were based on simulation runs.
The output grid was chosen as where We reported the mean model dimension (MD) that is and the mean squared prediction error (PE) on new data set with 1000 observations equalling , where is the post-selection OLS estimator. Values of for the models chosen by GIC=GIC() were calculated and averaged over simulations.
The results are presented in two first columns of Figure 1. The two vertical lines indicate models chosen using GIC with : the black one for SSnet and the red one for SparseNet. The blue vertical line denotes the true model dimension.
For logistic models we compared the performance of two algorithms: SSnet and MCP implemented in the R package ncvreg for the default value of and 100 values of . As for linear models, SSnet called the R package glmnet [10] to compute the Lasso estimators for 20 lambdas on a default log scale. We performed experiments very similar to those for linear models, changing only and the number of simulation runs to . The plan of experiments is shown in the Table 2. Random samples were generated according to the binomial distribution. We reported prediction error defined as misclassification frequency on new data set with 1000 observations. The results organized in a similar way as for the linear models are shown in columns 3–4 of Figure 1. The two vertical lines indicate models chosen using GIC with , the black one for SSnet and the red one for ncvreg.
Summarizing the results of the simulation study, one can observe that SSnet for linear models turned out to have equal or lower PE in almost all of the experimental setups. The differences are most visible in setups with autocorrelation structure with . The value in GIC usually gave satisfactory results. The mean execution time of SSnet was approximately 3 times faster than for SparseNet. SSnet for logistic regression gave higher accuracy as ncvreg, but execution time of SSnet was approximately 10 times longer than ncvreg. The value in GIC usually gave satisfactory results.
4.2 Real Data Sets
The methylation data set was described in [12]. It consist of the age of 656 human individuals together with values of phenotypic features such as gender and body mass index and of genetic features, which are methylation states of 485 577 CpG markers. Methylation was recorded as a fraction representing the frequency of methylation of a given CpG marker across the population of blood cells taken from a single individual. In our comparison we used only genetic features from which we extracted 193 870 most relevant CpGs according to onefold t-tests with Benjamini-Hochberg adjustment, FDR=.05. We compared the root mean squared errors (PE) and model dimensions (MD) for SSnet and SparseNet. To calculate PE we used 10-fold cross-validation. For each value of hyperparameter values of for the models chosen by GIC=GIC() were calculated and averaged over 10 folds. The results are presented in Figure 2. SparseNet yields a path of models for each value of parameter . We present results for , corresponding to the Lasso, for , close to the best subset, and for an intermediate value in Figure 2. Remarkably, SSnet gives uniformly smaller PE than SparseNet for all . The two vertical lines indicate models chosen using GIC with : the black one for SSnet and the red one for SparseNet.
A logistic model was fitted to the breast cancer data described in [11] which concerns small, invasive carcinomas without axillary lymph node involvement to predict metastasis of small node-negative breast carcinoma. There were 168 patients: 111 with no event after diagnosis labeled as good, and 57 with early metastasis labeled as poor. The number of predictors in this data was 2905. We compared the mean errors of binary prediction (PE) and model dimensions (MD) for SSnet and ncvreg. For each of 80 hyperparameters values of for the models chosen by were calculated and averaged over 10 folds. Again to calculate PE we used 10-fold cross-validation. The results are presented in Figure 3. It is hard to find the minimal of PE for ncvreg. If we increased the net of , maybe we would obtain a smaller PE for ncvreg, but the model would be significantly larger than for SSnet. The algorithms work comparably, but again SSnet was 3 times longer. The two vertical lines indicate models chosen using GIC with : the black one for SSnet and the red one for ncvreg.
5 Conclusions
In the paper we propose the SS algorithm which is an alternative method to TL and FCP of improving the Lasso. Our approach encompasses fundamental models for prediction of continuous as well as of binary responses and the main results are stated jointly for both of them. Its assumptions are stated in the most general form which allows proving exponential bound without obscuring the essence of the results and comparing the bounds for both models. By simplifying SOS to SS we were able to simplify reasoning used for SOS and then extend them from normal linear models to general predictive models.
We propose an algorithm SSnet, which is a generalization of the SS algorithm for general predictive models. Using a net of parameters, SSnet avoids the problem of choosing one specific . The gap between theoretical results for SS and the SSnet algorithm is similar to the difference between theory for FCP and it implementations SparseNet or ncvreg. Numerical experiments reveal that for normal linear models SSnet is more accurate than SparseNet and three times faster, whereas for logistic models performance of SSnet is also better than the performance of ncvreg with computing times 3-10 times longer.
We have shown in simulations (dotted vertical lines in Figure 1) that predictively optimal for normal linear models equals approximately , which is close to (18) and for logistic models is , which together with (19) suggests that . The relations between the safest choice discussed in Remark 1 and predictively optimal are important applications of our theory.
In the package ,,DMRnet” one can find the SOSnet algorithm, which is a net version of SOS from [23]. SOSnet can be viewed as SSnet with additional step (,,Ordering”) based on refitting estimators and calculating Wald statistics. Results of our numerical experiments, which are not presented here, suggest that SOSnet can improve the quality of SSnet, especially in the linear model.
Appendix A Proofs and auxiliary results
In the following subsections we present proofs of the results stated in the paper.
A.1 Proof of Lemma 2
Let and . From Markov’s inequality we obtain
[TABLE]
Minimizing the last expression w.r.t. gives part (i).
Let be a vector of iid standard normal errors independent of . We have
[TABLE]
Thus part (ii) follows from a known formula for the moment generating function of the distribution.
From Markov’s inequality and the part (ii) of this lemma we have
[TABLE]
Thus, minimizing the last expression w.r.t we obtain part (iii). ∎
A.2 Proof of Lemma 1
Let
[TABLE]
and . We have and from the Karush-Kuhn-Tucker (KKT) theorem we obtain equations
[TABLE]
Let and be such that . We have for and consequently
[TABLE]
Then letting for we have . Moreover, for we have from convexity of that
[TABLE]
Indeed, is the symmetrized Bregman divergence [14]. Hence . Thus, on , and from the definition of we obtain using KKT again
[TABLE]
∎
A.3 Proof of Theorem 1
First, we will prove that for defined in (33). From Lemma 1 and assumptions we have on
[TABLE]
where we recall that Thus using (37) twice we have for and
[TABLE]
and it follows that . Morover, the assumptions of this theorem imply
[TABLE]
Hence, using Lemma 2 (i) we easily obtain
[TABLE]
∎
A.4 Proof of Theorem 2
Let us observe that the consecutive steps of the SS algorithm constitute a decomposition of the selection error into two parts:
[TABLE]
Therefore, Theorem 2 follows easily from (40) and (44) below.
Having in mind that for given we let , and , by arguments similar to those in the proof of Theorem 1 we obtain . Moreover, assumptions and imply
[TABLE]
As a result
[TABLE]
Now we bound \mathbb{P}\big{(}T\in{\cal J},\hat{T}\neq T\big{)}. In Lemma 4 and 5, given below, we bound probability that in the second step of the SS algorithm GIC chooses a subset of the true set, i.e.
[TABLE]
or a superset of i.e.
[TABLE]
These lemmas state that both components of the selection error set are included in the critical sets of the following form
[TABLE]
or
[TABLE]
where is such that We consider only supersets that because GIC corresponding to the superset such that is larger than GIC corresponding to a superset such that and
Let us define and . Under our assumptions we have . Let for . Of course is increasing, and for . Consequently , which means that
[TABLE]
From Lemma 4, Lemma 5, Lemma 2 (iii) and (43) we get
[TABLE]
Using for and the fact that probability is not greater than we obtain
[TABLE]
For , assumption implies , therefore
[TABLE]
that finishes the proof of Theorem 2. ∎
Before we state Lemma 4 and 5 we introduce few notations. For and in (5) we define
[TABLE]
First, we prove the following lemma which will be used in the proof of Lemma 4.
Lemma 3**.**
For we have
[TABLE]
Proof.
For from assumption (11), Schwartz inequality and properties of the orthogonal projection we get
[TABLE]
Since the last expression does not depend on , we have for
[TABLE]
Let us notice that for such that we have , so . Since is convex and we obtain . ∎
Lemma 4**.**
If for we have \lambda^{2}<c\delta\big{/}\big{(}1+\sqrt{2a}\big{)}^{2}, then
Proof.
From assumption of this lemma , so . Hence for
[TABLE]
Moreover, if then
[TABLE]
because for . The last inequality is true, if
[TABLE]
Indeed, it is easy to check that (48) follows from the assumption as
[TABLE]
Finally from (46), Lemma 3 and (47) we obtain, respectively
[TABLE]
∎
Lemma 5**.**
For we have
[TABLE]
Proof.
For such that define and . Notice that the event can be decomposed as
[TABLE]
where For we have from assumption (11) and properties of orthogonal projection
[TABLE]
so . The second event in the sum (50) is obviously contained in Now, we show that this event is also contained in . To do it, we use argumentation similar to van de Geer, [28]. Thus, we define
[TABLE]
The random vector belongs to because
[TABLE]
Using convexity of the loss function we have
[TABLE]
Using this fact as well as assumption (11), Schwartz inequality and properties of the orthogonal projection we get
[TABLE]
It gives us
[TABLE]
Therefore, we obtain
[TABLE]
From assumptions of Theorem 2 we know that that gives us .
On the other hand, , hence we obtain for and
[TABLE]
Finally
[TABLE]
∎
A.5 Proof of Theorem 3
The idea of the proof is similar to the proof of Theorem 2. However, here we consider the linear model with the subgaussian noise that allows us to apply simpler and better arguments than in the general case of GLM. Indeed, while working with the event we can use the result for the scalar product (Lemma 2 (i)) instead of the one for quadratic forms (Lemma 2 (ii)). Besides, when considering we need to upper bound probability of instead of Obviously, probability of the former event is smaller.
As previously, we work with the error decomposition (39). Probability of the event can be bounded in the same way as in the proof of Theorem 2, if we put . However, working with subsets and supersets of is different.
We start with subsets of For we denote
[TABLE]
and
[TABLE]
Notice that is equivalent to
[TABLE]
Whence
[TABLE]
where and by the definiton of Thus, from Lemma 2 (i) we have
[TABLE]
For the penultimate inequality we use again the inequality for and the fact that probability is not greater than . For the last inequality above we used
[TABLE]
which is satisfied for
[TABLE]
where
[TABLE]
It is easy to check that (52) follows from the assumption as
[TABLE]
Now, we consider supersets of As previously we consider only such supersets that We have . Using Lemma 2 (iii), as in the proof of Theorem 2, we obtain
[TABLE]
Finally, we get the following inequalities
[TABLE]
∎
A.6 Proof of Theorem 4
In section 3 we state Theorem 4 with simplified constants Its general form is given below:
Fix Assume that (25), (26), (27) and
[TABLE]
Then
[TABLE]
The main difference between proofs of Theorems 2 and 4 is that here we investigate properties of the expected loss instead of the loss It relates to the fact that the former function is more regular. To be more precise, in many parts of the proof we work with expressions of the form
[TABLE]
where is a random vector contained in some ball Clearly, we can transform (55) into
[TABLE]
The first term in (56) can be handled using the regularity of the function given in assumption (25), while to work with the latter we apply assumptions (26) or (27).
As previously, we work with the error decomposition (39) and start with bounding probability of Take and define From (54) we have Consider the following event
[TABLE]
which has probability not less than
[TABLE]
Indeed, if we take and in (26), then using again (54) we obtain and (58) follows.
Similarly to the proof of Bühlmann and van de Geer, [4, Theorem 6.4], we can show that the event (57) implies that
[TABLE]
From (59) and (54) we can obtain separability of the Lasso. Namely, for each that
[TABLE]
Now we consider probability that GIC chooses a submodel of in the second step of the SS algorithm. Using the definition (45) we obtain
[TABLE]
Fix and such that We take arbitrary and recall that It is clear that so we can use (25) for and obtain
[TABLE]
Since the last expression does not depend on we have
[TABLE]
Proceeding as in the proof of Lemma 3 we obtain
[TABLE]
If we take then
[TABLE]
From (54) we have (as in the proof of Lemma 4)
[TABLE]
Take
[TABLE]
which is not smaller than one by (60) and (54). Thus, using (27) with and (60) we have
[TABLE]
Therefore,
[TABLE]
because by (53).
Next, we consider choosing a supermodel of by GIC. Fix the set and denote Therefore, we have
[TABLE]
because and We will prove that is contained in
[TABLE]
for The event can be decomposed as
[TABLE]
It is clear that the first event on the right-hand side of (62) is contained in (61) and the second one is contained in Now we prove that is also contained in (61). Our argumentation is similar to the proof of Lemma 5, but here we take Therefore, we obtain
[TABLE]
Moreover, the following bound
[TABLE]
is implied by (25), because This inclusion comes from (54), because
[TABLE]
[TABLE]
and by the choice of Therefore, we have just proved that is contained in (61). To finish the proof we need sharp enough upper bound of (61). It can be obtained using (27) with that gives us
[TABLE]
Notice that because using (54)
[TABLE]
Thus, we obtain the following bound on probability of choosing a supermodel
[TABLE]
where we use the fact that from (54) and two inequalities and for ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bartlett et al., [2006] Bartlett, P. L., Jordan, M. I., and Mc Auliffe, J. D. (2006). Convexity, classification and risk bounds. Journal of the American Statistical Association , 101:138–156.
- 2Bickel et al., [2009] Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics , 37:1705–1732.
- 3Breheny and Huang, [2011] Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics , 5:232–253.
- 4Bühlmann and van de Geer, [2011] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-dimensional Data . Springer, New York.
- 5Efron et al., [2004] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics , 32:407–499.
- 6Fan and Li, [2001] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association , 96:1348–1360.
- 7Fan et al., [2014] Fan, J., Xue, L., and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics , 42:819–849.
- 8Fan and Tang, [2013] Fan, Y. and Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society. Series B (Statistical Methodology) , 75:531–552.
