Improving Lasso for model selection and prediction

Piotr Pokarowski; Wojciech Rejchel; Agnieszka Soltys; Michal Frej and; Jan Mielniczuk

arXiv:1907.03025·math.ST·January 26, 2021

Improving Lasso for model selection and prediction

Piotr Pokarowski, Wojciech Rejchel, Agnieszka Soltys, Michal Frej and, Jan Mielniczuk

PDF

TL;DR

This paper introduces a new, computationally efficient method to improve Lasso for model selection across various convex loss functions, demonstrating superior accuracy and theoretical consistency compared to existing approaches.

Contribution

The paper proposes a novel model selection algorithm based on ordering Lasso coefficients and using the Generalized Information Criterion, applicable to diverse models and free from unknown parameter reliance.

Findings

01

The method achieves consistent model selection under weak assumptions.

02

Numerical experiments show improved accuracy over concave regularizations.

03

The algorithm is implemented in the R package "DMRnet" and outperforms existing methods.

Abstract

It is known that the Thresholded Lasso (TL), SCAD or MCP correct intrinsic estimation bias of the Lasso. In this paper we propose an alternative method of improving the Lasso for predictive models with general convex loss functions which encompass normal linear models, logistic regression, quantile regression or support vector machines. For a given penalty we order the absolute values of the Lasso non-zero coefficients and then select the final model from a small nested family by the Generalized Information Criterion. We derive exponential upper bounds on the selection error of the method. These results confirm that, at least for normal linear models, our algorithm seems to be the benchmark for the theory of model selection as it is constructive, computationally efficient and leads to consistent model selection under weak assumptions. Constructivity of the algorithm means that, in…

Tables2

Table 1. Table 1: Plan of experiments for linear models.

	n	p	$\overset{̊}{β}$	$ρ$	$σ^{2}$	SNR
N.1.5	100	3000	${\overset{̊}{β}}^{(1)}$	.5	4	2.3
N.1.7	100	3000	${\overset{̊}{β}}^{(1)}$	.7	4	2.6
N.1.9	100	3000	${\overset{̊}{β}}^{(1)}$	.9	4	3
N.2.5	200	2000	${\overset{̊}{β}}^{(2)}$	.5	7	2.4
N.2.7	200	2000	${\overset{̊}{β}}^{(2)}$	.7	7	2.3
N.2.9	200	2000	${\overset{̊}{β}}^{(2)}$	.9	7	2.2

Table 2. Table 2: Plan of experiments for logistic models.

	n	p	$\overset{̊}{β}$	$ρ$
B.1.5	300	3000	${\overset{̊}{β}}^{(1)}$	.5
B.1.7	300	3000	${\overset{̊}{β}}^{(1)}$	.7
B.1.9	300	3000	${\overset{̊}{β}}^{(1)}$	.9
B.2.5	500	2000	${\overset{̊}{β}}^{(2)}$	.5
B.2.7	500	2000	${\overset{̊}{β}}^{(2)}$	.7
B.2.9	500	2000	${\overset{̊}{β}}^{(2)}$	.9

Equations277

λ = \frac{2 σ ^{2} lo g p}{n} (1 + o (1))

λ = \frac{2 σ ^{2} lo g p}{n} (1 + o (1))

\frac{lo g p}{δ} = o (1) and \frac{t}{β ˚ _{min}} \frac{lo g p}{n} = o (1),

\frac{lo g p}{δ} = o (1) and \frac{t}{β ˚ _{min}} \frac{lo g p}{n} = o (1),

\frac{lo g p}{δ} = O (1) and \frac{1}{β ˚ _{min}} \frac{lo g p}{n} = O (1),

\frac{lo g p}{δ} = O (1) and \frac{1}{β ˚ _{min}} \frac{lo g p}{n} = O (1),

E y_{i} = \overset{γ}{˙} (x_{i}^{T} \overset{˚}{β}) for i = 1, 2, \dots, n,

E y_{i} = \overset{γ}{˙} (x_{i}^{T} \overset{˚}{β}) for i = 1, 2, \dots, n,

δ_{k} = J \subset T, ∣ T ∖ J ∣ = k min ∣∣ (I - H_{J}) X_{T} \overset{˚}{β}_{T} ∣ ∣^{2},

δ_{k} = J \subset T, ∣ T ∖ J ∣ = k min ∣∣ (I - H_{J}) X_{T} \overset{˚}{β}_{T} ∣ ∣^{2},

B \equiv B (X, \overset{˚}{β}, \overset{ˉ}{t}) = J \supset T, r (X_{J}) = ∣ J ∣ \leq \overset{ˉ}{t} ⋃ {β_{J} : ∣∣ X_{J} (\overset{˚}{β}_{J} - β_{J}) ∣ ∣^{2} \leq δ_{t - 1}} .

B \equiv B (X, \overset{˚}{β}, \overset{ˉ}{t}) = J \supset T, r (X_{J}) = ∣ J ∣ \leq \overset{ˉ}{t} ⋃ {β_{J} : ∣∣ X_{J} (\overset{˚}{β}_{J} - β_{J}) ∣ ∣^{2} \leq δ_{t - 1}} .

g (β) = i = 1 \sum n γ (x_{i}^{T} β)

g (β) = i = 1 \sum n γ (x_{i}^{T} β)

g (β) \geq g (\overset{˚}{β}) + (β - \overset{˚}{β})^{T} \overset{g}{˙} (\overset{˚}{β}) + \frac{c}{2} (\overset{˚}{β} - β)^{T} X^{T} X (\overset{˚}{β} - β) .

g (β) \geq g (\overset{˚}{β}) + (β - \overset{˚}{β})^{T} \overset{g}{˙} (\overset{˚}{β}) + \frac{c}{2} (\overset{˚}{β} - β)^{T} X^{T} X (\overset{˚}{β} - β) .

E exp (u ε_{i}) \leq exp (σ^{2} u^{2} /2) .

E exp (u ε_{i}) \leq exp (σ^{2} u^{2} /2) .

y_{i} = x_{i}^{T} \overset{˚}{β} + ε_{i}, i = 1, 2, \dots, n,

y_{i} = x_{i}^{T} \overset{˚}{β} + ε_{i}, i = 1, 2, \dots, n,

P (y_{i} = 1) = \frac{exp ( x _{i}^{T} β ˚ )}{exp ( x _{i}^{T} β ˚ ) + 1}, i = 1, \dots, n .

P (y_{i} = 1) = \frac{exp ( x _{i}^{T} β ˚ )}{exp ( x _{i}^{T} β ˚ ) + 1}, i = 1, \dots, n .

c = i min β \in B min exp (x_{i}^{T} β) / (1 + exp (x_{i}^{T} β))^{2},

c = i min β \in B min exp (x_{i}^{T} β) / (1 + exp (x_{i}^{T} β))^{2},

ℓ (β) = i = 1 \sum n [γ (x_{i}^{T} β) - y_{i} x_{i}^{T} β] = g (β) - β^{T} X^{T} y,

ℓ (β) = i = 1 \sum n [γ (x_{i}^{T} β) - y_{i} x_{i}^{T} β] = g (β) - β^{T} X^{T} y,

ℓ (β) \geq ℓ (\overset{˚}{β}) + (\overset{˚}{β} - β)^{T} X^{T} ε + \frac{c}{2} (\overset{˚}{β} - β)^{T} X^{T} X (\overset{˚}{β} - β) .

ℓ (β) \geq ℓ (\overset{˚}{β}) + (\overset{˚}{β} - β)^{T} X^{T} ε + \frac{c}{2} (\overset{˚}{β} - β)^{T} X^{T} X (\overset{˚}{β} - β) .

{\cal C}_{a}=\bigg{\{}\nu\in\mathbb{R}^{p}:|\nu_{\bar{T}}|_{1}\leq\frac{1+a}{1-a}|\nu_{T}|_{1},\,\nu_{j}x_{.j}^{T}\big{[}\vec{\dot{\gamma}}(X(\mathring{\beta}+\nu))-\vec{\dot{\gamma}}(X\mathring{\beta})\big{]}\leq 0,\,j\in\bar{T}\bigg{\}}.

{\cal C}_{a}=\bigg{\{}\nu\in\mathbb{R}^{p}:|\nu_{\bar{T}}|_{1}\leq\frac{1+a}{1-a}|\nu_{T}|_{1},\,\nu_{j}x_{.j}^{T}\big{[}\vec{\dot{\gamma}}(X(\mathring{\beta}+\nu))-\vec{\dot{\gamma}}(X\mathring{\beta})\big{]}\leq 0,\,j\in\bar{T}\bigg{\}}.

\zeta_{a,q}=\inf_{0\neq\nu\in{\cal C}_{a}}\frac{\big{|}X^{T}\big{[}\vec{\dot{\gamma}}(X(\mathring{\beta}+\nu))-\vec{\dot{\gamma}}(X\mathring{\beta})\big{]}\big{|}_{\infty}}{|\nu|_{q}}\>.

\zeta_{a,q}=\inf_{0\neq\nu\in{\cal C}_{a}}\frac{\big{|}X^{T}\big{[}\vec{\dot{\gamma}}(X(\mathring{\beta}+\nu))-\vec{\dot{\gamma}}(X\mathring{\beta})\big{]}\big{|}_{\infty}}{|\nu|_{q}}\>.

∣ \hat{β} - \overset{˚}{β} ∣_{q} \leq (1 + a) λ ζ_{a, q}^{- 1}

∣ \hat{β} - \overset{˚}{β} ∣_{q} \leq (1 + a) λ ζ_{a, q}^{- 1}

2 a_{1}^{- 2} a_{2}^{- 1} σ^{2} lo g p \leq λ^{2} \leq (1 + a_{1})^{- 2} ζ_{a_{1}}^{2} τ^{2} < (1 + a_{1})^{- 2} ζ_{a_{1}}^{2} \overset{˚}{β}_{min}^{2} /4,

2 a_{1}^{- 2} a_{2}^{- 1} σ^{2} lo g p \leq λ^{2} \leq (1 + a_{1})^{- 2} ζ_{a_{1}}^{2} τ^{2} < (1 + a_{1})^{- 2} ζ_{a_{1}}^{2} \overset{˚}{β}_{min}^{2} /4,

\mathbb{P}(\hat{T}_{TL}\neq T)\leq 2\exp\bigg{(}-\frac{(1-a_{2})a_{1}^{2}\lambda^{2}}{2\sigma^{2}}\bigg{)}.

\mathbb{P}(\hat{T}_{TL}\neq T)\leq 2\exp\bigg{(}-\frac{(1-a_{2})a_{1}^{2}\lambda^{2}}{2\sigma^{2}}\bigg{)}.

δ = J \subset T min \frac{∣∣ ( I - H _{J} ) X _{T} β ˚ _{T} ∣ ∣ ^{2}}{∣ T ∖ J ∣}

δ = J \subset T min \frac{∣∣ ( I - H _{J} ) X _{T} β ˚ _{T} ∣ ∣ ^{2}}{∣ T ∖ J ∣}

\frac{2 σ ^{2} lo g p}{a _{3} a _{2} a _{1} c} \lor \frac{σ ^{2} t}{( 1 - a _{1} ) ^{2} c} \leq λ^{2} < \frac{c δ _{t - 1}}{16 ( t ˉ - t )} \land \frac{cδ}{( 1 + 2 ( 1 - a _{1} ) ) ^{2}} \land \frac{ζ _{a_{4}}^{2} β ˚ _{min}^{2}}{4 ( 1 + a _{4} ) ^{2}} .

\frac{2 σ ^{2} lo g p}{a _{3} a _{2} a _{1} c} \lor \frac{σ ^{2} t}{( 1 - a _{1} ) ^{2} c} \leq λ^{2} < \frac{c δ _{t - 1}}{16 ( t ˉ - t )} \land \frac{cδ}{( 1 + 2 ( 1 - a _{1} ) ) ^{2}} \land \frac{ζ _{a_{4}}^{2} β ˚ _{min}^{2}}{4 ( 1 + a _{4} ) ^{2}} .

\mathbb{P}(\hat{T}_{SS}\neq T)\leq 4.5\exp\bigg{(}-\frac{a_{2}(1-a_{1})c\lambda^{2}}{2\sigma^{2}}\bigg{)}.

\mathbb{P}(\hat{T}_{SS}\neq T)\leq 4.5\exp\bigg{(}-\frac{a_{2}(1-a_{1})c\lambda^{2}}{2\sigma^{2}}\bigg{)}.

n \overline{lim} \frac{2 σ ^{2} lo g p / c}{\frac{c δ _{t - 1}}{16 ( t ˉ - t )} \land ( cδ ) \land ( ζ _{a_{4}}^{2} β ˚ _{min}^{2} /16 )} < 1, then P (\hat{T}_{S S} \neq = T) = o (1) .

n \overline{lim} \frac{2 σ ^{2} lo g p / c}{\frac{c δ _{t - 1}}{16 ( t ˉ - t )} \land ( cδ ) \land ( ζ _{a_{4}}^{2} β ˚ _{min}^{2} /16 )} < 1, then P (\hat{T}_{S S} \neq = T) = o (1) .

λ = 2 σ^{2} lo g p (1 + o (1))

λ = 2 σ^{2} lo g p (1 + o (1))

λ = (lo g p) / (2 c) (1 + o (1)),

λ = (lo g p) / (2 c) (1 + o (1)),

\frac{2 σ ^{2} lo g p}{a ^{4}} \leq λ^{2} < \frac{ζ _{a}^{2} β ˚ _{min}^{2} \land δ}{4 ( 1 + a ) ^{2}} .

\frac{2 σ ^{2} lo g p}{a ^{4}} \leq λ^{2} < \frac{ζ _{a}^{2} β ˚ _{min}^{2} \land δ}{4 ( 1 + a ) ^{2}} .

P(\hat{T}_{SS}\neq T)\leq 5\exp\bigg{(}-\frac{a^{2}(1-a^{2})\lambda^{2}}{2\sigma^{2}}\bigg{)}.

P(\hat{T}_{SS}\neq T)\leq 5\exp\bigg{(}-\frac{a^{2}(1-a^{2})\lambda^{2}}{2\sigma^{2}}\bigg{)}.

\frac{lo g p}{δ} = o (1) and \frac{t lo g p}{β ˚ _{min}} = o (1) .

\frac{lo g p}{δ} = o (1) and \frac{t lo g p}{β ˚ _{min}} = o (1) .

\frac{lo g p}{δ} = O (1) and \frac{lo g p}{β ˚ _{min}} = O (1) .

\frac{lo g p}{δ} = O (1) and \frac{lo g p}{β ˚ _{min}} = O (1) .

P (ε^{T} ν /∣∣ ν ∣∣ \geq τ) \leq exp (- \frac{τ ^{2}}{2 σ ^{2}}),

P (ε^{T} ν /∣∣ ν ∣∣ \geq τ) \leq exp (- \frac{τ ^{2}}{2 σ ^{2}}),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Improving Lasso for Model Selection and Prediction

Piotr Pokarowski 111Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, [email protected]. The research was supported by the Polish National Science Center grant no. 2015/17/B/ST6/01878. ,Wojciech Rejchel 222Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100, Toruń, Poland, [email protected]. The research was supported by the Polish National Science Center grant no. 2015/17/B/ST6/01878 , Agnieszka Sołtys 333Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, [email protected],

Michał Frej444Institute of Applied Mathematics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, [email protected], Jan Mielniczuk 555Institute of Computer Sciences, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland, [email protected]

Abstract

It is known that the Thresholded Lasso (TL), SCAD or MCP correct intrinsic estimation bias of the Lasso. In this paper we propose an alternative method of improving the Lasso for predictive models with general convex loss functions which encompass normal linear models, logistic regression, quantile regression or support vector machines. For a given penalty we order the absolute values of the Lasso non-zero coefficients and then select the final model from a small nested family by the Generalized Information Criterion. We derive exponential upper bounds on the selection error of the method. These results confirm that, at least for normal linear models, our algorithm seems to be the benchmark for the theory of model selection as it is constructive, computationally efficient and leads to consistent model selection under weak assumptions. Constructivity of the algorithm means that, in contrast to the TL, SCAD or MCP, consistent selection does not rely on the unknown parameters as the cone invertibility factor. Instead, our algorithm only needs the sample size, the number of predictors and an upper bound on the noise parameter. We show in numerical experiments on synthetic and real-world data sets that an implementation of our algorithm is more accurate than implementations of studied concave regularizations. Our procedure is contained in the R package ,,DMRnet” and available on the CRAN repository.

List of keywords: convex loss function, empirical process, generalized information criterion, high-dimensional regression, penalized estimation, selection consistency

1 Introduction

Sparse high-dimensional predictive models, where the number of true predictors $t$ is significantly smaller than the sample size $n$ and the number of all predictors $p$ greatly exceeds $n,$ have been a focus of research in statistical machine learning in recent years. The Lasso algorithm, that is the minimum loss method regularized by sparsity inducing the $\ell_{1}$ penalty, is the main tool of fitting such models [27, 4]. However, it has been shown that the model selected by the Lasso is usually too large and that for asymptotically consistent model selection it requires the irrepresentable condition on an experimental matrix [22, 39, 25] which is too restrictive in general. The model’s dimension can be reduced without loss of the quality using the Thresholded Lasso (TL) algorithm, which selects variables with largest absolute values of the Lasso coefficients [34, 40] or by solving more computationally demanding minimization of a loss with a folded concave penalty (FCP) as SCAD [6], MCP [35] or capped $l_{1}$ -penalty [37, 25]. TL, FCP and similar methods lead to consistent selection under weaker assumptions such as the restricted isometry property [35, 37, 36, 25, 32, 7, 33]. In [23] one introduced an algorithm called Screening–Ordering–Selection (SOS) for linear model selection, which reduces successfully the model selected by the Lasso. SOS is based on the variant of TL proposed by [40] and leads to consistent model selection under assumptions similar to the restricted isometry property.

In the paper we consider two algorithms that improve the Lasso, i.e. they are model selection consistent under weaker assumptions than the Lasso. Moreover, the considered procedures are computationally simpler than FCP methods that use non-convex penalties. The first algorithm is the well-known Thresholded Lasso. Its model selection consistency in the normal linear model is proven in [34, Theorem 8] provided that conditions, which seem to be ,,minimal”, are satisfied. Our first contribution is an extension of this result (given in Theorem 1) to Generalized Linear Models (GLM). However, the TL algorithm is not constructive, because it is not known how to choose the threshold. Therefore, in the current paper we propose the second improvement of the Lasso which is the Screening–Selection (SS) algorithm. It is a two-step procedure: in the first step (screening) one computes the Lasso estimator $\hat{\beta}$ with penalty $\lambda$ and orders its nonzero coefficients according to their decreasing absolute values. In the second step (selection) one chooses the model which minimizes the Generalized Information Criterion (GIC) with penalty $\lambda^{2}/2$ in a nested family induced by the ordering. Thus, the SS algorithm (Algorithm 1 below) can be viewed as the Lasso with adaptive thresholding based on GIC. We prove that this procedure is model selection consistent in a wide class of models containing linear models with the subgaussian noise (Theorem 3), GLM (Theorem 2) and models with convex (possibly nondifferentiable) loss functions (Theorem 4) as in quantile regression or support vector machines. The obtained results are exponential upper bounds on the selection error of SS in terms of $\lambda$ , which parallel the known bounds for TL [34, Theorem 8] or FCP [7, Corollary 3 and 5]. For GLM our results are obtained on the basis of exponential inequalities for subgaussian random variables. In the case of predictive models with general convex loss functions we use methods from the empirical process theory.

The SS algorithm is a simplification and a generalization of the Screening–Ordering–Selection (SOS) algorithm from [23] that was proposed only for normal linear models. We show that the ordering step in SOS can be done using separability of the Lasso instead of using $t$ -statistics. Separability means that, under mild assumptions, coordinates of the Lasso corresponding to relevant predictors are larger in absolute values than those corresponding to irrelevant predictors. Moreover, we establish that the SS algorithm is model selection consistent beyond normal linear models. The new procedure can be applied to the various statistical predictive models including models with quantitative or qualitative response variables. We can use ”classical” models (the normal linear model, the logistic model) as well as modern problems involving piecewise-linear loss functions (as in quantile regression or support vector machines).

Our results state that, in contrast to TL and FCP methods, SS is constructive in the linear model with the subgaussian noise, because it does not rely on the unknown parameters as the true vector $\mathring{\beta}$ or the cone invertibility factors. Indeed, we will establish in Section 2 that the choice of the tuning parameter $\lambda$ as

[TABLE]

leads to model selection consistency of the SS algorithm. Therefore, $\lambda$ only depends on $n$ , $p$ and $\sigma^{2}$ that is an upper bound on the noise parameter. The assumption that $\sigma^{2}$ is known is common in the literature investigating theoretical properties of variable selection procedures [34, 4, 7].

Moreover, we will also prove that model selection consistency of the SS algorithm holds under weaker conditions than for competitive procedures. For instance, the algorithm of [32] requires that

[TABLE]

where $\delta$ is a scaled Kullback-Leibler distance between the true set and its submodels (defined in (15)) and $\mathring{\beta}_{min}$ is the minimal signal strength. However, for the SS procedure we need that

[TABLE]

which, among others, weakens significantly the beta-min condition. More detailed description of the SS algorithm and comparisons to other procedures is given in Section 2. This analysis enables us to claim that for subgaussian linear models the SS algorithm seems to be the benchmark for the theory of model selection as it is constructive, computationally efficient and leads to consistent model selection under weak assumptions. Similarly to TL or FCP, the SS algorithm becomes non-constructive, if we go beyond the linear model with the subgaussian noise. Notice that in (1), (2) and (3) we assume, similarly to [32], that predictors are scaled to have the $l_{2}$ -norm equal to $\sqrt{n}.$ However, in the rest of the paper we will use more convenient notation that the $l_{2}$ -norm of predictors is one. Therefore, the expression $\sqrt{n}$ will not appear in analogs of (1), (2) and (3) in Section 2.

Although TL, FCP or SS algorithms use the Lasso estimators only for one value of the penalty, which is convenient for theoretical analysis, the practical Lasso implementations return coefficient estimators for all possible penalty values as in the R package LARS described in Efron et al., [5] or for a given net of them as in the R package glmnet described in Friedman et al., [10]. Similarly, using a net of penalty values, the FCP algorithm has been implemented for linear models in the R package SparseNet [21]) and for logistic models in the R package ncvreg [3]. Our contribution is also the SSnet algorithm (Algorithm 2 below), which is a practical version of the SS algorithm. SSnet uses glmnet to calculate the Lasso for a net of penalty values, then again it selects the final model from a small family by minimizing GIC. In numerical experiments we investigate properties of SSnet in model selection as well as prediction and compare them to competitive procedures. Using synthetic and real-world data sets we show that SSnet is more accurate than implementations of FCP. The variant of SSnet is contained in the R package ,,DMRnet” and available on the CRAN repository.

GIC is a popular procedure in choosing the final model in variable selection. In the literature there are many papers investigating its properties, for instance [18, 32] in linear models, [8, 15] in GLM, [16] in multivariate linear regression, [38] for SVM and [17] for general convex loss functions. GIC is often applied to pathwise algorithms under a common assumption that the true model is on this path. This condition excludes Lasso-based pathwise algorithms as it is only fulfilled under restrictive assumptions. One can overcome this problem using a three-step procedures, for instance the Lasso and non-convex penalized regression [32] or the Lasso and thresholding [17]. However, it makes the algorithm computationally more complex or one has to find a threshold that recovers the true model on the path, respectively. In contrast, the first step of the proposed procedure is related only to the Lasso. Indeed, we need only that the model with correctly separated predictors is on the path. This is guaranteed for the Lasso under mild assumptions. Therefore, it makes our procedure simpler and computationally more efficient.

The paper is organized as follows: in the next section we describe subgaussian GLM and algorithms that we work with. Moreover, we establish bounds for the selection error of the proposed procedures. These results are extended to models with general convex contrasts in Section 3. In Section 4 we investigate properties of estimators on simulated and real data sets. The paper is concluded in Section 5. All proofs and auxiliary results are relegated to the appendix.

2 Subgaussian Generalized Linear Models

In this section we start with definitions of considered models and estimation criteria. Next, we present model selection algorithms and state exponential upper bounds on their selection errors.

2.1 Models

The way we model data will encompass normal linear and logistic models as premier examples. Our assumptions are stated in their most general form which allows proving exponential bounds for probability of the selection error without obscuring their essentiality. In the paper we consider independent data $(y_{1},x_{1}),(y_{2},x_{2}),\ldots,(y_{n},x_{n})$ , where $y_{i}\in\mathbb{R},~{}x_{i}\in\mathbb{R}^{p}$ for $i=1,2,\dots,n$ . We assume that for some true $\mathring{\beta}\in\mathbb{R}^{p}$ and a known differentiable cumulant function $\gamma:\mathbb{R}\rightarrow\mathbb{R}$

[TABLE]

where $\dot{\gamma}$ denotes the derivative of $\gamma.$ Note that (4) is satisfied in particular by the Generalized Linear Models (GLM) with a canonical link function. and a nonlinear regression with an additive error. Let $\eta_{i}=x_{i}^{T}\mathring{\beta}$ . For $\eta=(\eta_{1},\ldots,\eta_{n})^{T}$ we define $\vec{\gamma}(\eta)=(\gamma(\eta_{1}),\ldots,\gamma(\eta_{n}))^{T}$ and similarly $\vec{\dot{\gamma}}(\eta)=(\dot{\gamma}(\eta_{1}),\ldots,\dot{\gamma}(\eta_{n}))^{T}$ .

Let $X=[x_{.1},\ldots,x_{.p}]=[x_{1},\ldots,x_{n}]^{T}$ be a $n\times p$ matrix of experiment and $J\subseteq\{1,2,\ldots,p\}=F$ be an arbitrary subset of the full model $F$ , ${\bar{J}}=F\setminus{J}$ . As $J$ may be viewed as a sequence of zeros and ones on $F$ , $|J|=|J|_{1}$ denotes cardinality of $J$ . Let $\beta_{J}$ be a subvector of $\beta$ with elements having indices in $J$ , $X_{J}$ be a submatrix of $X$ with columns having indices in $J$ and $r(X_{J})$ denotes the rank of $X_{J}$ . Moreover, let $H_{J}=X_{J}(X_{J}^{T}X^{J})^{-1}X_{J}^{T}$ be an orthogonal projection matrix onto the subspace spanned by columns of $X_{J}$ . Linear model pertaining to predictors being columns of $X_{J}$ will be frequently identified as $J$ . In particular, let $T$ denotes a true model that is $T={\rm supp}(\mathring{\beta})=\{j\in F:\mathring{\beta}_{j}\neq 0\}$ and $t=|T|$ . Next, for $\beta\in\mathbb{R}^{p}$ and $q\geq 1$ let $|\beta|_{q}=(\sum_{j=1}^{p}|\beta_{j}|^{q})^{1/q}$ be the $\ell_{q}$ norm. The only exception is the $\ell_{2}$ norm, for which we will use the special notation $||\beta||.$

In the further argumentation important roles are played by

[TABLE]

for $k=1,\ldots,t-1.$ They describe how much the true value $X\mathring{\beta}=X_{T}\mathring{\beta}_{T}$ differs from its projections on submodels of true set $T.$ For the normal linear model they are related to the Kullback-Leibler divergence between two normal densities [23, Section 3]. Finally, we define the sum of balls

[TABLE]

We assume that $t\leq\bar{t}<n\wedge p,$ which implies that (6) consists of sparse vectors.

We assume also that a total cumulant function

[TABLE]

is convex and, additionally, strongly convex at $\mathring{\beta}$ in a sense that there exists $c\in(0,1]$ such that for all $\beta\in\mathbb{B}$ we have

[TABLE]

We note that this crucial property of the total cumulant is slightly weaker than an usual definition of strong convexity which would have a second derivative of $g$ at $\mathring{\beta}$ in place of $X^{T}X$ . Let us remark that $\dot{g}(\beta)=X^{T}\vec{\dot{\gamma}}(X\beta)$ .

Moreover, we assume that centred responses $\varepsilon_{i}=y_{i}-\mathbb{E}y_{i}$ have a subgaussian distribution with the same constant $\sigma$ , that is for $i=1,2,\dots,n$ and $u\in\mathbb{R}$ we have

[TABLE]

Examples. Two most important representatives of GLM are the normal linear model and logistic regression. In the normal linear model

[TABLE]

where the noise variables $\varepsilon_{i}$ are independent and normally distributed $N(0,\mathring{\sigma}^{2}).$ Therefore, assumptions (4), (8) and (9) are satisfied with $\gamma(\eta_{i})=\eta_{i}^{2}/2,$ $c=1$ and any $\sigma\geq\mathring{\sigma},$ respectively. In logistic regression the response variable is dichotomous $y_{i}\in\{0,1\}$ and we assume that

[TABLE]

In this model assumptions (4) and (8) are satisfied with $\gamma(\eta_{i})=\log(1+\exp(\eta_{i}))$ and

[TABLE]

respectively. Finally, as $(\varepsilon_{i})$ are bounded random variables, then (9) is satisfied with any $\sigma\geq 1/2$ .

2.2 Fitting Algorithms

For estimation of $\mathring{\beta}$ we consider a loss function

[TABLE]

where $y=(y_{1},\ldots,y_{n})^{T}$ . It is easy to see that $\dot{\ell}(\beta)=X^{T}(\vec{\dot{\gamma}}(X\beta)-y)$ , and consequently $\dot{\ell}(\mathring{\beta})=-X^{T}\varepsilon$ for $\varepsilon=(\varepsilon_{1},\ldots,\varepsilon_{n})^{T}$ . Moreover, observe that $\mathring{\beta}=\text{argmin}_{\beta}\mathbb{E}\ell(\beta)$ and (8) is equivalent to strong convexity of $\ell$ at $\mathring{\beta}$ for all $\beta\in\mathbb{B}$

[TABLE]

Let $\widehat{\beta}^{ML}_{J}=\text{argmin}_{\beta_{J}}\ell(\beta_{J})$ denotes a minimum loss estimator based on $y$ and $\{x_{.j},j\in J\}$ . Denote $\ell_{J}=\ell(\widehat{\beta}^{ML}_{J})$ . Note that for GLM estimator $\widehat{\beta}^{ML}_{J}$ coincides with a maximum likelihood estimator for a model pertaining to $J$ .

In Algorithm 1 we present two selection procedures: the first one is the well-known Thresholded Lasso (TL) method which consists of retaining only these variables for which absolute values of their Lasso estimators exceed a certain threshold $\tau$ . The second one is novel and named the Screening–Selection (SS) procedure. It finds the minimal value of the Generalized Information Criterion (GIC) for the nested family which is constructed using ordering of the nonzero Lasso coefficients.

In the classical (low-dimensional) case model selection is often based on the Bayesian information criterion that is similar to the second step of Algorithm 1 with $\lambda^{2}=\log n.$ Model selection properties of this procedure are proven in [24]. In the current paper we validate that information criteria are also useful in the high-dimensional case. As it will be shown the important difference is that in the high-dimensional scenario parameter $\lambda^{2}$ should be proportional to $\log p$ instead of $\log n$ (at least for linear models with the subgaussian noise).

2.3 A Selection Error Bound for TL

In order to make the exposition simpler we assume that columns of $X$ are normalized in such a way that $||x_{.j}||=1$ for $j=1,\ldots,p$ . Moreover, let $\mathring{\beta}_{min}=\min_{j\in T}|\mathring{\beta}_{j}|$ .

First we generalize a characteristic of linear models which quantifies the degree of separation between the true model $T$ and other models, which was introduced in Ye and Zhang [34]. For $a\in(0,1)$ consider a signed pseudo-cone

[TABLE]

For $q\geq 1$ and $a\in(0,1)$ let a Sign-Restricted Pseudo-Cone Invertibility Factor (SCIF) be defined as

[TABLE]

Notice that in the linear model the numerator of (13) is simply $|X^{T}Xv|_{\infty}.$ In the case $n>p$ one usually uses the minimal eigenvalue of the matrix $X^{T}X$ to express the strength of correlations between predictors. Obviously, in the high-dimensional scenario this value is zero. Therefore, SCIF can be viewed as an useful analog of the minimal eigenvalue for the case $p>n.$

We let $\zeta_{a}=\zeta_{a,\infty}$ . In comparison to more popular restricted eigenvalues [2] or compatibility constants [29], variants of SCIF enable sharper $\ell_{q}$ estimation error bounds of the Lasso for $q>2$ [34, 14, 36].

The following lemma is a main tool in proving model selection consistency of the TL algorithm. For the linear model it was stated in [34, Theorem 3]. We generalize it to GLM.

Lemma 1.

If $\ell$ is convex and $a\in(0,1)$ , then on $\{|X^{T}\varepsilon|_{\infty}\leq a\lambda\}$ we have

[TABLE]

.

Next, we bound the selection error of the TL algorithm in GLM.

Theorem 1.

If $\ell$ is convex, $(\varepsilon_{i})_{i}$ are subgaussian with $\sigma$ and for numbers $a_{1},a_{2}\in(0,1)$ we have

[TABLE]

then

[TABLE]

Constant $a_{2}$ is used to remove multiplicative factor $p$ from the exponential bound at the expense of slightly diminishing the exponent in (14). Note that assumptions of Theorem 1 stipulate that the truncation level $\tau$ is contained in the interval $[(1+a_{1})\lambda\zeta_{a_{1}}^{-1},\mathring{\beta}_{min}/2)$ , whose both endpoints are unknown, because $\zeta_{a_{1}}$ and $\mathring{\beta}_{min}$ are unknown. Therefore, in practice $\tau$ cannot be chosen that makes the TL algorithm non-constructive. Analogous theorems for FCP for linear models and logistic regression can be found in Fan et al. [7, Corollary 3 and Corollary 5 ]. However, they require an additional assumption on the minimal eigenvalue of $X_{T}^{T}X_{T}$ and the proof is more difficult. Moreover, the choice of the tuning parameters in these methods also requires unknown $\zeta_{a}$ or $\mathring{\beta}_{min}$ .

2.4 A Selection Error Bound for SS

A scaled Kullback-Leibler (K-L) distance between $T$ and its submodels is given in Shen et al., [25, 26] as

[TABLE]

or just $\min\limits_{k=1,\ldots,t-1}\delta_{k}/k,$ if we use notation (5). Different variants of the K–L distance have been often used in the consistency analysis of selection algorithms [23, Section 3.1], but $\delta$ defined in (15) seems to lead to optimal results [26, Theorem 1]. Now we introduce technical constants $a_{1},\ldots,a_{4}$ that allow to avoid ad-hoc coefficients in the main results and simplify asymptotic considerations. For given $1/2<a_{1}<1$ define $a_{2}=1-(1-\log(1-a_{1}))(1-a_{1})$ , $a_{3}=2-1/a_{1}$ and $a_{4}=\sqrt{a_{1}a_{2}}$ . Note that $a_{2},a_{3}$ and $a_{4}$ are functions of $a_{1}$ and obviously if $a_{1}\rightarrow 1$ , then $a_{2},a_{3},a_{4}\rightarrow 1$ . Moreover, for two real numbers $a,b$ we denote $a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b).$

Theorem 2.

Assume (4), (8), (9) and that for $a_{1}\in(1/2,1)$ we have

[TABLE]

Then

[TABLE]

Selection consistency, that is asymptotic correctness of $\hat{T}_{SS},$ now easily follows.

Corollary 1.

Assume that $t=o(\log p)$ for $n\rightarrow\infty.$ Moreover, set $a_{1}=1-\sqrt{\frac{t}{2\log p}}$ and $\lambda^{2}=\frac{2\sigma^{2}\log p}{a_{3}a_{2}a_{1}c}$ . Then $\lambda^{2}=c^{-1}(2\sigma^{2}\log p)(1+o(1))$ . If additionally $\mathring{\beta}$ is asymptotically identifiable, i.e.

[TABLE]

Remark 1.

Theorem 2 determines conditions on GLM and the SS algorithm for which the bound (17) on the selection error of SS holds. Corollary 1 states the easy interpretable result: if the true model is asymptotically identifiable, then SS with minimal admissible $\lambda$ is asymptotically consistent. Although the identifiability condition is not effectively verifiable, $\lambda$ can be explicitly given for linear models as

[TABLE]

and for logistic models as

[TABLE]

since $\sigma\geq 1/2$ . Let us consider subgaussian linear models and assume that $\sigma^{2}$ is known, which is a common condition in the literature investigating theoretical properties of variable selection procedures [34, 4, 7]. Then the parameter $\lambda$ of SS is given constructively provided that $t=o(\log p)$ . In contrast, TL or FCP are not constructive, because they require an additional parameter $\tau$ , that depends on unknown identifiability constants as SCIF.

In the literature concerning the Lasso and its modifications the smallest possible $\lambda$ is taken as the default value, because it makes the algorithm asymptotically consistent for the largest class of models (the same approach is adopted for prediction and estimation). Such $\lambda$ will be called the safest choice. It is interesting that for linear models GIC with $\lambda$ given by (18) was originally derived from the minimax perspective by Foster and George [9]. They called such selection the risk inflation criterion (RIC), because it asymptotically minimises the maximum predictive risk with respect to the oracle for the orthogonal matrix of experiment $X$ .

Remark 2.

A generic combination of the penalized log-likelihood (as TL or FCP) with GIC is considered in [8]. In the first step the method computes a path of models indexed by $\lambda$ and next GIC is used to choose the final model. They assume that the true model has to be on this path and use GIC with the penalty asymptotically larger than $\log p$ . Thus, their results are weaker and need more restrictive assumptions, which are given in [8, Section 6]. For instance, if the cumulant function $\gamma,$ defined in (4), has uniformly bounded second derivative, then we do not require its third derivative in contrast to [8]. Moreover, using Corollary 1 and assuming that the number $c$ is constant as in [8], the last step of our algorithm uses GIC with the safest choice $\lambda$ instead of $K\log p$ as in [8] for some $K\rightarrow\infty$ with $n\rightarrow\infty$ . It is worth to note that their results are obtained using the empirical processes theory, while the proof of Theorem 2 is based on elementary exponential inequalities for subgaussian variables given in subsection 2.6.

2.5 A Selection Error Bound for SS in Subgaussian Linear Models

In this part of the paper we show that SS is constructive for the linear model with the subgaussian noise. The main difference between Theorem 2 and the following Theorem 3 is that the lower bound on $\lambda^{2}$ in Theorem 3 does not depend on the dimension of the true model $t$ and the parameter $c$ (because $c=1$ for subgaussian linear models).

Theorem 3.

Consider the linear model with the subgaussian noise. Assume that there exists $0<a\leq\sqrt{1-(1+\log(2\log p))/(2\log p)}$ such that

[TABLE]

Then

[TABLE]

The safest choice of $\lambda$ in Theorem 3 does not depend on unknown expressions, which justifies the claim that our algorithm is constructive in the linear model with the subgaussian noise. Next, we compare the above result to Wang et al., [32] and Fan et al., [7].

Remark 3.

The algorithm in Wang et al., [32] has three steps: the Lasso, non-convex penalized linear regression and GIC with the parameter $C\log p,$ where $C\rightarrow\infty$ with $n\rightarrow\infty$ . Obviously, the SS algorithm is computationally faster, because it does not need the most time-consuming second step. Moreover, the first two steps of their algorithm form a variant of FCP, so their parameters are not given constructively as we explain in the discussion after Theorem 1. Finally, their assumptions are stronger than ours. Indeed, conditions in Wang et al., [32, Theorem 3.2, Theorem 3.5] lead to $n/(\delta t)=O(1)$ and $t\log p/n=o(1),$ where $\delta$ is defined in (15). From these two facts and Wang et al., [32, expression (3.4)] we obtain

[TABLE]

If we fix $\sigma^{2}$ and $\zeta_{a}^{2}$ as in Wang et al., [32, expression (3.4)], then in Theorem 3 we require only that

[TABLE]

In Wang et al., [32, Theorem 3.6] the second condition in (21) is weakened at the price of stronger conditions on the the design matrix. However, their assumptions are still more restrictive assumptions than ours.

Remark 4.

Fan et al., [7, Subsection 3.1] consider an algorithm based on non-convex penalization in linear models. Their procedure is model selection consistent, but is also not constructive. Indeed, in Fan et al., [7, Corollary 3] the parameter $\lambda$ has to be in the interval, whose both endpoints depends on unknown $||\beta_{\mathcal{A}}^{*}||_{min},\sqrt{s},\kappa_{linear}$ that are analogs of $\mathring{\beta}_{min},t,\xi_{a_{1}}$ in the current paper. In Fan et al., [7, Remark 5] it is shown that $\sqrt{s}$ can be avoided, but their algorithm is still not constructive in contrast to the SS algorithm.

2.6 Exponential Bounds for Subgaussian Vectors

This part of the paper is devoted to exponential inequalities for subgaussian random vectors. They are interesting by themselves and can be used in different problems than we consider. In the current paper they are main probabilistic tools that are needed to prove Theorems 1-3. Specifically, in lemma 2 (iii) we generalize the Wallace inequality for $\chi^{2}$ distribution [31] to the subgaussian case using the inequality for the moment generating function in lemma 2 (ii). The last inequality is proved by the decoupling technique as in the proof of Theorem 2.1 in [13].

Lemma 2.

Let $\varepsilon\in\mathbb{R}^{n}$ be a vector of zero-mean independent errors having subgaussian distribution with a constant $\sigma$ , $\nu\in\mathbb{R}^{n}$ , $0<a<1$ and $H$ be an orthogonal projection such that $tr(H)=m$ . Then

(i)

for $\tau>0$

[TABLE]

(ii)

[TABLE]

(iii)

for $\tau>1$

[TABLE]

3 Extension to General Convex Contrasts

In this part of the paper we investigate properties of the SS algorithm beyond GLM as well. The main assumption, that will be required, is convexity of the ,,contrast” function. We show that the SS algorithm is very flexible procedure that can be applied succesfully to the various spectrum of practical problems.

First, for $\beta\in\mathbb{R}^{p}$ and a contrast function $\phi:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ we define a loss function

[TABLE]

Considering the normal linear model one usually uses the quadratic contrast

[TABLE]

as we have done in Section 2. However, it is well known that the quadratic contrast is very sensitive to the distribution of errors $\varepsilon_{i}$ and does not work well, if this distribution is heavy-tailed and outliers appear. To overcome this difficulty we can use the absolute contrast

[TABLE]

Next, working with dichotomous $y_{i}$ we can apply logistic regression that belongs to GLM and has been considered in Section 2. In this case we have

[TABLE]

But there are also very popular and efficient algorithms called support vector machines (SVM) that use, for instance, the following quadratic hinge contrast

[TABLE]

Our main assumption is that the contrast function $\phi$ is convex with respect to $\beta.$ All examples given above satisfy this property. Notice that they need not be differentiable nor decompose, as in (10) for GLM, into the sum of the nonrandom cumulant $\gamma$ and the random linear term $y_{i}\beta^{T}x_{i}$ . The SS algorithm for convex contrasts is the same as in Algorithm 1.

We add few definitions and notations to those in the previous parts of the paper. We start with defining two balls: the first one is the $l_{1}$ -ball $B_{1}(r)=\{\beta:|\beta-\mathring{\beta}|_{1}\leq r\}$ with radius $r>0.$ The second one is the $l_{2}$ -ball $B_{2,J}(r)=\{\beta_{J}:||X_{J}(\mathring{\beta}-\beta_{J})||^{2}\leq r^{2}\}$ with radius $r>0,$ where $J$ is a (sparse) subset of $\{1,\ldots,p\}$ such that $T\subset J,r(X_{J})=|J|\leq\bar{t}.$ Recall that $\mathring{\beta}$ is, as previously, a minimizer of $\mathbb{E}\ell(\beta).$ Besides, let $B_{J}=B_{2,J}\left(\sqrt{\delta_{t-1}}\right),$ where $\delta_{t-1}$ is defined in (5). In further argumentation key roles are played by:

[TABLE]

and

[TABLE]

which are empirical processes over $l_{1}$ and $l_{2}$ -balls, correspondingly. We need also the compatibility factor that is borrowed from [4] and is an analog of SCIF defined in (13). Namely, for arbitrary $a\in(0,1)$ a compatibility factor is

[TABLE]

where $\mathcal{C}_{a}$ is a simplified version of (12), namely

[TABLE]

Convexity of the contrast function is the main assumption in this section. However, similarly to the previous section we need also the following strong convexity of $\mathbb{E}\ell(\beta)$ at $\mathring{\beta}$ : there exists $c_{1}\in(0,1]$ ( $c_{2}\in(0,1]$ , respectively) such that for each $\beta_{1}\in B_{1}(\mathring{\beta}_{min})$ ( $\beta_{2}\in\mathbb{B}$ defined in (6), respectively) we have for $i=1,2$

[TABLE]

Notice that in (25) we require the expected loss $\mathbb{E}\ell(\beta),$ not the loss $\ell(\beta),$ to be strongly convex. Therefore, the condition (25) can be satisfied easily even if the contrast function $\phi$ is not differentiable, for instance for absolute or quadratic hinge contrasts (see Remark 6). For GLM in section 2 the condition (11) is equivalent to (25) for $i=2,$ that will be explained in Remark 7.

To prove exponential bounds for GLM in subsection 2.4 we use subgaussianity that allows us to obtain probabilistic inequalities in Lemma 2. In this section we need the analog of (22) of the form: there exists $L>0$ and constants $K_{1},K_{2}>0$ such that for each $0<r\leq\mathring{\beta}_{min}$ and $z\geq 1$ we have

[TABLE]

Besides, the inequality (23) is replaced by the following: there exists $L>0$ and constants $K_{3},K_{4}>0$ such that for each $0<r\leq\sqrt{\delta_{t-1}},\;z\geq 1$ and $J$ such that $T\subset J,r(X_{J})=|J|\leq\bar{t}$ we have

[TABLE]

The detailed comparison between assumptions and results for models in this section and those for GLM in Theorem 2 is given in Remarks 7 and 8 after the main result of this section, which is now stated.

Theorem 4.

Fix $a_{1},a_{2}\in(0,1)$ and let $K_{i}$ be universal constants. Assume that (25), (26), (27) and

[TABLE]

Then

[TABLE]

Theorem 4 bounds exponentially the selection error of the SS algorithm. It extends Theorem 2 to the wide class of convex contrast functions. In particular, these contrasts can be nondifferentiable as in quantile regression or SVM. In Remarks 5 and 6 we discuss assumptions (26), (27) and (25) of Theorem 4, respectively. The detail comparison to Theorem 2 is given in Remarks 7 and 8. There we argue that Theorem 4 applied to GLM is only slightly worse than Theorem 2, which is devoted to GLM.

Remark 5.

The important assumptions of Theorem 4 are conditions (26) and (27). They can be proved using tools from the empirical process theory such that concentration inequalities [20], the Symmetrization Lemma [30, Lemma 2.3.1] and the Contraction Lemma [19, Theorem 4.12]. It is quite remarkable that to get (26) or (27) we need only one new condition. Namely, we need that the contrast function is Lipschitz in the following sense: there exists $L>0$ such that for all $x_{i},y,0<r\leq\mathring{\beta}_{min}$ and $\beta,\tilde{\beta}\in B_{1}(r)$

[TABLE]

Indeed, (26) with $K_{1}=8\sqrt{2},K_{2}=4$ follows from the above-mentioned tools and can be established as in [4, Lemma 14.20] combined with [20, Theorem 9]. On the other hand, to get (27) with $K_{3}=8,K_{4}=2$ we need (31) to be satisfied for all $x_{i},y,0<r\leq\sqrt{\delta_{t-1}},J:T\subset J,r(X_{J})=|J|\leq\bar{t}$ and $\beta,\tilde{\beta}\in B_{J}.$ This fact can be obtained as in [4, Lemma 14.19] combined again with [20, Theorem 9]. Notice that logistic and absolute contrast functions satisfy (31) with $L=2$ and $L=1,$ respectively. The property (31) is also satisfied for the quadratic hinge contrast, but in this case $L$ depends on $n$ .

Remark 6.

The condition (25) is often called the ”margin condition” in the literature. For quadratic and logistic contrasts we have considered it in the previous section. To prove it for SVM with the quadratic hinge contrast one can use methods based on the modulus of convexity [1, Lemma 7]. For linear models with the absolute contrast it can be established analogously to [17, Lemma 3], if densities of noise variables are lower-bounded in a neighbourhood of the origin.

Remark 7.

*Similarities to Theorem 2. *

We compare Theorem 2 to Theorem 4 applied to GLM. We can calculate that for quadratic and logistic contrasts we have

[TABLE]

and

[TABLE]

where $g$ is a total cumulant function (7). Therefore, the condition (25) for $l_{2}$ -balls is the same as (11). Besides, we have for $\varepsilon=y-\mathbb{E}y$ that

[TABLE]

In Theorem 2 we supppose that $\varepsilon_{1},\varepsilon_{2},\ldots,\varepsilon_{n}$ are independent and subgaussian. These assumptions are used to establish (22) and (23), that are crucial in the proof of Theorem 2. Notice that (32) implies that for GLM we have $Z(r)/r\leq\left|X^{T}\varepsilon\right|_{\infty}$ and $U_{J}(r)/r\leq\sqrt{\varepsilon^{T}H_{J}\varepsilon}.$ Therefore, assumptions (26) and (27) are analogs of (22) and (23), respectively. Moreover, the condition (28) and the result in (30) differ only in constants from their counteparts in Theorem 2.

Remark 8.

*Differences from Theorem 2. *

The SS algorithm consists of two steps. In the last paragraph we have clarified that the theoretical analysis of the second step (selection) in Theorem 2 is not significantly simpler than for general models with convex contrasts. However, we can find differences while investigating the first step (screening based on the lasso). Working with GLM we exploit differentiability of contrasts and the useful decomposition (10). Due to that the right-hand side of (16) is usually better than (29), because in Theorem 4 we have to assume (25) also with respect to $l_{1}$ -balls and $c_{1}^{2}$ appears in (29).

4 Experiments

While convenient for theoretical analysis TL, FCP or SS algorithms use the Lasso estimators only for one value of the penalty, the practical Lasso implementations return coefficient estimators for a given net of it as in the R package glmnet described in [10]. Similarly, using a net of penalty values, the Minimax Concave Penalty (MCP) algorithm, a popular realization of FCP, has been implemented for linear and logistic models in the R packages SparseNet [21] and ncvreg [3]), respectively. Thus, we propose a net modification of the SS algorithm, which we call SSnet and state in Algorithm 2. In the first step this procedure calculates the Lasso for a net of $m$ values of an input grid: $\lambda^{L}_{1},\ldots,\lambda^{L}_{m}$ . Then the final model is chosen using GIC in a similar way to the SS algorithm.

We remind that for linear models $\lambda$ depends on the noise variance, which is unknown in practice. Estimation of $\sigma^{2}$ is quite difficult, especially in the high-dimensional scenario. Computing the Lasso for the whole net of tuning parameters is a simple way to overcome this problem. Obviously, in the second step of our procedure we use GIC, so we require an estimator of $\sigma^{2}.$ But in this step we work with models, whose dimensionality is significantly reduced by the Lasso. Therefore, the classical estimator of $\sigma^{2}$ , which uses residual sum of squares, can be applied.

We performed numerical experiments fitting sparse linear and logistic models to high-dimensional benchmark simulations and real data sets. We investigated properties in model selection and prediction of the SSnet algorithm and its competitor, which was a net version of MCP. The fair comparison of these two net algorithms is difficult, because their input grids depend on these algorithm themselves and their runs. Therefore, we decide to introduce an output grid, which is the same for both algorithms $\lambda_{1}<\ldots<\lambda_{r}$ in Algorithm 2. We compare the algorithms using graphs, which describe the interplay between a prediction error (PE) and a model dimension (MD). In these graphs we show the averaged PE and MD over simulation runs (data sets) for distinct values from the output grid. In particular, we want to find the minimum PE and the optimal penalty. Basing only on input grids, it would be more difficult to interpret such averaging as well as to locate the PE minimum. Finally, using such plots makes our procedure more user-friendly. Namely, observing the relation between dimensionality of a model and its prediction accuracy a user can decide, which estimator is best suited for both: describing the phenomena and his/her expectations.

4.1 Simulated Data

For linear models we studied the performance of two algorithms: SSnet and MCP computed using the R package SparseNet [21] for the default 9 values of $\gamma$ and 50 values of $\lambda$ . Our algorithm used the R package glmnet [10] to compute the Lasso estimators for 50 lambdas on a log scale.

We generated samples $(y_{i},x_{i})$ , $i=1,\ldots,n$ from the normal linear model. Two vectors of parameters were considered: $\mathring{\beta}^{(1)}=(3,1.5,0,0,2,0_{p-5}^{T})^{T},$ as in [32] as well as $\mathring{\beta}^{(2)}=(0_{p-10}^{T},s_{1}\cdot 2,s_{2}\cdot 2,\ldots,s_{10}\cdot 2)^{T}$ , where $s_{l}$ equals 1 or -1 with equal probability, $l=1,\ldots,10$ chosen separately for every run as in experiment 2 in [33]. The rows of $X$ were iid $p$ -dimensional vectors $x_{i}\sim N(0_{p},\Xi)$ . We considered auto-regressive structure of covariance matrix that is $\Xi=\big{(}\rho^{|i-j|}\big{)}_{i,j=1}^{p}$ for $\rho=0.5,0.7,0.9$ . The columns of $X$ were centred and normalized so that $||x_{\cdot j}||^{2}=n$ and $\varepsilon\sim N(0_{n},\sigma^{2}I_{n})$ . The plan of experiments is presented in Table 1 with SNR meaning a Signal to Noise Ratio.

For every experiment the results were based on $N=1000$ simulation runs.

The output grid was chosen as $\lambda_{k}^{2}=c_{k}\cdot\log(p)\cdot\sigma^{2},k=1,\ldots,r,$ where $(c_{k})_{k=1}^{r}=(.25,.5,\ldots,7.5).$ We reported the mean model dimension (MD) that is $|\widehat{T_{k}}|,k=1,\ldots,r$ and the mean squared prediction error (PE) on new data set with 1000 observations equalling ${\|X\mathring{\beta}-X\widehat{\beta}_{\widehat{T_{k}}}\|^{2}}/({n\sigma^{2}})$ , where $\widehat{\beta}_{\widehat{T_{k}}}$ is the post-selection OLS estimator. Values of $(MD(k),PE(k))$ for the models chosen by GIC=GIC( $\lambda_{k}$ ) were calculated and averaged over simulations.

The results are presented in two first columns of Figure 1. The two vertical lines indicate models chosen using GIC with $c_{k}=2.5$ : the black one for SSnet and the red one for SparseNet. The blue vertical line denotes the true model dimension.

For logistic models we compared the performance of two algorithms: SSnet and MCP implemented in the R package ncvreg for the default value of $\gamma=3$ and 100 values of $\lambda$ . As for linear models, SSnet called the R package glmnet [10] to compute the Lasso estimators for 20 lambdas on a default log scale. We performed experiments very similar to those for linear models, changing only $n$ and the number of simulation runs to $N=500$ . The plan of experiments is shown in the Table 2. Random samples were generated according to the binomial distribution. We reported prediction error defined as misclassification frequency on new data set with 1000 observations. The results organized in a similar way as for the linear models are shown in columns 3–4 of Figure 1. The two vertical lines indicate models chosen using GIC with $c_{k}=2$ , the black one for SSnet and the red one for ncvreg.

Summarizing the results of the simulation study, one can observe that SSnet for linear models turned out to have equal or lower PE in almost all of the experimental setups. The differences are most visible in setups with autocorrelation structure with $\rho=0.7$ . The value $c_{k}=2.5$ in GIC usually gave satisfactory results. The mean execution time of SSnet was approximately 3 times faster than for SparseNet. SSnet for logistic regression gave higher accuracy as ncvreg, but execution time of SSnet was approximately 10 times longer than ncvreg. The value $c_{k}=2$ in GIC usually gave satisfactory results.

4.2 Real Data Sets

The methylation data set was described in [12]. It consist of the age of 656 human individuals together with values of phenotypic features such as gender and body mass index and of genetic features, which are methylation states of 485 577 CpG markers. Methylation was recorded as a fraction representing the frequency of methylation of a given CpG marker across the population of blood cells taken from a single individual. In our comparison we used only genetic features from which we extracted 193 870 most relevant CpGs according to onefold t-tests with Benjamini-Hochberg adjustment, FDR=.05. We compared the root mean squared errors (PE) and model dimensions (MD) for SSnet and SparseNet. To calculate PE we used 10-fold cross-validation. For each value of hyperparameter $c_{k}=.25,.5,\ldots,7.5$ values of $(MD(k),PE(k))$ for the models chosen by GIC=GIC( $\lambda_{k}$ ) were calculated and averaged over 10 folds. The results are presented in Figure 2. SparseNet yields a path of models for each value of parameter $\gamma=g1,\ldots,g9$ . We present results for $g1$ , corresponding to the Lasso, for $g9$ , close to the best subset, and for an intermediate value $g8$ in Figure 2. Remarkably, SSnet gives uniformly smaller PE than SparseNet for all $MD\geq 3$ . The two vertical lines indicate models chosen using GIC with $c_{k}=2.5$ : the black one for SSnet and the red one for SparseNet.

A logistic model was fitted to the breast cancer data described in [11] which concerns small, invasive carcinomas without axillary lymph node involvement to predict metastasis of small node-negative breast carcinoma. There were 168 patients: 111 with no event after diagnosis labeled as good, and 57 with early metastasis labeled as poor. The number of predictors in this data was 2905. We compared the mean errors of binary prediction (PE) and model dimensions (MD) for SSnet and ncvreg. For each of 80 hyperparameters $c_{k}=0.001;0.005;0.01;0.02,\ldots,$ $0.1,0.15,\ldots,1,\ldots 50$ values of $(MD(k);PE(k))$ for the models chosen by $GIC=GIC(\lambda_{k})$ were calculated and averaged over 10 folds. Again to calculate PE we used 10-fold cross-validation. The results are presented in Figure 3. It is hard to find the minimal of PE for ncvreg. If we increased the net of $c_{k}$ , maybe we would obtain a smaller PE for ncvreg, but the model would be significantly larger than for SSnet. The algorithms work comparably, but again SSnet was 3 times longer. The two vertical lines indicate models chosen using GIC with $c_{k}=2$ : the black one for SSnet and the red one for ncvreg.

5 Conclusions

In the paper we propose the SS algorithm which is an alternative method to TL and FCP of improving the Lasso. Our approach encompasses fundamental models for prediction of continuous as well as of binary responses and the main results are stated jointly for both of them. Its assumptions are stated in the most general form which allows proving exponential bound without obscuring the essence of the results and comparing the bounds for both models. By simplifying SOS to SS we were able to simplify reasoning used for SOS and then extend them from normal linear models to general predictive models.

We propose an algorithm SSnet, which is a generalization of the SS algorithm for general predictive models. Using a net of parameters, SSnet avoids the problem of choosing one specific $\lambda$ . The gap between theoretical results for SS and the SSnet algorithm is similar to the difference between theory for FCP and it implementations SparseNet or ncvreg. Numerical experiments reveal that for normal linear models SSnet is more accurate than SparseNet and three times faster, whereas for logistic models performance of SSnet is also better than the performance of ncvreg with computing times 3-10 times longer.

We have shown in simulations (dotted vertical lines in Figure 1) that predictively optimal $\lambda$ for normal linear models equals approximately $\sqrt{2.5\sigma^{2}\log p}$ , which is close to (18) and for logistic models is $\sqrt{2\log p}$ , which together with (19) suggests that $c\approx 1/4$ . The relations between the safest choice $\lambda$ discussed in Remark 1 and predictively optimal $\lambda$ are important applications of our theory.

In the package ,,DMRnet” one can find the SOSnet algorithm, which is a net version of SOS from [23]. SOSnet can be viewed as SSnet with additional step (,,Ordering”) based on refitting estimators and calculating Wald statistics. Results of our numerical experiments, which are not presented here, suggest that SOSnet can improve the quality of SSnet, especially in the linear model.

Appendix A Proofs and auxiliary results

In the following subsections we present proofs of the results stated in the paper.

A.1 Proof of Lemma 2

Let $Z=\varepsilon^{T}\nu/||\nu||$ and $a>0$ . From Markov’s inequality we obtain

[TABLE]

Minimizing the last expression w.r.t. $a$ gives part (i).

Let $\xi\sim N(0,I_{n})$ be a vector of iid standard normal errors independent of $\varepsilon$ . We have

[TABLE]

Thus part (ii) follows from a known formula for the moment generating function of the $\chi^{2}$ distribution.

From Markov’s inequality and the part (ii) of this lemma we have

[TABLE]

Thus, minimizing the last expression w.r.t $a$ we obtain part (iii). ∎

A.2 Proof of Lemma 1

Let

[TABLE]

and $\hat{\varepsilon}=y-\vec{\dot{\gamma}}(X\hat{\beta})$ . We have $\dot{\ell}(\hat{\beta})=-X^{T}\hat{\varepsilon}$ and from the Karush-Kuhn-Tucker (KKT) theorem we obtain equations

[TABLE]

Let $\Delta=\hat{\beta}-\mathring{\beta}$ and $\nu\in\mathbb{R}^{p}$ be such that ${\rm sgn}(\nu_{\bar{T}})={\rm sgn}(\Delta_{\bar{T}})$ . We have $\nu_{J}^{T}X_{J}^{T}\hat{\varepsilon}=\lambda|\nu_{J}|_{1}$ for $J\subseteq\bar{T}$ and consequently

[TABLE]

Then letting $\nu=\Delta_{J}$ for $J\subseteq\bar{T}$ we have $D(\nu)\leq 0$ . Moreover, for $\nu=\Delta$ we have from convexity of $g$ that

[TABLE]

Indeed, $D_{0}(\beta_{1},\beta_{2})=(\beta_{1}-\beta_{2})^{T}[\dot{g}(\beta_{1})-\dot{g}(\beta_{2})]$ is the symmetrized Bregman divergence [14]. Hence $(1-a)|\nu_{\bar{T}}|_{1}\leq(1+a)|\nu_{T}|_{1}$ . Thus, on ${\cal A}_{a}$ , $\Delta\in{\cal C}_{a}$ and from the definition of $\zeta_{a}$ we obtain using KKT again

[TABLE]

∎

A.3 Proof of Theorem 1

First, we will prove that ${\cal A}_{a}\subseteq\{\hat{T}_{TL}=T\}$ for ${\cal A}_{a}$ defined in (33). From Lemma 1 and assumptions we have on ${\cal A}_{a}$

[TABLE]

where we recall that $\Delta=\hat{\beta}-\mathring{\beta}.$ Thus using (37) twice we have for $j\in T$ and $k\not\in T$

[TABLE]

and it follows that ${\cal A}_{a}\subseteq\{\hat{T}_{TL}=T\}$ . Morover, the assumptions of this theorem imply

[TABLE]

Hence, using Lemma 2 (i) we easily obtain

[TABLE]

∎

A.4 Proof of Theorem 2

Let us observe that the consecutive steps of the SS algorithm constitute a decomposition of the selection error into two parts:

[TABLE]

Therefore, Theorem 2 follows easily from (40) and (44) below.

Having in mind that for given $a_{1}\in(1/2,1)$ we let $a_{2}=1-(1-\log(1-a_{1}))(1-a_{1})$ , $a_{3}=2-1/a_{1}$ and $a_{4}=\sqrt{a_{1}a_{2}}$ , by arguments similar to those in the proof of Theorem 1 we obtain ${\cal A}_{a_{4}}\subseteq\{T\in{\cal J}\}$ . Moreover, assumptions $0<c\leq 1$ and $\frac{2\sigma^{2}\log p}{a_{3}a_{2}a_{1}c}\leq\lambda^{2}$ imply

[TABLE]

As a result

[TABLE]

Now we bound $\mathbb{P}\big{(}T\in{\cal J},\hat{T}\neq T\big{)}$ . In Lemma 4 and 5, given below, we bound probability that in the second step of the SS algorithm GIC chooses a subset of the true set, i.e.

[TABLE]

or a superset of $T,$ i.e.

[TABLE]

These lemmas state that both components of the selection error set are included in the critical sets of the following form

[TABLE]

or

[TABLE]

where $J$ is such that $T\subset J,r(X_{J})=|J|\leq\bar{t}.$ We consider only supersets $J$ that $r(X_{J})=|J|,$ because GIC corresponding to the superset $J$ such that $r(X_{J})<|J|$ is larger than GIC corresponding to a superset $J_{1}$ such that $J_{1}\subsetneq J$ and $r(X_{J_{1}})=|J_{1}|.$

Let us define $\tau_{0}=\frac{1}{1-a_{1}},~{}\tau_{1}=\frac{(1-a_{1})c\lambda^{2}}{t\sigma^{2}}$ and $\tau_{2}=\frac{a_{1}c\lambda^{2}}{\sigma^{2}}$ . Under our assumptions we have $2<\tau_{0}<\tau_{1}<\tau_{2}$ . Let $f_{2}(\tau)=1-(1+\log\tau)/\tau$ for $\tau>1$ . Of course $f_{2}$ is increasing, $f_{2}(1)=0$ and $f_{2}(\tau)\to 1$ for $\tau\to\infty$ . Consequently $a_{2}=f_{2}(\tau_{0})<f_{2}(\tau_{1})<f_{2}(\tau_{2})$ , which means that

[TABLE]

From Lemma 4, Lemma 5, Lemma 2 (iii) and (43) we get

[TABLE]

Using $\exp(d)-1\leq\log(2)^{-1}d$ for $0\leq d\leq\log(2)$ and the fact that probability is not greater than $1$ we obtain

[TABLE]

For $a_{3}\in(0,1)$ , assumption $\frac{2\sigma^{2}\log p}{a_{3}a_{2}a_{1}c}\leq\lambda^{2}$ implies $-a_{2}\tau_{2}+2\log p\leq-(1-a_{3})a_{2}\tau_{2}$ , therefore

[TABLE]

that finishes the proof of Theorem 2. ∎

Before we state Lemma 4 and 5 we introduce few notations. For $k=1,\ldots,t-1$ and $\delta_{k}$ in (5) we define

[TABLE]

First, we prove the following lemma which will be used in the proof of Lemma 4.

Lemma 3.

For $b\in(0,1)$ we have

[TABLE]

Proof.

For $\beta_{T}\in\partial B_{k}$ from assumption (11), Schwartz inequality and properties of the orthogonal projection $H_{T}$ we get

[TABLE]

Since the last expression does not depend on $\beta_{T}$ , we have for $b\in(0,1)$

[TABLE]

Let us notice that for $J\subset T$ such that $|T\setminus J|=k$ we have $||X_{T}(\widehat{\beta}_{J}^{ML}-\mathring{\beta}_{T})||^{2}\geq\delta_{k}$ , so $\widehat{\beta}_{J}^{ML}\notin\text{int}(B_{k})$ . Since $\mathring{\ell}$ is convex and $\mathring{\ell}(\mathring{\beta}_{T})=0$ we obtain ${\cal L}_{k}(b)\supseteq{\cal E}_{k}(bc\delta_{k}/2)$ . ∎

Lemma 4.

If for $a\in(0,1)$ we have $\lambda^{2}<c\delta\big{/}\big{(}1+\sqrt{2a}\big{)}^{2}$ , then $\{T\in{\cal J},\hat{T}\subset T\}\subseteq{\cal C}_{T}(ac\lambda^{2}).$

Proof.

From assumption of this lemma $\lambda^{2}<c\delta$ , so $b_{k}:=\lambda^{2}k/(c\delta_{k})<1$ . Hence for $k=1,\ldots,t-1$

[TABLE]

Moreover, if $a\in(0,1)$ then

[TABLE]

because $ab\leq(1-b)^{2}/4$ for $b=\lambda^{2}/(c\delta)$ . The last inequality is true, if

[TABLE]

Indeed, it is easy to check that (48) follows from the assumption as

[TABLE]

Finally from (46), Lemma 3 and (47) we obtain, respectively

[TABLE]

∎

Lemma 5.

For $a\in(0,1)$ we have

[TABLE]

Proof.

For $J$ such that $J\supset T,r(X_{J})=|J|\leq\bar{t}$ define $W_{J}=X_{J}(\mathring{\beta}_{J}-\hat{\beta}_{J}^{ML})$ and $m=|J\setminus T|$ . Notice that the event $\{\ell_{J}-\ell_{T}\leq-m\lambda^{2}/2\}$ can be decomposed as

[TABLE]

where $B_{J}=\{\beta_{J}:||X_{J}(\mathring{\beta}-\beta_{J})||^{2}\leq\delta_{t-1}\}.$ For $\hat{\beta}_{J}^{ML}\in B_{J}$ we have from assumption (11) and properties of orthogonal projection $H_{J}$

[TABLE]

so $\{\ell_{J}-\ell_{T}\leq-m\lambda^{2}/2,\hat{\beta}_{J}^{ML}\in B_{J}\}\subseteq\{\varepsilon^{T}H_{J}\varepsilon\geq mc\lambda^{2}\}$ . The second event in the sum (50) is obviously contained in $\{\hat{\beta}_{J}^{ML}\notin B_{J}\}.$ Now, we show that this event is also contained in $\{\varepsilon^{T}H_{J}\varepsilon\geq mc\lambda^{2}\}$ . To do it, we use argumentation similar to van de Geer, [28]. Thus, we define

[TABLE]

The random vector $\tilde{\beta}_{J}$ belongs to $B_{J},$ because

[TABLE]

Using convexity of the loss function $\ell$ we have

[TABLE]

Using this fact as well as assumption (11), Schwartz inequality and properties of the orthogonal projection $H_{J}$ we get

[TABLE]

It gives us

[TABLE]

Therefore, we obtain

[TABLE]

From assumptions of Theorem 2 we know that $mc\lambda^{2}\leq\frac{c^{2}\delta_{t-1}}{16}\>,$ that gives us $\{\hat{\beta}_{J}^{ML}\notin B_{J}\}\subset\{\varepsilon^{T}H_{J}\varepsilon\geq mc\lambda^{2}\}$ .

On the other hand, $\varepsilon^{T}H_{J}\varepsilon=\varepsilon^{T}H_{T}\varepsilon+\varepsilon^{T}(H_{J}-H_{T})\varepsilon$ , hence we obtain for $\tau>0$ and $a\in(0,1)$

[TABLE]

Finally

[TABLE]

∎

A.5 Proof of Theorem 3

The idea of the proof is similar to the proof of Theorem 2. However, here we consider the linear model with the subgaussian noise that allows us to apply simpler and better arguments than in the general case of GLM. Indeed, while working with the event $\{T\in{\cal J},\hat{T}\subset T\}$ we can use the result for the scalar product (Lemma 2 (i)) instead of the one for quadratic forms (Lemma 2 (ii)). Besides, when considering $\{T\in{\cal J},\hat{T}\supset T\}$ we need to upper bound probability of $\{\varepsilon^{T}H_{J}\varepsilon-\varepsilon^{T}H_{T}\varepsilon\geq|J\setminus T|\lambda^{2}\}$ instead of $\{\varepsilon^{T}H_{J}\varepsilon\geq|J\setminus T|\lambda^{2}\}.$ Obviously, probability of the former event is smaller.

As previously, we work with the error decomposition (39). Probability of the event $\{T\not\in{\cal J}\}$ can be bounded in the same way as in the proof of Theorem 2, if we put $a_{1}=\sqrt{a_{2}}=a$ . However, working with subsets and supersets of $T$ is different.

We start with subsets of $T.$ For $J\subset T$ we denote

[TABLE]

and

[TABLE]

Notice that $\ell_{J}+|J|\lambda^{2}/2\leq\ell_{T}+t\lambda^{2}/2$ is equivalent to

[TABLE]

Whence

[TABLE]

where $k=t-|J|$ and $\delta_{T\ominus J}\geq\delta_{k}\geq k\delta$ by the definiton of $\delta.$ Thus, from Lemma 2 (i) we have

[TABLE]

For the penultimate inequality we use again the inequality $\exp(c)-1\leq log(2)^{-1}c$ for $0\leq c\leq log(2)$ and the fact that probability is not greater than $1$ . For the last inequality above we used

[TABLE]

which is satisfied for

[TABLE]

where

[TABLE]

It is easy to check that (52) follows from the assumption $\lambda^{2}\leq\delta/(2+2a)^{2}$ as

[TABLE]

Now, we consider supersets of $T.$ As previously we consider only such supersets $J$ that $r(X_{J})=|J|\leq\bar{t}.$ We have $\ell_{T}-\ell_{J}=\varepsilon^{T}(H_{J}-H_{T})\varepsilon/2$ . Using Lemma 2 (iii), as in the proof of Theorem 2, we obtain

[TABLE]

Finally, we get the following inequalities

[TABLE]

∎

A.6 Proof of Theorem 4

In section 3 we state Theorem 4 with simplified constants $K_{i}.$ Its general form is given below:

Fix $a_{1},a_{2}\in(0,1).$ Assume that (25), (26), (27) and

[TABLE]

Then

[TABLE]

The main difference between proofs of Theorems 2 and 4 is that here we investigate properties of the expected loss $\mathbb{E}\ell(\cdot)$ instead of the loss $\ell(\cdot).$ It relates to the fact that the former function is more regular. To be more precise, in many parts of the proof we work with expressions of the form

[TABLE]

where $\tilde{\beta}$ is a random vector contained in some ball $\tilde{B}.$ Clearly, we can transform (55) into

[TABLE]

The first term in (56) can be handled using the regularity of the function $\mathbb{E}\ell(\cdot)$ given in assumption (25), while to work with the latter we apply assumptions (26) or (27).

As previously, we work with the error decomposition (39) and start with bounding probability of $\{T\notin\mathcal{J}\}.$ Take $a_{1}\in(0,1)$ and define $r^{*}=\frac{4\lambda}{(1-a_{1})c_{1}\kappa_{a_{1}}}\>.$ From (54) we have $r^{*}\leq\mathring{\beta}_{min}.$ Consider the following event

[TABLE]

which has probability not less than

[TABLE]

Indeed, if we take $z=\frac{a_{1}\lambda}{2K_{1}L\sqrt{\log(2p)}}$ and $r=r^{*}$ in (26), then using again (54) we obtain $z\geq 1$ and (58) follows.

Similarly to the proof of Bühlmann and van de Geer, [4, Theorem 6.4], we can show that the event (57) implies that

[TABLE]

From (59) and (54) we can obtain separability of the Lasso. Namely, for each $j\in T,k\notin T$ that

[TABLE]

Now we consider probability that GIC chooses a submodel of $T$ in the second step of the SS algorithm. Using the definition (45) we obtain

[TABLE]

Fix $k$ and $J$ such that $J\subset T,|T\setminus J|=k.$ We take arbitrary $\beta_{T}\in\partial B_{2,T}\left(\sqrt{\delta_{k}}\right)$ and recall that $\mathring{\ell}(\beta)=\ell(\beta)-\ell(\mathring{\beta}).$ It is clear that $\delta_{t-1}\geq\delta_{k},$ so we can use (25) for $\beta_{T}$ and obtain

[TABLE]

Since the last expression does not depend on $\beta_{T}$ we have

[TABLE]

Proceeding as in the proof of Lemma 3 we obtain

[TABLE]

If we take $b_{k}=\frac{k\lambda^{2}}{c_{2}\delta_{k}}\>,$ then

[TABLE]

From (54) we have (as in the proof of Lemma 4)

[TABLE]

Take

[TABLE]

which is not smaller than one by (60) and (54). Thus, using (27) with $r=\sqrt{\delta_{k}}$ and (60) we have

[TABLE]

Therefore,

[TABLE]

because $\lambda^{2}\geq\frac{K_{3}^{2}tL^{2}}{a_{2}c_{2}}$ by (53).

Next, we consider choosing a supermodel of $T$ by GIC. Fix the set $J\supset T,r(X_{J})=|J|\leq\bar{t}$ and denote $m=|J\setminus T|.$ Therefore, we have

[TABLE]

because $\ell(\hat{\beta}_{T})\leq\ell(\mathring{\beta})$ and $\mathbb{E}\ell(\hat{\beta}_{J})\geq\mathbb{E}\ell(\mathring{\beta}).$ We will prove that $D_{J}$ is contained in

[TABLE]

for $r=2\lambda\sqrt{\frac{m}{c_{2}}}\>.$ The event $D_{J}$ can be decomposed as

[TABLE]

It is clear that the first event on the right-hand side of (62) is contained in (61) and the second one is contained in $\{\hat{\beta}_{J}\notin B_{2,J}(r)\}.$ Now we prove that $\{\hat{\beta}_{J}\notin B_{2,J}(r)\}$ is also contained in (61). Our argumentation is similar to the proof of Lemma 5, but here we take $u=\frac{r}{r+d_{J}}.$ Therefore, we obtain

[TABLE]

Moreover, the following bound

[TABLE]

is implied by (25), because $\tilde{\beta}_{J}\in B_{2,J}(r)\subset B_{J}.$ This inclusion comes from (54), because

[TABLE]

Taking (63) and (64) we have

[TABLE]

and $\frac{c_{2}r^{2}}{8}=m\lambda^{2}/2$ by the choice of $r.$ Therefore, we have just proved that $D_{J}$ is contained in (61). To finish the proof we need sharp enough upper bound of (61). It can be obtained using (27) with $z=\frac{\lambda\sqrt{mc_{2}}}{4K_{3}L\sqrt{|J|}}$ that gives us

[TABLE]

Notice that $z\geq 1,$ because using (54)

[TABLE]

Thus, we obtain the following bound on probability of choosing a supermodel

[TABLE]

where we use the fact that $\frac{K_{4}\lambda^{2}c_{2}}{16K_{3}^{2}L^{2}}\geq 2\log(p-t)$ from (54) and two inequalities ${{p-t}\choose{m}}\leq\frac{(p-t)^{m}}{m!}$ and $\exp(b)-1\leq(\log 2)^{-1}b$ for $b\in(0,\log 2).$ ∎

Bibliography40

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Bartlett et al., [2006] Bartlett, P. L., Jordan, M. I., and Mc Auliffe, J. D. (2006). Convexity, classification and risk bounds. Journal of the American Statistical Association , 101:138–156.
2Bickel et al., [2009] Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics , 37:1705–1732.
3Breheny and Huang, [2011] Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics , 5:232–253.
4Bühlmann and van de Geer, [2011] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-dimensional Data . Springer, New York.
5Efron et al., [2004] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics , 32:407–499.
6Fan and Li, [2001] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association , 96:1348–1360.
7Fan et al., [2014] Fan, J., Xue, L., and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics , 42:819–849.
8Fan and Tang, [2013] Fan, Y. and Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society. Series B (Statistical Methodology) , 75:531–552.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Improving Lasso for Model Selection and Prediction

Abstract

1 Introduction

2 Subgaussian Generalized Linear Models

2.1 Models

2.2 Fitting Algorithms

2.3 A Selection Error Bound for TL

Lemma 1**.**

Theorem 1**.**

2.4 A Selection Error Bound for SS

Theorem 2**.**

Corollary 1**.**

Remark 1**.**

Remark 2**.**

2.5 A Selection Error Bound for SS in Subgaussian Linear Models

Theorem 3**.**

Remark 3**.**

Remark 4**.**

2.6 Exponential Bounds for Subgaussian Vectors

Lemma 2**.**

3 Extension to General Convex Contrasts

Theorem 4**.**

Remark 5**.**

Remark 6**.**

Remark 7**.**

Remark 8**.**

4 Experiments

4.1 Simulated Data

4.2 Real Data Sets

5 Conclusions

Appendix A Proofs and auxiliary results

A.1 Proof of Lemma 2

A.2 Proof of Lemma 1

A.3 Proof of Theorem 1

A.4 Proof of Theorem 2

Lemma 3**.**

Proof.

Lemma 4**.**

Proof.

Lemma 5**.**

Proof.

A.5 Proof of Theorem 3

A.6 Proof of Theorem 4

Lemma 1.

Theorem 1.

Theorem 2.

Corollary 1.

Remark 1.

Remark 2.

Theorem 3.

Remark 3.

Remark 4.

Lemma 2.

Theorem 4.

Remark 5.

Remark 6.

Remark 7.

Remark 8.

Lemma 3.

Lemma 4.

Lemma 5.