Adaptive Huber Regression
Qiang Sun, Wenxin Zhou, and Jianqing Fan

TL;DR
This paper introduces adaptive Huber regression, a robust method for handling heavy-tailed data and outliers in high-dimensional settings, with theoretical guarantees and practical applications.
Contribution
It develops an adaptive robust regression framework that adjusts to data moments, providing sharp phase transition results and extending to heavy-tailed predictors and noise.
Findings
Achieves sub-Gaussian deviation bounds without sub-Gaussian data assumptions for δ ≥ 1
Demonstrates a smooth, optimal phase transition in estimation accuracy based on data tail heaviness
Shows improved robustness and prediction in heavy-tailed genetic data applications
Abstract
Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded -th moment for any . We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when , the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime . Furthermore, this transition is smooth and…
| Noise | AHR | OLS | ||
| mean | std | mean | std | |
| Normal | 0.566 | 0.189 | 0.567 | 0.191 |
| Student’s | 0.806 | 0.651 | 1.355 | 2.306 |
| Log-normal | 3.917 | 3.740 | 8.529 | 13.679 |
| Method | MAE | Size | Selected Genes |
| Lasso | 7.64 | 42 | FBLIM1, MT1E, EDN2, F3, FAM102B, S100A14, LAMB3, EPCAM, FN1, TM4SF1, UCHL1, NMU, ANXA3, PLAC8, SPP1, TGFBI, CD74, GPX3, EDN1, CPVL, NPTX2, TES, AKR1B10, CA2, TSPYL5, MAL2, GDA, BAMBI, CST6, ADAMTS15, DUSP6, BTG1, LGALS3, IFI27, MEIS2, TOX3, KRT23, BST2, SLPI, PLTP, XIST, NGFRAP1 |
| AHuber | 6.74 | 11 | MT1E, ARHGAP29, CPCAM, VAMP8, MALL, ANXA3, MAL2, BAMBI, LGALS3, KRT19, TFF3 |
| TAHuber | 5.76 | 7 | MT1E, ARHGAP29, MALL, ANXA3, MAL2, BAMBI, KRT19 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Bayesian Methods and Mixture Models · Advanced Statistical Methods and Models
Adaptive Huber Regression††thanks: Qiang Sun is Assistant Professor, Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada (E-mail: [email protected]).
Wen-Xin Zhou is Assistant Professor, Department of Mathematics, University of California, San Diego, La Jolla, CA 92093 (E-mail: [email protected]). Jianqing Fan is Honorary Professor, School of Data Science, Fudan University, Shanghai, China and Frederick L. Moore ’18 Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, NJ 08544 (E-mail: [email protected]).
Qiang Sun, Wen-Xin Zhou, and Jianqing Fan
Abstract
Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded -th moment for any . We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when , the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime . Furthermore, this transition is smooth and optimal. We extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive.
Keywords: Adaptive Huber regression, bias and robustness tradeoff, finite-sample inference, heavy-tailed data, nonasymptotic optimality, phase transition.
1 Introduction
Modern data acquisitions have facilitated the collection of massive and high dimensional data with complex structures. Along with holding great promises for discovering subtle population patterns that are less achievable with small-scale data, big data have introduced a series of new challenges to data analysis both computationally and statistically (Loh and Wainwright, 2015; Fan et al., 2018). During the last two decades, extensive progress has been made towards extracting useful information from massive data with high dimensional features and sub-Gaussian tails111A random variable is said to have sub-Gaussian tails if there exists constants and such that for any . (Tibshirani, 1996; Fan and Li, 2001; Efron et al., 2004; Bickel, Ritov and Tsybakov, 2009). We refer to the monographs, Bühlmann and van de Geer (2011) and Hastie, Tibshirani and Wainwright (2015), for a systematic coverage of contemporary statistical methods for high dimensional data.
The sub-Gaussian tails requirement, albeit being convenient for theoretical analysis, is not realistic in many practical applications since modern data are often collected with low quality. For example, a recent study on functional magnetic resonance imaging (fMRI) (Eklund, Nichols and Knutsson, 2016) shows that the principal cause of invalid fMRI inferences is that the data do not follow the assumed Gaussian shape, which speaks to the need of validating the statistical methods being used in the field of neuroimaging. In a microarray data example considered in Wang, Peng and Li (2015), it is observed that some gene expression levels have heavy tails as their kurtosises are much larger than 3, despite of the normalization methods used. In finance, the power-law nature of the distribution of returns has been validated as a stylized fact (Cont, 2001). Fan et al. (2016) argued that heavy-tailed distribution is a stylized feature for high dimensional data and proposed a shrinkage principle to attenuate the influence of outliers. Standard statistical procedures that are based on the method of least squares often behave poorly in the presence of heavy-tailed data222We say a random variable has heavy tails if decays to zero polynomially in as . (Catoni, 2012). It is therefore of ever-increasing interest to develop new statistical methods that are robust against heavy-tailed errors and other potential forms of contamination.
In this paper, we first revisit the robust regression that was initiated by Peter Huber in his seminal work Huber (1973). Asymptotic properties of the Huber estimator have been well studied in the literature. We refer to Huber (1973), Yohai and Maronna (1979), Portnoy (1985), Mammen (1989) and He and Shao (1996, 2000) for an unavoidably incomplete overview. However, in all of the aforementioned papers, the robustification parameter is suggested to be set as fixed according to the 95% asymptotic efficiency rule. Thus, this procedure can not estimate the model-generating parameters consistently when the sample distribution is asymmetric.
From a nonasymptotic perspective (rather than an asymptotic efficiency rule), we propose to use the Huber regression with an adaptive robustification parameter, which is referred to as the adaptive Huber regression, for robust estimation and inference. Our adaptive procedure achieves the nonasymptotic robustness in the sense that the resulting estimator admits exponential-type concentration bounds when only low-order moments exist. Moreover, the resulting estimator is also an asymptotically unbiased estimate for the parameters of interest. In particular, we do not impose symmetry and homoscedasticity conditions on error distributions, so that our problem is intrinsically different from median/quantile regression models, which are also of independent interest and serve as important robust techniques (Koenker, 2005).
We made several major contributions towards robust modeling in this paper. First and foremost, we establish nonasymptotic deviation bounds for adaptive Huber regression when the error variables have only finite -th moments. By providing a matching lower bound, we observe a sharp phase transition phenomenon, which is in line with that discovered by Devroye et al. (2016) for univariate mean estimation. Second, a similar phase transition for regularized adaptive Huber regression is established in high dimensions. By defining the effective dimension and effective sample size, we present nonasymptotic results under the two different regimes in a unified form. Last, by exploiting the localized analysis developed in Fan et al. (2018), we remove the artificial bounded parameter constraint imposed in previous works; see Loh and Wainwright (2015) and Fan, Li and Wang (2017). In the supplementary material, we present a nonasymptotic Bahadur representation for the adaptive Huber estimator when , which provides a theoretical foundation for robust finite-sample inference.
The rest of the paper proceeds as follows. The rest of this section is devoted to related literature. In Section 2, we revisit the Huber loss and robustification parameter, followed by the proposal of adaptive Huber regression in both low and high dimensions. We sharply characterize the nonasymptotic performance of the proposed estimators in Section 3. We describe the algorithm and implementation in Section 5. Section 6 is devoted to simulation studies and a real data application. In Section 4, we extend the methodology to allow possibly heavy-tailed covariates/predictors. All the proofs are collected in the supplemental material.
1.1 Related Literature
The terminology “robustness” used in this paper describes how stable the method performs with respect to the tail-behavior of the data, which can be either sub-Gaussian/sub-exponential or Pareto-like (Delaigle, Hall and Jin, 2011; Catoni, 2012; Devroye et al., 2016). This is different from the conventional perspective of robust statistics under Huber’s -contamination model (Huber, 1964), for which a number of depth-based procedures have been developed since the groundbreaking work of John Tukey (Tukey, 1975). Significant contributions have also been made in Liu (1990), Liu, Parelius, and Singh (1999), Zuo and Serfling (2000), Mizera (2002) and Mizera and Müller (2004). We refer to Chen, Gao and Ren (2018) for the most recent result and a literature review concerning this problem.
Our main focus is on the conditional mean regression in the presence of heavy-tailed and asymmetric errors, which automatically distinguishes our method from quantile-based robust regressions (Koenker, 2005; Belloni and Chernozhukov, 2011; Wang, 2013; Fan, Fan and Barut, 2014; Zheng, Peng and He, 2015). In general, quantile regression is biased towards estimating the mean regression coefficient unless the error distributions are symmetric around zero. Another recent work that is related to ours is Alquier, Cottett and Lecué (2017). They studied a general class of regularized empirical risk minimization procedures with a particular focus on Lipschitz losses, which includes the quantile, hinge and logistic losses. Different from all these work, our goal is to estimate the mean regression coefficients robustly. The robustness is witnessed by a nonasymptotic analysis: the proposed estimators achieve sub-Gaussian deviation bounds when the regression errors have only finite second moments. Asymptotically, our proposed estimators are fully efficient: they achieve the same efficiency as the ordinary least squares estimators.
An important step towards estimation under heavy-tailedness has been made by Catoni (2012), whose focus is on estimating a univariate mean. Let be a real-valued random variable with mean and variance , and assume that are independent and identically distributed (i.i.d.) from . For any prespecified exception probability , Catoni constructs a robust mean estimator that deviates from the true mean logarithmically in , that is,
[TABLE]
while the empirical mean deviates from the true mean only polynomially in , namely subGaussian tails versus Cauchy tail in terms of . Further, Devroye et al. (2016) developed adaptive sub-Gaussian estimators that are independent of the prespecified exception probability. Beyond mean estimation, Brownlees, Joly and Lugosi (2015) extended Catoni’s idea to study empirical risk minimization problems when the losses are unbounded. Generalizations of the univariate results to those for matrices, such as the covariance matrices, can be found in Catoni (2016), Minsker (2018), Giulini (2017) and Fan, Li and Wang (2017). Fan, Li and Wang (2017) modified Huber’s procedure (Huber, 1973) to obtain a robust estimator, which is concentrated around the true mean with exponentially high probability in the sense of (1), and also proposed a robust procedure for sparse linear regression with asymmetric and heavy-tailed errors.
Notation: We fix some notations that will be used throughout this paper. For any vector and , is the norm. For any vectors , we write . Moreover, we let denote the number of nonzero entries of , and set . For two sequences of real numbers and , denotes for some constant independent of , if , and if and . For two scalars, we use to denote the minimum of and . If is an matrix, we use to denote its spectral norm, defined by , where is the unit sphere in . For an matrix , we use and to denote the maximum and minimum eigenvalues of , respectively. For two matrices and , we write if is positive semi-definite. For a function , we use to denote its gradient vector as long as it exists.
2 Methodology
We consider i.i.d. observations that are generated from the following heteroscedastic regression model
[TABLE]
Assuming that the second moments are bounded (), the standard ordinary least squares (OLS) estimator, denoted by , admits a suboptimal polynomial-type deviation bound, and thus does not concentrate around tightly enough for large-scale simultaneous estimation and inference. The key observation that underpins this suboptimality of the OLS estimator is the sensitivity of quadratic loss to outliers (Huber, 1973; Catoni, 2012), while the Huber regression with a fixed tuning constant may lead to nonnegligible estimation bias. To overcome this drawback, we propose to employ the Huber loss with an adaptive robustification parameter to achieve robustness and (asymptotic) unbiasedness simultaneously. We begin with the definitions of the Huber loss and the corresponding robustification parameter.
Definition 1** (Huber Loss and Robustification Parameter).**
The Huber loss (Huber, 1964) is defined as
[TABLE]
where is referred to as the robustification parameter that balances bias and robustness (Fan, Li and Wang, 2017).
The loss function is quadratic for small values of , and becomes linear when exceeds in magnitude. The parameter therefore controls the blending of quadratic and losses, which can be regarded as two extremes of the Huber loss with and , respectively. Comparing with the least squares, outliers are down weighted in the Huber loss. We will use the name, adaptive Huber loss, to emphasize the fact that the parameter should adapt to the sample size, dimension and moments for a better tradeoff between bias and robustness. This distinguishes our framework from the classical setting. As is needed to reduce the bias when the error distribution is asymmetric, this loss is also called the RA-quadratic (robust approximation to quadratic) loss in Fan, Li and Wang (2017).
Define the empirical loss function for . The Huber estimator is defined through the following convex optimization problem:
[TABLE]
In low dimensions, under the condition that for some , we will prove that with (the first factor is kept in order to show its explicit dependence on the moment) achieves the tight upper bound . The phase transition at can be easily observed (see Figure 1). When higher moments exist (), robustification leads to a sub-Gaussian-type deviation inequality in the sense of (1).
In the high dimensional regime, we consider the following regularized adaptive Huber regression with a different choice of the robustification parameter:
[TABLE]
where and with Let be the size of the true support . We will show that the regularized Huber estimator achieves an upper bound that is of the order for estimating in -error with high probability.
To unify the nonasymptotic upper bounds in the two different regimes, we define the effective dimension, , to be in low dimensions and in high dimensions. In other words, denotes the number of nonzero parameters of the problem. The effective sample size, , is defined as and in low and high dimensions, respectively. We will establish a phase transition: when , the proposed estimator enjoys a sub-Gaussian concentration, while it only achieves a slower concentration when . Specifically, we show that, for any , the proposed estimators with achieve the following tight upper bound, up to logarithmic factors:
[TABLE]
This finding is summarized in Figure 1.
3 Nonasymptotic Theory
3.1 Adaptive Huber Regression with Increasing Dimensions
We begin with the adaptive Huber regression in the low dimensional regime. First, we provide an upper bound for the estimation bias of Huber regression. We then establish the phase transition by establishing matching upper and lower bounds on the -error. The analysis is carried out under both fixed and random designs. The results under random designs are provided in the supplementary material. We start with the following regularity condition.
Condition 1**.**
The empirical Gram matrix is nonsingular. Moreover, there exist constants and such that .
For any , given in (3) is natural -estimator of
[TABLE]
where the expectation is taken over the regression errors. We call the Huber regression coefficient, which is possibly different from the vector of true parameters . The estimation bias, measured by , is a direct consequence of robustification and asymmetric error distributions. Heuristically, choosing a sufficiently large reduces bias at the cost of losing robustness (the extreme case of corresponds to the least squares estimator). Our first result shows how the magnitude of affects the bias . Recall that with .
Proposition 1**.**
Assume Condition 1 holds and that is finite for some . Then, the vector of Huber regression coefficients satisfies
[TABLE]
provided for or for , where
The total estimation error can therefore be decomposed into two parts
[TABLE]
where the approximation bias is of order . A large reduces the bias but compromises the degree of robustness. Thus an optimal estimator is the one with diverging at a certain rate to achieve the optimal tradeoff between estimation error and approximation bias. Our next result presents nonasymptotic upper bounds on the -error with an exponential-type exception probability, when is properly tuned. Recall that for any .
Theorem 1** (Upper Bound).**
Assume Condition 1 holds and for some . Let and assume for some depending only on and . Then, for any and , the estimator with satisfies the bound
[TABLE]
with probability at least .
Remark 1**.**
It is worth mentioning that the proposed robust estimator depends on the unknown parameter . Adaptation to the unknown moment is indeed another important problem. In Section 6, we suggest a simple cross-validation scheme for choosing with desirable numerical performance. A general adaptive construction of can be obtained via Lepski’s method (Lepski, 1991), which is more challenging due to unspecified constants. In the supplementary material, we discuss a variant of Lepski’s method and establish its theoretical guarantee.
Remark 2**.**
We do not assume to be a constant, and hence the proposed method accommodates heteroscedastic regression models. For example, can take the form of , where is a positive function, and are random variables satisfying and .
Remark 3**.**
We need the scaling condition to go roughly as under fixed designs. With random designs, we show that the scaling condition can be relaxed to . Details are given in the supplementary material.
Theorem 1 indicates that, with only bounded -th moment, the adaptive Huber estimator achieves the upper bound , up to a logarithmic factor, by setting . A natural question is whether the upper bound in (8) is optimal. To address this, we provide a matching lower bound up to a logarithmic factor. Let be the class of all distributions on whose -th absolute central moment equals Let be the design matrix and
Theorem 2** (Lower Bound).**
Assume that the regression errors are i.i.d. from a distribution in with . Suppose there exists a such that for some . Then, for any and any estimator possibly depending on , we have
[TABLE]
where .
Theorem 2 reveals that root- consistency with exponential concentration is impossible when . It widens the phenomenon observed in Theorem 3.1 in Devroye et al. (2016) for estimating a mean. In addition to the eigenvalue assumption, we need to assume that there exists a such that the minimum angle between and is non-vanishing. This assumption comes from the intuition that the linear subspace spanned by is at most of rank and thus cannot span the whole space This assumption naturally holds in the univariate case where and we can take and . More generally, . Taking for an example, since , we can assume that each coordinate of is positive. In this case, , which is strictly positive with probability one, assuming is drawn from a continuous distribution.
Together, the upper and lower bounds show that the adaptive Huber estimator achieves near-optimal deviations. Moreover, it indicates that the Huber estimator with an adaptive exhibits a sharp phase transition: when , converges to at the parametric rate , while only a slower rate of order is available when the second moment does not exist.
Remark 4**.**
We provide a parallel analysis under random designs in the supplementary material. Beyond the nonasymptotic deviation bounds, we also prove a nonasymptotic Bahadur representation, which establishes a linear approximation of the nonlinear robust estimator. This result paves the way for future research on conducting statistical inference and constructing confidence sets under heavy-tailedness. Additionally, the proposed estimator achieves full efficiency: it is as efficient as the ordinary least squares estimator asymptotically, while the robustness is characterized via nonasymptotic performance.
3.2 Adaptive Huber Regression in High Dimensions
In this section, we study the regularized adaptive Huber estimator in high dimensions where is allowed to grow with the sample size exponentially. The analysis is carried out under fixed designs, and results for random designs are again provided in the supplementary material. We start with a modified version of the localized restricted eigenvalue introduced by Fan et al. (2018). Let denote the Hessian matrix. Recall that is the true support set with .
Definition 2** (Localized Restricted Eigenvalue, LRE).**
The localized restricted eigenvalue of is defined as
[TABLE]
where is a local -cone.
The LRE is defined in a local neighborhood of under -norm. This facilitates our proof, while Fan et al. (2018) use the -norm.
Condition 2**.**
satisfies the localized restricted eigenvalue condition , that is, for some constants .
The condition above is referred to as the LRE condition (Fan et al., 2018). It is a unified condition for studying generalized loss functions, whose Hessians may possibly depend on . For Huber loss, Condition 2 also involves the observation noise. The following definition concerns the restricted eigenvalues of instead of .
Definition 3** (Restricted Eigenvalue, RE).**
The restricted maximum and minimum eigenvalues of are defined respectively as
[TABLE]
where .
Condition 3**.**
satisfies the restricted eigenvalue condition , that is, for some constants .
To make Condition 2 on practically useful, in what follows, we show that Condition 3 implies Condition 2 with high probability. As before, we write and .
Lemma 1**.**
Condition 3 implies Condition 2 with high probability: if for some and , then it holds with probability at least that, provided and , where are constants depending only on .
With the above preparations in place, we are now ready to present the main results on the adaptive Huber estimator in high dimensions.
Theorem 3** (Upper Bound in High Dimensions).**
Assume Condition 3 holds with , for some . For any and , let , , and . Then with probability at least , the -regularized Huber estimator defined in (4) satisfies
[TABLE]
as long as for some depending only on . In particular, with for we have
[TABLE]
with probability at least .
The above result demonstrates that the regularized Huber estimator with an adaptive robustification parameter converges at the rate with overwhelming probability. Provided the observation noise has finite variance, the proposed estimator performs as well as the Lasso with sub-Gaussian errors. We advocate the adaptive Huber regression method since sub-Gaussian condition often fails in practice (Wang, Peng and Li, 2015; Eklund, Nichols and Knutsson, 2016).
Remark 5**.**
As pointed out by a reviewer, if one pursues a sparsity-adaptive approach, such as the SLOPE (Bogdan et al., 2015; Bellec et al., 2018), the upper bound on -error can be improved from to . With heavy-tailed observation noise, it is interesting to investigate whether this sharper bound can be achieved by Huber-type regularized estimator. We leave this to future work as a significant amount of additional work is still needed. On the other hand, since and , scales the same as so long as for some .
Remark 6**.**
Analogously to the low dimensional case, here we impose the sample size scaling under fixed designs. In the supplementary material, we obtain minimax optimal -, - and prediction error bounds for with random designs under the scaling .
Finally, we establish a matching lower bound for estimating . Recall the definition of in Theorem 2.
Theorem 4** (Lower Bound in High Dimensions).**
Assume that are independent from some distribution in . Suppose that Condition 3 holds with and . Further assume that there exists a set with and such that for some . Then, for any and -sparse estimator possibly depending on , we have
[TABLE]
as long as .
Together, Theorems 3 and 4 show that the regularized adaptive Huber estimator achieves the optimal rate of convergence in -error. The proof, which is given in the supplementary material, involves constructing a sub-class of binomial distributions for the regression errors. Unifying the results in low and high dimensions, we arrive at the claim (5) and thus the phase transition in Figure 1.
4 Extension to Heavy-tailed Designs
In this section, we extend the idea of adaptive Huber regression described in Section 2 to the case where both the covariate vector and the regression error exhibit heavy tails. We focus on the high dimensional regime , where is sparse with . Observe that, for Huber regression, the linear part of the Huber loss penalizes the residuals, and therefore robustifies the quadratic loss in the sense that outliers in the response space (caused by heavy-tailed observation noise) are down weighted or removed. Since no robustification is imposed on the covariates, intuitively, the adaptive Huber estimator may not be robust against heavy-tailed covariates. In what follows, we modify the adaptive Huber regression to robustify both the covariates and regression errors.
To begin with, suppose we observe independent data from , which follows the linear model . To robustify , we define truncated covariates , where and is a tuning parameter. Then we consider the modified adaptive Huber estimator (see Fan et al. (2016) for a general robustification principle)
[TABLE]
where and is a regularization parameter.
Let be the true support of with sparsity , and denote by the Hessian matrix of the modified Huber loss. To investigate the deviation property of , we impose the following mild moment assumptions.
Condition 4**.**
(i) , and ; (ii) The covariate vector is independent of and satisfies .
We are now in place to state the main result of this section. Theorem 5 below demonstrates that the modified adaptive Huber estimator admits exponentially fast concentration when the convariates only have finite fourth moments, although at the cost of stronger scaling conditions.
Theorem 5**.**
Assume Condition 4 holds and let satisfy Condition 2 with , and . Then, the modified adaptive Huber estimator given in (11) satisfies, on the event \mathcal{E}(\tau,\varpi,\lambda)=\big{\{}\|(\nabla\mathcal{L}^{\varpi}_{\tau}(\bm{\beta}^{*}))_{{\mathcal{S}}}\|_{\infty}\leq\lambda/2\big{\}}, that
[TABLE]
For any , let the triplet satisfy
[TABLE]
where and . Then .
Remark 7**.**
Assume that the quantities , and are all bounded. Taking in (12), we see that achieves a near-optimal convergence rate of order when the parameters scale as
[TABLE]
We remark here that the theoretically optimal is different from that in the sub-Gaussian design case. See Theorem B.2 in the supplementary material.
5 Algorithm and Implementation
This section is devoted to computational algorithm and numerical implementation. We focus on the regularized adaptive Huber regression in (4), as (3) can be easily solved via the iteratively reweighted least squares method. To solve the convex optimization problem in (4), standard optimization algorithms, such as the cutting-plane or interior point method, are not scalable to large-scale problems.
In what follows, we describe a fast and easily implementable method using the local adaptive majorize-minimization (LAMM) principle (Fan et al., 2018). We say that a function majorizes at the point if
[TABLE]
To minimize a general function , a majorize-minimization (MM) algorithm initializes at , and then iteratively computes for . The objective value of such an algorithm decreases in each step, since
[TABLE]
As pointed out by Fan et al. (2018), the majorization requirement only needs to hold locally at when starting from . We therefore locally majorize in (4) at by an isotropic quadratic function
[TABLE]
where is a quadratic parameter such that . The isotropic form also allows a simple analytic solution to the subsequent majorized optimization problem:
[TABLE]
It can be shown that (14) is minimized at
[TABLE]
where is the soft-thresholding operator defined by . The simplicity of this updating rule is due to the fact that (14) is an unconstrained optimization problem.
To find the smallest such that , the basic idea of LAMM is to start from a relatively small isotropic parameter and then successfully inflate by a factor , say . If the solution satisfies , we stop and obtain , which makes the target value non-increasing. We then continue with the iteration to produce next solution until the solution sequence converges. A simple stopping criterion is for a sufficiently small , say . We refer to Fan et al. (2018) for a detailed complexity analysis of the LAMM algorithm.
6 Numerical Studies
6.1 Tuning Parameter and Finite Sample Performance
For numerical studies and real data analysis, in the case where the actual order of moments is unspecified, we presume the variance is finite and therefore choose robustification and regularization parameters as follows:
[TABLE]
where with serves as a crude preliminary estimate of , and the parameter controls the confidence level. We set for simplicity except for the phase transition plot. The constant and are chosen via 3-fold cross-validation from a small set of constants, say .
We generate data from the linear model
[TABLE]
where are i.i.d. regression errors and Independent of , we generate from standard multivariate normal distribution . In this section, we set , and generate regression errors from three different distributions: the normal distribution , the -distribution with degrees of freedom 1.5, and the log-normal distribution . Both and log-normal distributions are heavy-tailed, and produce outliers with high chance.
The results on -error for adaptive Huber regression and the least squares estimator, averaged over 100 simulations, are summarized in Table 1. In the case of normally distributed noise, the adaptive Huber estimator performs as well as the least squares. With heavy-tailed regression errors following Student’s or log-normal distribution, the adaptive Huber regression significantly outperforms the least squares. These empirical results reveal that adaptive Huber regression prevails across various scenarios: not only it provides more reliable estimators in the presence of heavy-tailed and/or asymmetric errors, but also loses almost no efficiency at the normal model.
6.2 Phase Transition
In this section, we validate the phase transition behavior of empirically. We generate continuous responses according to (15), where and are set the same way as before. We sample independent errors as , Student’s -distribution with df degrees of freedom. Note that has finite -th moments provided and infinite df-th moment. Therefore, we take throughout.
In low dimensions, we take and a sequence of degrees of freedoms (df’s): ; in high dimensions, we take , with the same choice of df’s. Tuning parameters are calibrated similarly as before. Indicated by the main theorems, it holds
(Low dimension):
[TABLE] 2. 2.
(High dimension):
[TABLE]
which are approximately and , respectively, when is sufficiently large.
Figure 2 displays the negative -error versus in both low and high dimensions over 200 repetitions for each combination. The empirically fitted curve closely resembles the theoretical curve displayed in Figure 1. These numerical results are in line with the theoretical findings, and empirically validate the phase transition of the adaptive Huber estimator.
We also compared the -error of the adaptive Huber estimator with that of the OLS estimator for -distributed errors with varying degrees of freedoms. As shown in Figure 3, adaptive Huber exhibits a significant advantage especially when is small. The OLS slowly catches up as increases.
6.3 Effective Sample Size
In this section, we verify the scaling behavior of with respect to the effective sample size. The data are generated in the same way as before except that the errors are drawn from . As discussed in the previous subsection, we take and then choose the robustification parameter as where is the -th sample absolute central moment. For simplicity, we take here since our goal is to demonstrate the scaling behavior as grows, instead of to achieve the best finite-sample performance.
The left panel of Figure 4 plots the -error versus sample size over 200 repetitions when the dimension . In all three settings, the -error decays as the sample size grows. As expected, the curves shift to the right when the dimension increases. Theorem 3 provides a specific prediction about this scaling behavior: if we plot the -error versus effective sample size (), the curves should align roughly with the theoretical curve
[TABLE]
for different values of . This is validated empirically by the right panel of Figure 4. This near-perfect alignment in Figure 4 is also observed by Wainwright (2009) for Lasso with sub-Gaussian errors.
6.4 A Real Data Example: NCI-60 Cancer Cell Lines
We apply the proposed methodologies to the NCI-60, a panel of 60 diverse human cancel cell lines. The NCI-60 consists of data on 60 human cancer cell lines and can be downloaded from http://discover.nci.nih.gov/cellminer/. More details on data acquisition can be found in Shankavaram et al. (2007). Our aim is to investigate the effects of genes on protein expressions. The gene expression data were obtained with an Affymetrix HG-U133A/B chip, transformed and normalized with the guanine dytosine robust multi-array analysis. We then combined the same gene expression variables measured by multiple different probes into one by taking their median, resulting in a set of predictors. The protein expressions based on 162 antibodies were acquired via reverse-phase protein lysate arrays in their original scale. One observation had to be removed since all values were missing in the gene expression data, reducing the number of observations to .
We first center all the protein and gene expression variables to have mean zero, and then plot the histograms of the kurtosises of all expressions in Figure 5. The left panel in the figure shows that, 145 out of 162 protein expressions have kurtosises larger than 3; and 49 larger than 9. In other words, more than 89.5% of the protein expression variables have tails heavier than the normal distribution, and about 30.2% are severely heavy-tailed with tails flatter than , the -distribution with 5 degrees of freedom. Similarly, about 36.5% of the gene expression variables, even after the -transformation, still exhibit empirical kurtosises larger than that of . This suggests that, regardless of the normalization methods used, genomic data can still exhibit heavy-tailedness, which was also pointed out by Purdom and Holmes (2005).
We order the protein expression variables according to their scales, measured by the standard deviation. We show the results for the protein expressions based on the KRT19 antibody, the protein keratin 19, which constitutes the variable with the largest standard deviation, serving as one dependent variable. KRT19, a type I keratin, also known as Cyfra 21-1, is encoded by the KRT19 gene. Due to its high sensitivity, the KRT19 antibody is the most used marker for the tumor cells disseminated in lymph nodes, peripheral blood, and bone marrow of breast cancer patients (Nakata et al., 2004). We denote the adaptive Huber regression as AHuber, and that with truncated covariates as TAHuber. We then compare AHuber and TAHuber with Lasso. Both regularization and robustification parameters are chosen by the ten-fold cross-validation.
To measure the predictive performance, we consider a robust prediction loss: the mean absolute error (MAE) defined as
[TABLE]
where and , , denote the observations of the response and predictor variables in the test data, respectively. We report the MAE via the leave-one-out cross-validation. Table 2 reports the MAE, model size and selected genes for the considered methods. TAHuber clearly shows the smallest MAE, followed by AHuber and Lasso. The Lasso produces a fairly large model despite the small sample. Now it has been recognized that Lasso tends to select many noise variables along with the significant ones, especially when data exhibit heavy tails.
The Lasso selects a model with 42 genes but excludes the KRT19 gene, which encodes the protein keratin 19. AHuber finds 11 genes including KRT19. TAHuber results in a model with 7 genes: KRT19, MT1E, ARHGAP29, MALL, ANXA3, MAL2, BAMBI. First, KRT19 encodes the keratin 19 protein. It has been reported in Wu et al. (2008) that the MT1E expression is positively correlated with cancer cell migration and tumor stage, and the MT1E isoform was found to be present in estrogen receptor-negative breast cancer cell lines (Friedline et al., 1998). ANXA3 is highly expressed in all colon cell lines and all breast-derived cell lines positive for the oestrogen receptor (Ross et al., 2000). A very recent study in Zhou et al. (2017) suggested that silencing the ANXA3 expression by RNA interference inhibits the proliferation and invasion of breast cancer cells. Moreover, studies in Shangguan et al. (2012) and Kretzschmar (2000) showed that the BAMBI transduction significantly inhibited TGF-/Smad signaling and expression of carcinoma-associated fibroblasts in human bone marrow mesenchymal stem cells (BM-MSCs), and disrupted the cytokine network mediating the interaction between MSCs and breast cancer cells. Consequently, the BAMBI transduction abolished protumor effects of BM-MSCs in vitro and in an orthotopic breast cancer xenograft model, and instead significantly inhibited growth and metastasis of coinoculated cancer. MAL2 expressions were shown to be elevated at both RNA and protein levels in breast cancer (Shehata et al., 2008). It has also been shown that MALL is associated with various forms of cancer (Oh et al., 2005; Landi et al., 2014). However, the effect of ARHGAP29 and MALL on breast cancer remains unclear and is worth further investigation.
Supplementary Materials
In the supplementary materials, we provide theoretical analysis under random designs, and proofs of all the theoretical results in this paper.
Acknowledgments
The authors thank the Editor, Associate Editor, and two anonymous referees for their valuable comments. This work is supported by a Connaught Award, NSERC Grant RGPIN-2018-06484, NSF Grants DMS-1662139, DMS-1712591, and DMS-1811376, NIH Grant 2R01-GM072611-14, and NSFC Grant 11690014.
Appendix
Appendix A A Lepski-type method
Adapting the unknown robustification parameter depends on the value of the variance provided it exists. Through Lepski’s renowned adaptation method (Lepski, 1991), this can be done without actually knowing the variance in advance. Assume that and let be such that Here, parameters and serve as crude preliminary upper and lower bounds for , respectively.
For a prespecified , let and define the set
[TABLE]
with its cardinality satisfying . For every predetermined , compute a collection of Huber estimators , where for . Set
[TABLE]
where assuming is positive definite. The final data-driven estimator is then defined as .
Theorem 6**.**
For any , the data-dependent estimator satisfies the bound
[TABLE]
with probability at least , provided the sample size satisfies .
Lepski-type construction relies on preliminary crude upper and lower bounds for , which are usually unknown in advance. In practice, one can take and for some , where and is the least squares estimator. Moreover, one may choose and or . However, the effectiveness of this method depends on how sharp the constants are in the theoretical bounds. We note that all constants in Theorems 1 and 6 are explicit, although they might not be sharp. Finding sharp constants remains open. Since the current content already consists of long and technical arguments, we will not pursue this particular goal in this paper.
Proof of Theorem 6.
Following the proof of Theorem 1 which is given in Appendix C, it can be similarly proved that, for any with ,
[TABLE]
with probability at least as long as .
Let and note that . By the definition of ,
[TABLE]
Define the event
[TABLE]
such that . From (17) we see that for each ,
[TABLE]
with probability at least under the prescribed sample size scaling. By the union bound, we obtain that
[TABLE]
On the event , and thus
[TABLE]
Together, the last two displays yield (16). ∎
Appendix B Random Design Analysis
In this section, we derive counterparts of the results in Section 3 under random designs. First we impose the following moment conditions on the covariates and regression errors.
Condition 5**.**
In linear model (2), the covariate vectors are i.i.d. from a sub-Gaussian random vector , i.e. for all and , where with being positive definite and is a constant. The regression errors are independent and satisfy and almost surely for some .
Throughout this section, for simplicity, we assume the independent regression errors in model (2) are homoscedastic in the sense that does not depend on . The conditional heteroscedastic model can be allowed with slight modifications as before. With this setup, we write
[TABLE]
Assuming the matrix is positive definite, we use to denote the rescaled -norm on :
[TABLE]
Moreover, we use to denote the derivative of Huber loss, that is,
[TABLE]
B.1 Huber regression in low dimensions
In the low dimensional regime “”, we consider the Huber estimator
[TABLE]
where is the empirical Huber loss function and is the robustification parameter. Under Condition 5, the following theorem provides (i) exponential-type concentration inequalities for when is properly calibrated, and (ii) a nonasymptotic Bahadur representation result under the finite variance condition on regression errors, i.e. .
Theorem 7**.**
Suppose Condition 5 holds.
- (I)
For any and , the estimator with satisfies
[TABLE]
as long as , where depend only on .
- (II)
Assume that . For any and , the estimator with satisfies
[TABLE]
provided , where depends only on .
With random designs, the first part of Theorem 7 provides concentration inequalities for the -error under finite -th moment conditions with ; when the second moments are finite, the second part gives a finite-sample approximation of by a sum of independent random vectors. The remainder of such an approximation exhibits sub-exponential tails. Unlike the least squares estimator, the adaptive Huber estimator does not admit an explicit closed-form representation, which causes the main difficulty for analyzing its asymptotic and nonasymptotic properties. Theorem 7 reveals that, up to a higher-order remainder, the distributional property of mainly depends on a linear stochastic term that is much easier to deal with.
Regarding the truncated random variable , the following result shows that the differences between the first two moments of and depend on both and the moments of . The higher moment has, the faster these differences decay as a function of . We summarize this observation in the following proposition. We drop for ease of presentation.
Proposition 2**.**
Assume that , and from some . Then we have
[TABLE]
Moreover, if ,
[TABLE]
Proposition 2, along with Theorem 7, shows that the adaptive Huber estimator achieves nonasymptotic robustness against heavy-tailed errors, while enjoying high efficiency when diverges to . In particular, taking , we see that under the scaling , the robust estimator with satisfies
[TABLE]
with probability at least . From an asymptotic point of view, this implies that if the dimension , as a function of , satisfies
[TABLE]
then for any deterministic vector , the distribution of is close to that of . If are independent from with variance and for some , taking in Proposition 2 implies that follows a normal distribution with mean zero and variance asymptotically.
B.2 Huber regression in high dimensions
In the high dimensional setting where and , we investigate the -regularized Huber estimator
[TABLE]
under Condition 5, where and represent, respectively, the robustification and regularization parameters.
Theorem 8**.**
Assume Condition 5 holds and that the unknown is sparse with . Then any optimal solution to the convex program (21) with
[TABLE]
and scaling as satisfies the bounds
[TABLE]
with probability at least as long as , where is a constant only depending on , and .
Provided the distribution of has finite variance, i.e. , Theorem 8 asserts that the -regularized Huber regression with properly tuned gives rise to statistically consistent estimators with - and -errors scaling as and , respectively, under the sample size scaling . These rates are the minimax rates enjoyed by the standard Lasso with Gaussian/sub-Gaussian errors (Bickel, Ritov and Tsybakov, 2009; Wainwright, 2009).
The results of Theorem 8 are useful complements to those in Theorem 4 under fixed designs. Taking therein, we see that the -error bound in (10) almost coincides with that in (23) up to constant factors. The sample size scaling under random designs is optimal and better than the scaling under fixed designs: the former is of order , while the latter is of order . Technically, the sample size scaling is required to ensure the restricted strong convexity of Huber loss in a neighborhood of ; see Lemma 1 in the main text and Lemma 5 below. Since most existing works on analyzing high dimensional -estimators beyond the least squares have focused on random designs (see, e.g. Belloni and Chernozhukov (2011), Negahban et al. (2012) and the references therein), it is not clear what the optimal sample size scaling is under fixed designs, although it is possible that the additional factor in Theorem 4 is purely an artifact of the proof technique. We refer to van de Geer (2008) for a study of generalized linear models in high dimensions. To achieve the oracle rate for the excess risk, the sparsity is required to be of order , or equivalently, the required sample size scales as .
We complete this section by a prediction error bound for , which is a direct consequence of Theorem 8.
Corollary 1**.**
Under the conditions of Theorem 8, it holds
[TABLE]
with probability at least , where is the design matrix.
Appendix C Proofs of Main Theorems
Throughout the proofs, we use as in definition (18) and let be the rescaled -norm on given by for .
C.1 Auxiliary Lemmas
First we collect several auxiliary lemmas. Our first lemma concerns the localized analysis that can be utilized to remove the parameter constraint in previous works. It is established in Fan et al. (2018) and we reproduce it here for completeness.
Lemma 2**.**
Let and . For with and any convex loss functions , we have
[TABLE]
Proof of Lemma 2.
Let . Noting that the derivative of with respect to is , we have
[TABLE]
Then, the symmetric Bregman divergence can be written as
[TABLE]
Taking in the above equation, we have as a special case. If is convex, then is non-decreasing and thus
[TABLE]
It remains to show the convexity of ; or equivalently, the convexity of and , respectively. First, note that , as a function of , is linear in , that is, for all and satisfying . Then, the convexity of follows from this linearity and the convexity of the Huber loss. The convexity of the second term follows directly from the bi-linearity of the inner product. ∎
The following two lemmas provide restricted strong convexity properties for the Huber loss in a local vicinity of the true parameter under both fixed and random designs.
Lemma 3**.**
Assume that Condition 1 holds and that for some . Then for any , the Hessian matrix with satisfies that, with probability greater than ,
[TABLE]
where .
Proof of Lemma 3.
To begin with, note that
[TABLE]
where is given in Condition 1. For each , define its centered and rescaled version such that . Using the inequality that
[TABLE]
we have, for any and satisfying ,
[TABLE]
provided that . For any , it follows from Hoeffding’s inequality that, with probability at least ,
[TABLE]
This, together with the inequality and Condition 1, implies that, with probability at least ,
[TABLE]
This proves (25) immediately by taking . ∎
Lemma 4**.**
Assume for some and for all and some constant . Moreover, let satisfy
[TABLE]
Then with probability at least ,
[TABLE]
uniformly over .
Proof of Lemma 4.
To begin with, note that
[TABLE]
where denotes the indication function of the event
[TABLE]
On , it holds for all . Since for , the right-hand of (28) can be bounded from below by
[TABLE]
To bound the right-hand of (29), the main difficulty is that the indicator function is non-smooth. To deal with this issue, we define the following “smoothed” functions: for any , write
[TABLE]
It is easy to see that the function is -Lipschitz and satisfies
[TABLE]
Together, (28), (29) and (30) imply
[TABLE]
For , define , such that
[TABLE]
for all . In the following, we establish lower and upper bounds for and , respectively, starting with the former.
For , write . By (31) and Markov’s inequality,
[TABLE]
Provided ,
[TABLE]
Next we bound the supremum . Write . Noting that and , we have
[TABLE]
By Theorem 7.3 in Bousquet (2003), for any , satisfies the bound
[TABLE]
with probability at least , where by (30),
[TABLE]
For the expected value , using the symmetrization inequality and the connection between Gaussian complexity and Rademacher complexity, we obtain that , where
[TABLE]
and are i.i.d. standard normal random variables that are independent of . Let be the conditional expectation given . Since is a conditional Gaussian process, for any we have
[TABLE]
Further, taking the expectation with respect to on both sides, (35) remains valid with replaced by . We write as with denoting the first coordinate of and . Recalling , we take so that and . To bound the conditional expectation in (35), we employ the Gaussian comparison theorem as in the proof of Lemma 11 in Loh and Wainwright (2015)
Denote by the conditional variance given . For , write and . By conditional normality, we quickly compute and bound the variance of :
[TABLE]
Using the property for any , we find that
[TABLE]
It follows from the above calculations and the Lipschitz property of that
[TABLE]
Let be i.i.d. standard normal random variables that are independent of all the previous variables, and define a new process
[TABLE]
As an immediate consequence of (36), we have . Therefore, by the Gaussian comparison inequality (Ledoux and Talagrand, 1991),
[TABLE]
where . Taking the expectation with respect to on both sides gives . From this and the unconditional version of (35), we obtain
[TABLE]
Together, (34) with and (37) imply that as long as , with probability at least . Combining this with (29) and (33) proves the stated result. ∎
Recall that . Let be an -cone in , where denotes the support of . As a counterpart of Lemma 4 in high dimensions, Lemma 5 below shows that the adaptive Huber loss satisfies the restricted strong convexity condition over with high probability.
Lemma 5**.**
Assume for some and for all and some constant . Let satisfy
[TABLE]
Then with probability at least ,
[TABLE]
uniformly over .
Proof of Lemma 5.
The proof is based on an argument similar to that in the proof of Lemma 4. With slight abuse of notation, we keep using as the supremum of a random process:
[TABLE]
Provided , it can be shown that
[TABLE]
According to (34), it remains to bound . Following the proof of Lemma 4, it suffices to focus on the (conditional) Gaussian process
[TABLE]
where are i.i.d. standard normal random variables that are independent of all other random variables. For every , it is easy to see that
[TABLE]
implying
[TABLE]
Keep all other statements the same, we obtain
[TABLE]
With , note that
[TABLE]
Since are sub-exponential/sub-gamma random variables, from Corollary 2.6 in Boucheron, Lugosi and Massart (2013) we find that
[TABLE]
Substituting this into (34) and taking , we obtain that with probability at least ,
[TABLE]
for all sufficiently large that scales as up to an absolute constant. This proves (39). ∎
Lemmas 6 and 7 provide concentration inequalities for and , respetively.
Lemma 6**.**
Assume Condition 5 holds with . Then with probability at least ,
[TABLE]
Proof of 6.
Assume without loss of generality that , or equivalently, ; otherwise so that the bound is trivial. To bound , first define the centered random vector
[TABLE]
where . To evaluate the -norm, there exits a -net of the unit sphere in with such that . Under Condition 5, it holds for every that for all . By direct calculations,
[TABLE]
It then follows from Bernstein’s inequality that
[TABLE]
Taking the union bound over , we obtain that with probability at least ,
[TABLE]
Next, for the deterministic part , it is easy to see that
[TABLE]
Combining this and (42) with , we reach the bound (41) which holds with probability at least . ∎
Lemma 7**.**
Assume Condition 5 holds with . Then with probability at least ,
[TABLE]
Proof of Lemma 7.
The proof is based on Bernstein’s inequality and the union bound. Define for such that . For every , note that . Moreover, from the proof of Lemma 6 we see that
[TABLE]
By Bernstein’s inequality, for any it holds
[TABLE]
with probability at least . By the union bound and taking in the last display, we arrive at the stated result. ∎
C.2 Proof of Proposition 1
Define the error vector and function , . By the optimality of and the mean value theorem, we have and thus
[TABLE]
where for some .
Case 1. First we consider the case of . Since , we have and therefore
[TABLE]
Taking , we see that
[TABLE]
Note that
[TABLE]
This, together with the convexity of implies that , where . For the lower bound, note that for all . Putting these upper and lower bounds on together yields
[TABLE]
as a consequence of which . Combining this with (45), we deduce that as long as ,
[TABLE]
This provides a lower bound for the left-hand side of (43). On the other hand, using (44) and Hölder’s inequality to bound the right-hand side of (43), the claim (7) for follows immediately.
Case 2. Next we assume and note that . In this case, we have and . Then, following the same arguments as above, it can be shown that as long as ,
[TABLE]
This proves (7) for and hence completes the proof. ∎
C.3 Proof of Theorem 1
Without loss of generality, we assume throughout the proof; otherwise, and the stated result holds trivially. For simplicity, we write . Note that for any prespecified , we can construct an intermediate estimator, denoted by , such that . To see that, we take if ; otherwise, we can always choose some so that . Applying Lemma 2 gives
[TABLE]
where according to the Karush-Kuhn-Tucker condition. By the mean value theorem for vector-valued functions, we have
[TABLE]
If, there exists some such that
[TABLE]
then we have . Canceling the common factor on both sides yields
[TABLE]
Define the random vector , which can be written as
[TABLE]
By definition (18), . We write for , such that . With , it is easy to see that the function satisfies
[TABLE]
for all . It follows that
[TABLE]
This, together with the inequality for and , implies
[TABLE]
Consequently, we have
[TABLE]
where we used the inequality in the last step. For any , using Markov’s inequality gives
[TABLE]
As long as , we have . On the other hand, it can be similarly shown that . For any , taking in these two inequalities yields that as long as ,
[TABLE]
Taking , it follows from Lemma 3 and the definition of that with probability at least , (48) holds with provided
[TABLE]
Combining (49) and (51) implies that, with probability at least ,
[TABLE]
Provided , the intermediate estimator will lie in the interior of the ball with radius . By our construction in the beginning of the proof, this enforces and thus . ∎
C.4 Proof of Theorem 2
We start by defining a simple class of distributions for the response variable as \mathcal{P}_{c,\gamma}=\big{\{}\mathbb{P}_{c+},\mathbb{P}_{c-}\big{\}}, where
[TABLE]
Here, we suppress the dependence of and on for convenience. It follows that, for any , the -th absolute central moment of with law either or is
[TABLE]
For , let be independent pairs of real-valued random variables satisfying
[TABLE]
Let for , and Taking with , we obtain and
[TABLE]
By assumption, we know that there is an -dimensional vector with each coordinate taking or such that . Note that this assumption naturally holds for the mean model, where and can be taken as . Now we take , and such that for a , and , which indicates that
[TABLE]
Let be any estimator possibly depending on , then the above calculation yields
[TABLE]
where we suppress the dependence of on for simplicity. Using the fact that further implies
[TABLE]
Now since , taking implies the result for the case where . When , the second moment exists, and therefore using the fact that completes the proof. ∎
C.5 Proof of Theorem 3
We start with the proof of Lemma 1.
Proof of Lemma 1.
Let where we suppress the dependence on . Then for any , we have
[TABLE]
As for any , we have
[TABLE]
Moreover, for any , applying Hoeffding’s inequality yields that, with probability at least ,
[TABLE]
Putting the above calculations together, we obtain
[TABLE]
Consequently, as long as , the following inequality
[TABLE]
holds uniformly over with probability at least , where the last inequality in (55) holds whenever and . On the other side, it can be easily shown that . This completes the proof of the lemma. ∎
The following lemma is taken from Fan et al. (2018) with slight modification, which shows that the solution falls in a -cone.
Lemma 8** (-cone Property).**
For any such that , if , then
Now we are ready to prove the theorem.
Proof of Theorem 3.
It suffices to prove the statement for . We start by constructing an intermediate estimator such that for some to be specified. We take if , and choose so that otherwise. Lemma 8, also falls in a -cone:
[TABLE]
Under Condition 3, it follows from Lemma 1 that with probability at least ,
[TABLE]
as long as and . Applying Lemma 2 and following the same calculations as in Lemma B.7 of Fan et al. (2018), we obtain
[TABLE]
which, combined with , implies that
[TABLE]
Inequalities in (56) imply that . By the construction of , we conclude that , and thus the stated result holds. It remains to bound the probability that event occurs. Recall the gradient of evaluated at , i.e. . Following the same argument used in the proof of Theorem 1, we take for some and reach . This, together with (57), proves (9).
Finally, taking for some yields that with probability at least , . As implied by Condition 3 with , we have and thus (10) follows immediately. ∎
C.6 Proof of Theorem 4
The proof of this theorem follows the similar argument to that of Theorem 2. It suffices to prove the result for . Similar to the proof of Theorem 2, We start by defining a simple class of distributions for the response variable as , where
[TABLE]
Here, we suppress the dependence of and on for convenience. It follows that, for any , the -th absolute central moment of with law either or is
[TABLE]
We define the following -sparse sign-ball as
[TABLE]
By assumption, there exist and with such that . Take supported on and such that for a , and . Let be the distribution of and that of . Clearly, we have
[TABLE]
Let be the support of . Then, we have
[TABLE]
Let be any -sparse estimator. With the above setup, we have
[TABLE]
where we suppress the dependence of on for simplicity. For the last quantity in the displayed inequality above, taking with , we obtain and
[TABLE]
Using the fact that this further implies
[TABLE]
Now since , taking implies the result for the case where . When , the second moment exists. Thus using completes the proof. ∎
C.7 Proof of Theorem 5
The proof is almost identical to that of Theorem 3. We only need to derive a probability bound for the event under the assumed scaling and moment conditions, where .
Recall that with for and . Define , where . Moreover, write and . In this notation, we have . From the identity , we see that
[TABLE]
Then it holds
[TABLE]
For each fixed, note that
[TABLE]
Applying Bernstein’s inequality gives
[TABLE]
with probability at least . Taking the union bound over , we obtain that, with probability at least ,
[TABLE]
This, together with (60), implies that provided
[TABLE]
This is the stated result. ∎
C.8 Proof of Theorem 7
To begin with, define the parameter set for some to be specified, and let be the intermediate estimator introduced in the proof of Theorem 1.
Proof of (19). In view of (47) and (49), lying in the heart of the arguments is to derive deviation inequalities for under the moment condition that for some , and to establish the restricted strong convexity for the Huber loss , i.e. there exists some such that
[TABLE]
holds uniformly over in a neighborhood of .
First, from (41) in Lemma 6 we see that
[TABLE]
with probability at least . Next, since and according to Lemma 4, we take such that under the scaling (26),
[TABLE]
with probability at least . Together, the last two displays and (47) imply that with probability at least , provided , where is a constant depending only on . Following the same arguments as we used in the proof of Theorem 1, this proves (19).
Proof of (20). From the preceding proof, we see that
[TABLE]
as long as , where . Moreover, define random processes and
[TABLE]
To bound , the key is to bound the supremum of the empirical process . To that end, we deal with and separately, starting with the latter. By the mean value theorem,
[TABLE]
where is a convex combination of and . Therefore,
[TABLE]
For and , write such that . Let be the constant in Lemma 4 that scales as . It follows that
[TABLE]
which, further implies
[TABLE]
Next, we consider . With , define a new process , satisfying and . Note that, for every and ,
[TABLE]
Under Condition 5, there exist constants depending only on such that, for any ,
[TABLE]
With the above preparations and applying Theorem A.3 in Spokoiny (2013), we reach
[TABLE]
as long as . Together with (63), this yields
[TABLE]
with probability at least . Combine this bound with (61) to obtain the stated result (20). ∎
C.9 Proof of Proposition 2
Since , we have . Thus, for any , . In particular, taking to be 2 and proves the first conclusion. Next, note that . Letting , we deduce that
[TABLE]
By Markov’s inequality, . Putting the above calculations together proves the second inequality. ∎
C.10 Proof of Theorem 8
For simplicity, we write and assume without loss of generality that . As in the proof of Theorem 7, we construct an intermediate estimator satisfying for some to be specified. We take if ; otherwise if , there exists such that . Lemma 2 demonstrates that
[TABLE]
Next, let be the support of and define the -cone :
[TABLE]
We claim that
[TABLE]
from which it follows
[TABLE]
where . To prove (65), first, from the optimality of we see that
[TABLE]
By direct calculation, we have
[TABLE]
Under the scaling , it follows from the convexity of and Cauchy-Schwarz inequality that
[TABLE]
Together, (67) and (68) imply and thus .
By necessary conditions of extrema in the convex optimization problem (21),
[TABLE]
where satisfies . Under the scaling , it holds
[TABLE]
Together with (64), this implies
[TABLE]
Moreover, we introduce and note that . By (65), we also have under the assumed scaling.
Let be the event on which (39) holds. Then under the scaling (38) and it holds on that
[TABLE]
Substituting this lower bound into (69) yields
[TABLE]
Canceling on both sides delivers
[TABLE]
under the scaling and (38) .
It remains to calibrate the parameters and . First, applying Lemma 7 with , we see that
[TABLE]
with probability at least , where . We therefore choose for some constant , such that with probability at least . Next, according to (38), the restricted strong convexity (39) holds with . Putting the above calculations together, we conclude that
[TABLE]
with probability at least , assuming the scaling . By the construction of , with the same probability we must have and therefore . The stated result (23) then follows from (70). ∎
C.11 Proof of Corollary 1
Recall that are i.i.d. random vectors from a sub-Gaussian vector with . Let be an matrix whose rows are independent isotropic sub-Gaussian random vectors. Since , Definition 1 in Rudelson and Zhou (2013) holds with , , and . Taking in Theorem 16 of Rudelson and Zhou (2013) we obtain that, with probability at least ,
[TABLE]
for all as long as . This, together with (65) and (71), proves (24). ∎
References
- Belloni and Chernozhukov (2011)
Belloni, A. and Chernozhukov, V. (2011). -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39 82–130.
- Bickel, Ritov and Tsybakov (2009)
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37 1705–1732.
- Boucheron, Lugosi and Massart (2013)
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford.
- Bousquet (2003)
Bousquet, O. (2003). Concentration inequalities for sub-additive functions using the entropy method. In Stochastic Inequalities and Applications. Progress in Probability 56 213–247. Birkhäuser, Basel.
- Fan et al. (2018)
Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018).
I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 96 1348–1360.
- Ledoux and Talagrand (1991)
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin.
- Lepski (1991)
Lepski, O. V. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. IEEE Transactions on Information Theory, 36 682–697.
- Loh and Wainwright (2015)
Loh, P.-L. and Wainwright, M. J. (2015). Regularized -estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16 559–616.
- Negahban et al. (2012)
Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of -estimators with decomposable regularizers. Statistical Science, 27 538–557.
- Rudelson and Zhou (2013)
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59 3434–3447.
- Spokoiny (2013)
Spokoiny, V. (2013). Bernstein–von Mises theorem for growing parameter dimension. Preprint. Available at arXiv:1302.3430.
- van de Geer (2008)
van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36 614–645.
- Wainwright (2009)
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55 2183–2202.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alquier, Cottett and Lecué (2017) Alquier, P. , Cottet, V. and Lecué, G. (2017). Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. Preprint. Available at ar Xiv:1702.01402 .
- 2Belloni and Chernozhukov (2011) Belloni, A. and Chernozhukov, V. (2011). ℓ 1 subscript ℓ 1 \ell_{1} -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics , 39 82–130.
- 3Bellec et al. (2018) Bellec, P. C. , Lecué, G. and Tsybakov, A. B. (2018). Slope meets Lasso: Improved oracle bounds and optimality. The Annals of Statistics , 46 3603–3642.
- 4Bickel, Ritov and Tsybakov (2009) Bickel, P. J. , Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics , 37 1705–1732.
- 5Bogdan et al. (2015) Bogdan, M. , van den Berg, E. , Sabatti, C. , Su, W. and Candès, E. J. (2015). SLOPE–Adaptive variable selection via convex optimization. The Annals of Applied Statistics , 9 1103–1140.
- 6Brownlees, Joly and Lugosi (2015) Brownlees, C. , Joly, E. and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. The Annals of Statistics , 43 2507–2536.
- 7Bühlmann and van de Geer (2011) Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
- 8Catoni (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Annales de I’Institut Henri Poincaré - Probabilités et Statistiques , 48 1148–1185.
