Adaptive Huber Regression

Qiang Sun, Wen-Xin Zhou, and Jianqing Fan

arXiv:1706.06991·math.ST·October 11, 2018


TL;DR

This paper introduces adaptive Huber regression, a robust method for handling heavy-tailed data and outliers in high-dimensional settings, with theoretical guarantees and practical applications.

Contribution

It develops an adaptive robust regression framework that adjusts to data moments, providing sharp phase transition results and extending to heavy-tailed predictors and noise.

Findings

1. Achieves sub-Gaussian deviation bounds without sub-Gaussian data assumptions for δ ≥ 1

2. Demonstrates a smooth, optimal phase transition in estimation accuracy based on data tail heaviness

3. Shows improved robustness and prediction in heavy-tailed genetic data applications

Abstract

Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded $(1 + δ)$ -th moment for any $δ > 0$ . We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when $δ \geq 1$ , the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime $0 < δ < 1$ . Furthermore, this transition is smooth and…

Tables

Table 1: Results for adaptive Huber regression (AHR) and ordinary least squares (OLS) when $n=100$ and $d=5$. The mean and standard deviation (std) of the $\ell_{2}$-error based on 100 simulations are reported.

Noise	AHR mean	AHR std	OLS mean	OLS std
Normal	0.566	0.189	0.567	0.191
Student’s $t$	0.806	0.651	1.355	2.306
Log-normal	3.917	3.740	8.529	13.679

Table 2: We report the mean absolute error (MAE) for protein expressions based on the KRT19 antibody from the NCI-60 cancer cell lines, computed from leave-one-out cross-validation. We also report the model size and selected genes for each method.

Method	MAE	Size	Selected Genes
Lasso	7.64	42	FBLIM1, MT1E, EDN2, F3, FAM102B, S100A14, LAMB3, EPCAM, FN1, TM4SF1, UCHL1, NMU, ANXA3, PLAC8, SPP1, TGFBI, CD74, GPX3, EDN1, CPVL, NPTX2, TES, AKR1B10, CA2, TSPYL5, MAL2, GDA, BAMBI, CST6, ADAMTS15, DUSP6, BTG1, LGALS3, IFI27, MEIS2, TOX3, KRT23, BST2, SLPI, PLTP, XIST, NGFRAP1
AHuber	6.74	11	MT1E, ARHGAP29, CPCAM, VAMP8, MALL, ANXA3, MAL2, BAMBI, LGALS3, KRT19, TFF3
TAHuber	5.76	7	MT1E, ARHGAP29, MALL, ANXA3, MAL2, BAMBI, KRT19

Equations

\mathbb{P}\big[|\widehat{\mu}_{\rm C}(t)-\mu|\leq t\sigma/n^{1/2}\big]\geq 1-2\exp(-ct^{2}),

y_{i}=\langle\bm{x}_{i},\bm{\beta}^{*}\rangle+\varepsilon_{i},~\mbox{ with }~\mathbb{E}(\varepsilon_{i}|\bm{x}_{i})=0~\mbox{ and }~v_{i,\delta}=\mathbb{E}\big(|\varepsilon_{i}|^{1+\delta}\big)<\infty.

\ell_{\tau}(x)=\left\{\begin{array}{ll}x^{2}/2,&\mbox{if }|x|\leq\tau,\\ \tau|x|-\tau^{2}/2,&\mbox{if }|x|>\tau,\end{array}\right.

\widehat{\bm{\beta}}_{\tau}=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\mathcal{L}_{\tau}(\bm{\beta}).

\widehat{\bm{\beta}}_{\tau,\lambda}\in\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\big\{\mathcal{L}_{\tau}(\bm{\beta})+\lambda\|\bm{\beta}\|_{1}\big\},

\big\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\big\|_{2}\lesssim d_{\textnormal{eff}}^{1/2}\,n_{\textnormal{eff}}^{-\min\{\delta/(1+\delta),1/2\}}~~\textnormal{with high probability.}

\bm{\beta}^{*}_{\tau}:=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\mathbb{E}\{\mathcal{L}_{\tau}(\bm{\beta})\}=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\{\ell_{\tau}(y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle)\},

\|\bm{\beta}^{*}_{\tau}-\bm{\beta}^{*}\|_{2}\leq 2c_{l}^{-1/2}v_{\delta}\tau^{-\delta}

\underbrace{\big\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\big\|_{2}}_{\textnormal{total error}}\leq\underbrace{\big\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}_{\tau}^{*}\big\|_{2}}_{\textnormal{estimation error}}+\underbrace{\big\|\bm{\beta}_{\tau}^{*}-\bm{\beta}^{*}\big\|_{2}}_{\textnormal{approximation bias}},

\big\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\big\|_{2}\leq 4c_{l}^{-1}L\tau_{0}\,d^{1/2}\bigg(\frac{t}{n}\bigg)^{\min\{\delta/(1+\delta),1/2\}}

\sup_{\mathbb{P}\in\mathcal{P}_{\delta}^{v_{\delta}}}\mathbb{P}\Bigg[\big\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\big\|_{2}\geq\alpha c_{u}^{-1}\nu_{\delta}\,d^{1/2}\bigg(\frac{t}{n}\bigg)^{\min\{\delta/(1+\delta),1/2\}}\Bigg]\geq\frac{e^{-2t}}{2},

\kappa_{+}(m,\gamma,r)=\sup\Big\{\langle\bm{u},\mathbf{H}_{\tau}(\bm{\beta})\bm{u}\rangle:(\bm{u},\bm{\beta})\in\mathcal{C}(m,\gamma,r)\Big\},

\kappa_{-}(m,\gamma,r)=\inf\Big\{\langle\bm{u},\mathbf{H}_{\tau}(\bm{\beta})\bm{u}\rangle:(\bm{u},\bm{\beta})\in\mathcal{C}(m,\gamma,r)\Big\},

\rho_{+}(m,\gamma)

\rho_{-}(m,\gamma)

\big\|\widehat{\bm{\beta}}_{\tau,\lambda}-\bm{\beta}^{*}\big\|_{2}\leq 3\kappa_{l}^{-1}s^{1/2}\lambda,

\big\|\widehat{\bm{\beta}}_{\tau,\lambda}-\bm{\beta}^{*}\big\|_{2}\lesssim\kappa_{l}^{-1}L\tau_{0}\,s^{1/2}\bigg\{\frac{(1+c)\log d}{n}\bigg\}^{\min\{\delta/(1+\delta),1/2\}}

\sup_{\mathbb{P}\in\mathcal{P}_{\delta}^{v_{\delta}}}\mathbb{P}\Bigg[\big\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\big\|_{2}\geq\nu_{\delta}\frac{\alpha s^{1/2}}{\kappa_{u}}\bigg(\frac{A\log d}{2n}\bigg)^{\min\{\delta/(1+\delta),1/2\}}\Bigg]\geq 2^{-1}d^{-A},

\widehat{\bm{\beta}}_{\tau,\varpi,\lambda}\in\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\big\{\mathcal{L}^{\varpi}_{\tau}(\bm{\beta})+\lambda\|\bm{\beta}\|_{1}\big\},

\big\|\widehat{\bm{\beta}}_{\tau,\varpi,\lambda}-\bm{\beta}^{*}\big\|_{2}\leq 3\kappa_{l}^{-1}s^{1/2}\lambda.

\lambda

\quad+2\big(2\sigma^{2}M_{2}+2M_{4}\|\bm{\beta}^{*}\|_{2}^{2}\,s\big)^{1/2}\sqrt{\frac{t}{n}}+\varpi\tau\frac{t}{n},

\tau\asymp s^{1/2}\bigg(\frac{n}{\log d}\bigg)^{1/4},\ \ \varpi\asymp\bigg(\frac{n}{\log d}\bigg)^{1/4}~\mbox{ and }~\lambda\asymp\sqrt{\frac{s\log d}{n}}.

g(\bm{\beta}\,|\,\bm{\beta}^{(k)})\geq f(\bm{\beta})~\mbox{ and }~g(\bm{\beta}^{(k)}\,|\,\bm{\beta}^{(k)})=f(\bm{\beta}^{(k)}).

f(\bm{\beta}^{(k+1)})\overset{\mbox{major.}}{\leq}g(\bm{\beta}^{(k+1)}\,|\,\bm{\beta}^{(k)})\overset{\mbox{min.}}{\leq}g(\bm{\beta}^{(k)}\,|\,\bm{\beta}^{(k)})\overset{\mbox{init.}}{=}f(\bm{\beta}^{(k)}).

g_{k}(\bm{\beta}|\bm{\beta}^{(k)})=\mathcal{L}_{\tau}(\bm{\beta}^{(k)})+\big\langle\nabla\mathcal{L}_{\tau}(\bm{\beta}^{(k)}),\,\bm{\beta}-\bm{\beta}^{(k)}\big\rangle+\frac{\phi_{k}}{2}\big\|\bm{\beta}-\bm{\beta}^{(k)}\big\|_{2}^{2},

\min_{\bm{\beta}\in\mathbb{R}^{d}}\biggl\{\big\langle\nabla\mathcal{L}_{\tau}(\bm{\beta}^{(k)}),\bm{\beta}-\bm{\beta}^{(k)}\big\rangle+\frac{\phi_{k}}{2}\big\|\bm{\beta}-\bm{\beta}^{(k)}\big\|_{2}^{2}+\lambda\big\|\bm{\beta}\big\|_{1}\biggr\}.

\bm{\beta}^{(k+1)}=T_{\lambda,\phi_{k}}(\bm{\beta}^{(k)})=S\Big(\bm{\beta}^{(k)}-\phi_{k}^{-1}\nabla\mathcal{L}_{\tau}(\bm{\beta}^{(k)}),\,\phi_{k}^{-1}\lambda\Big),

\tau=c_{\tau}\times\sigma\Big(\frac{n_{\textnormal{eff}}}{t}\Big)^{1/2}~\mbox{ and }~\lambda=c_{\lambda}\times\sigma\Big(\frac{n_{\textnormal{eff}}}{t}\Big)^{1/2},

y_{i}=\langle\bm{x}_{i},\bm{\beta}^{*}\rangle+\varepsilon_{i},\quad i=1,\ldots,n,

-\log\big(\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\|_{2}\big)\asymp\frac{\delta}{1+\delta}\log(n)-\frac{1}{1+\delta}\log(v_{\delta}),\ \ 0<\delta\leq 1,

-\log\big(\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\|_{2}\big)\asymp\frac{\delta}{1+\delta}\log\Big(\frac{n}{\log d}\Big)-\frac{1}{1+\delta}\log(v_{\delta}),\ \ 0<\delta\leq 1,


Taxonomy

Topics: Statistical Methods and Inference · Bayesian Methods and Mixture Models · Advanced Statistical Methods and Models

Full text

Adaptive Huber Regression

Qiang Sun is Assistant Professor, Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada (E-mail: [email protected]).

Wen-Xin Zhou is Assistant Professor, Department of Mathematics, University of California, San Diego, La Jolla, CA 92093 (E-mail: [email protected]). Jianqing Fan is Honorary Professor, School of Data Science, Fudan University, Shanghai, China and Frederick L. Moore ’18 Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, NJ 08544 (E-mail: [email protected]).

Qiang Sun, Wen-Xin Zhou, and Jianqing Fan

Abstract

Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded $(1+\delta)$ -th moment for any $\delta>0$ . We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when $\delta\geq 1$ , the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime $0<\delta<1$ . Furthermore, this transition is smooth and optimal. We extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive.

Keywords: Adaptive Huber regression, bias and robustness tradeoff, finite-sample inference, heavy-tailed data, nonasymptotic optimality, phase transition.

1 Introduction

Modern data acquisitions have facilitated the collection of massive and high dimensional data with complex structures. Along with holding great promise for discovering subtle population patterns that are less achievable with small-scale data, big data have introduced a series of new challenges to data analysis, both computationally and statistically (Loh and Wainwright, 2015; Fan et al., 2018). During the last two decades, extensive progress has been made towards extracting useful information from massive data with high dimensional features and sub-Gaussian tails (Tibshirani, 1996; Fan and Li, 2001; Efron et al., 2004; Bickel, Ritov and Tsybakov, 2009). Here, a random variable $Z$ is said to have sub-Gaussian tails if there exist constants $c_{1}$ and $c_{2}$ such that $\mathbb{P}(|Z|>t)\leq c_{1}\exp(-c_{2}t^{2})$ for any $t\geq 0$. We refer to the monographs Bühlmann and van de Geer (2011) and Hastie, Tibshirani and Wainwright (2015) for a systematic coverage of contemporary statistical methods for high dimensional data.

The sub-Gaussian tails requirement, albeit convenient for theoretical analysis, is not realistic in many practical applications since modern data are often collected with low quality. For example, a recent study on functional magnetic resonance imaging (fMRI) (Eklund, Nichols and Knutsson, 2016) shows that the principal cause of invalid fMRI inferences is that the data do not follow the assumed Gaussian shape, which speaks to the need to validate the statistical methods being used in the field of neuroimaging. In a microarray data example considered in Wang, Peng and Li (2015), it is observed that some gene expression levels have heavy tails, as their kurtoses are much larger than 3 despite the normalization methods used. In finance, the power-law nature of the distribution of returns has been validated as a stylized fact (Cont, 2001). Fan et al. (2016) argued that heavy-tailed distribution is a stylized feature of high dimensional data and proposed a shrinkage principle to attenuate the influence of outliers. Standard statistical procedures that are based on the method of least squares often behave poorly in the presence of heavy-tailed data (Catoni, 2012). Here, we say a random variable $X$ has heavy tails if $\mathbb{P}(|X|>t)$ decays to zero polynomially in $1/t$ as $t\to\infty$. It is therefore of ever-increasing interest to develop new statistical methods that are robust against heavy-tailed errors and other potential forms of contamination.

In this paper, we first revisit the robust regression that was initiated by Peter Huber in his seminal work Huber (1973). Asymptotic properties of the Huber estimator have been well studied in the literature. We refer to Huber (1973), Yohai and Maronna (1979), Portnoy (1985), Mammen (1989) and He and Shao (1996, 2000) for an unavoidably incomplete overview. However, in all of the aforementioned papers, the robustification parameter is held fixed, chosen according to the 95% asymptotic efficiency rule. As a consequence, this procedure cannot estimate the model-generating parameters consistently when the sample distribution is asymmetric.

From a nonasymptotic perspective (rather than an asymptotic efficiency rule), we propose to use the Huber regression with an adaptive robustification parameter, which is referred to as the adaptive Huber regression, for robust estimation and inference. Our adaptive procedure achieves the nonasymptotic robustness in the sense that the resulting estimator admits exponential-type concentration bounds when only low-order moments exist. Moreover, the resulting estimator is also an asymptotically unbiased estimate for the parameters of interest. In particular, we do not impose symmetry and homoscedasticity conditions on error distributions, so that our problem is intrinsically different from median/quantile regression models, which are also of independent interest and serve as important robust techniques (Koenker, 2005).

We make several major contributions towards robust modeling in this paper. First and foremost, we establish nonasymptotic deviation bounds for adaptive Huber regression when the error variables have only finite $(1+\delta)$ -th moments. By providing a matching lower bound, we observe a sharp phase transition phenomenon, which is in line with that discovered by Devroye et al. (2016) for univariate mean estimation. Second, a similar phase transition for regularized adaptive Huber regression is established in high dimensions. By defining the effective dimension and effective sample size, we present nonasymptotic results under the two different regimes in a unified form. Last, by exploiting the localized analysis developed in Fan et al. (2018), we remove the artificial bounded parameter constraint imposed in previous works; see Loh and Wainwright (2015) and Fan, Li and Wang (2017). In the supplementary material, we present a nonasymptotic Bahadur representation for the adaptive Huber estimator when $\delta\geq 1$ , which provides a theoretical foundation for robust finite-sample inference.

The rest of the paper proceeds as follows. The remainder of this section is devoted to related literature. In Section 2, we revisit the Huber loss and robustification parameter, followed by the proposal of adaptive Huber regression in both low and high dimensions. We sharply characterize the nonasymptotic performance of the proposed estimators in Section 3. In Section 4, we extend the methodology to allow possibly heavy-tailed covariates/predictors. We describe the algorithm and implementation in Section 5. Section 6 is devoted to simulation studies and a real data application. All the proofs are collected in the supplementary material.

1.1 Related Literature

The terminology “robustness” used in this paper describes how stably the method performs with respect to the tail behavior of the data, which can be either sub-Gaussian/sub-exponential or Pareto-like (Delaigle, Hall and Jin, 2011; Catoni, 2012; Devroye et al., 2016). This is different from the conventional perspective of robust statistics under Huber’s $\epsilon$ -contamination model (Huber, 1964), for which a number of depth-based procedures have been developed since the groundbreaking work of John Tukey (Tukey, 1975). Significant contributions have also been made in Liu (1990), Liu, Parelius, and Singh (1999), Zuo and Serfling (2000), Mizera (2002) and Mizera and Müller (2004). We refer to Chen, Gao and Ren (2018) for the most recent result and a literature review concerning this problem.

Our main focus is on the conditional mean regression in the presence of heavy-tailed and asymmetric errors, which automatically distinguishes our method from quantile-based robust regressions (Koenker, 2005; Belloni and Chernozhukov, 2011; Wang, 2013; Fan, Fan and Barut, 2014; Zheng, Peng and He, 2015). In general, quantile regression is biased for estimating the mean regression coefficient unless the error distributions are symmetric around zero. Another recent work that is related to ours is Alquier, Cottet and Lecué (2017). They studied a general class of regularized empirical risk minimization procedures with a particular focus on Lipschitz losses, which includes the quantile, hinge and logistic losses. Unlike all of these works, our goal is to estimate the mean regression coefficients robustly. The robustness is witnessed by a nonasymptotic analysis: the proposed estimators achieve sub-Gaussian deviation bounds when the regression errors have only finite second moments. Asymptotically, our proposed estimators are fully efficient: they achieve the same efficiency as the ordinary least squares estimators.

An important step towards estimation under heavy-tailedness has been made by Catoni (2012), whose focus is on estimating a univariate mean. Let $X$ be a real-valued random variable with mean $\mu=\mathbb{E}(X)$ and variance $\sigma^{2}=\textnormal{var}(X)>0$ , and assume that $X_{1},\ldots,X_{n}$ are independent and identically distributed (i.i.d.) from $X$ . For any prespecified exception probability $t>0$ , Catoni constructs a robust mean estimator $\widehat{\mu}_{{\rm C}}(t)$ that deviates from the true mean $\mu$ logarithmically in $1/t$ , that is,

$$\mathbb{P}\big[|\widehat{\mu}_{\rm C}(t)-\mu|\leq t\sigma/n^{1/2}\big]\geq 1-2\exp(-ct^{2}),\tag{1}$$

while the empirical mean deviates from the true mean only polynomially in $1/t^{2}$ , namely sub-Gaussian tails versus Cauchy tails in terms of $t$ . Further, Devroye et al. (2016) developed adaptive sub-Gaussian estimators that are independent of the prespecified exception probability. Beyond mean estimation, Brownlees, Joly and Lugosi (2015) extended Catoni’s idea to study empirical risk minimization problems when the losses are unbounded. Generalizations of the univariate results to those for matrices, such as the covariance matrices, can be found in Catoni (2016), Minsker (2018), Giulini (2017) and Fan, Li and Wang (2017). Fan, Li and Wang (2017) modified Huber’s procedure (Huber, 1973) to obtain a robust estimator, which is concentrated around the true mean with exponentially high probability in the sense of (1), and also proposed a robust procedure for sparse linear regression with asymmetric and heavy-tailed errors.
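
To make the contrast concrete, the following worked comparison rewrites both deviation bounds at a common exception probability. The symbol $p$ is introduced here only for illustration, and the empirical-mean bound is the standard Chebyshev inequality rather than a result taken from the paper.

```latex
% p: target exception probability (illustrative symbol, not from the paper)
% Empirical mean, via Chebyshev's inequality with var(\bar{X}_n) = \sigma^2 / n:
\mathbb{P}\Big[\,|\bar{X}_{n}-\mu|\leq \frac{\sigma}{n^{1/2}}\,p^{-1/2}\Big]\geq 1-p,
\qquad \bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}.
% Catoni-type bound (1), after solving 2\exp(-ct^{2}) = p for t:
\mathbb{P}\Big[\,|\widehat{\mu}_{\rm C}(t)-\mu|\leq \frac{\sigma}{n^{1/2}}\,\big\{c^{-1}\log(2/p)\big\}^{1/2}\Big]\geq 1-p.
```

For instance, at $p=10^{-4}$ the Chebyshev factor is $p^{-1/2}=100$, whereas $\{\log(2/p)\}^{1/2}\approx 3.15$ (to be multiplied by the unspecified constant $c^{-1/2}$), which is why the sub-Gaussian-type bound matters for large-scale simultaneous inference.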

Notation: We fix some notations that will be used throughout this paper. For any vector $\bm{u}=(u_{1},\ldots,u_{d})^{\mathrm{\scriptstyle T}}\in\mathbb{R}^{d}$ and $q\geq 1$ , $\|\bm{u}\|_{q}=(\sum_{j=1}^{d}|u_{j}|^{q})^{1/q}$ is the $\ell_{q}$ norm. For any vectors $\bm{u},\bm{v}\in\mathbb{R}^{d}$ , we write $\langle\bm{u},\bm{v}\rangle=\bm{u}^{\mathrm{\scriptstyle T}}\bm{v}$ . Moreover, we let $\|\bm{u}\|_{0}=\sum_{j=1}^{d}1(u_{j}\!\neq\!0)$ denote the number of nonzero entries of $\bm{u}$ , and set $\|\bm{u}\|_{\infty}=\max_{1\leq j\leq d}|u_{j}|$ . For two sequences of real numbers $\{a_{n}\}_{n\geq 1}$ and $\{b_{n}\}_{n\geq 1}$ , $a_{n}\lesssim b_{n}$ denotes $a_{n}\leq Cb_{n}$ for some constant $C>0$ independent of $n$ , $a_{n}\gtrsim b_{n}$ if $b_{n}\lesssim a_{n}$ , and $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$ . For two scalars, we use $a\wedge b=\min\{a,b\}$ to denote the minimum of $a$ and $b$ . If $\mathbf{A}$ is an $m\times n$ matrix, we use $\|\mathbf{A}\|$ to denote its spectral norm, defined by $\|\mathbf{A}\|=\max_{\bm{u}\in\mathbb{S}^{n-1}}\|\mathbf{A}\bm{u}\|_{2}$ , where $\mathbb{S}^{n-1}=\{\bm{u}\in\mathbb{R}^{n}:\|\bm{u}\|_{2}=1\}$ is the unit sphere in $\mathbb{R}^{n}$ . For an $n\times n$ matrix $\mathbf{A}$ , we use $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$ to denote the maximum and minimum eigenvalues of $\mathbf{A}$ , respectively. For two $n\times n$ matrices $\mathbf{A}$ and $\mathbf{B}$ , we write $\mathbf{A}\preceq\mathbf{B}$ if $\mathbf{B}-\mathbf{A}$ is positive semi-definite. For a function $f:\mathbb{R}^{d}\to\mathbb{R}$ , we use $\nabla f\in\mathbb{R}^{d}$ to denote its gradient vector as long as it exists.

2 Methodology

We consider i.i.d. observations $(y_{1},\bm{x}_{1}),\ldots,(y_{n},\bm{x}_{n})$ that are generated from the following heteroscedastic regression model

$$y_{i}=\langle\bm{x}_{i},\bm{\beta}^{*}\rangle+\varepsilon_{i},~\mbox{ with }~\mathbb{E}(\varepsilon_{i}|\bm{x}_{i})=0~\mbox{ and }~v_{i,\delta}=\mathbb{E}\big(|\varepsilon_{i}|^{1+\delta}\big)<\infty.$$

Assuming that the second moments are bounded ( $\delta=1$ ), the standard ordinary least squares (OLS) estimator, denoted by $\widehat{\bm{\beta}}^{\textnormal{ols}}$ , admits a suboptimal polynomial-type deviation bound, and thus does not concentrate around $\bm{\beta}^{*}$ tightly enough for large-scale simultaneous estimation and inference. The key observation that underpins this suboptimality of the OLS estimator is the sensitivity of quadratic loss to outliers (Huber, 1973; Catoni, 2012), while the Huber regression with a fixed tuning constant may lead to nonnegligible estimation bias. To overcome this drawback, we propose to employ the Huber loss with an adaptive robustification parameter to achieve robustness and (asymptotic) unbiasedness simultaneously. We begin with the definitions of the Huber loss and the corresponding robustification parameter.

Definition 1 (Huber Loss and Robustification Parameter).

The Huber loss $\ell_{\tau}(\cdot)$ (Huber, 1964) is defined as

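$$\ell_{\tau}(x)=\left\{\begin{array}{ll}x^{2}/2,&\mbox{if }|x|\leq\tau,\\ \tau|x|-\tau^{2}/2,&\mbox{if }|x|>\tau,\end{array}\right.$$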

where $\tau>0$ is referred to as the robustification parameter that balances bias and robustness (Fan, Li and Wang, 2017).

The loss function $\ell_{\tau}(x)$ is quadratic for small values of $x$ , and becomes linear when $x$ exceeds $\tau$ in magnitude. The parameter $\tau$ therefore controls the blending of quadratic and $\ell_{1}$ losses, which can be regarded as two extremes of the Huber loss with $\tau=\infty$ and $\tau\rightarrow 0$ , respectively. Compared with least squares, outliers are down-weighted in the Huber loss. We will use the name, adaptive Huber loss, to emphasize the fact that the parameter $\tau$ should adapt to the sample size, dimension and moments for a better tradeoff between bias and robustness. This distinguishes our framework from the classical setting. As $\tau\to\infty$ is needed to reduce the bias when the error distribution is asymmetric, this loss is also called the RA-quadratic (robust approximation to quadratic) loss in Fan, Li and Wang (2017).
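
The piecewise definition can be written out in a few lines. The sketch below is only an illustration in Python; the function names `huber_loss` and `huber_score` are hypothetical and are not taken from the paper or from any package.

```python
def huber_loss(x: float, tau: float) -> float:
    """Huber loss: x^2/2 when |x| <= tau, and tau*|x| - tau^2/2 otherwise."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    if abs(x) <= tau:
        return 0.5 * x * x
    return tau * abs(x) - 0.5 * tau * tau


def huber_score(x: float, tau: float) -> float:
    """Derivative of the Huber loss, i.e. x clipped to the interval [-tau, tau]."""
    return max(-tau, min(tau, x))


if __name__ == "__main__":
    # Compare the Huber loss (tau = 1) with the quadratic loss x^2/2.
    for residual in (0.5, 1.0, 3.0, 50.0):
        print(residual, huber_loss(residual, tau=1.0), 0.5 * residual ** 2)
```

Because the derivative is the residual clipped at $\pm\tau$, a single large residual contributes at most $\tau$ to the gradient, which is the down-weighting of outliers described above.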

Define the empirical loss function $\mathcal{L}_{\tau}(\bm{\beta})=n^{-1}\sum_{i=1}^{n}\ell_{\tau}(y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle)$ for $\bm{\beta}\in\mathbb{R}^{d}$ . The Huber estimator is defined through the following convex optimization problem:

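$$\widehat{\bm{\beta}}_{\tau}=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\mathcal{L}_{\tau}(\bm{\beta}).\tag{3}$$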

In low dimensions, under the condition that $v_{\delta}=n^{-1}\sum_{i=1}^{n}\mathbb{E}(|\varepsilon_{i}|^{1+\delta})<\infty$ for some $\delta>0$ , we will prove that $\widehat{\bm{\beta}}_{\tau}$ with $\tau\asymp\min\{v_{\delta}^{1/(1+\delta)},v_{1}^{1/2}\}\,n^{\max\{1/(1+\delta),1/2\}}$ (the first factor is kept in order to show its explicit dependence on the moment) achieves the tight upper bound $d^{1/2}\tau^{-(\delta\wedge 1)}\asymp d^{1/2}n^{-\min\{\delta/(1+\delta),1/2\}}$ . The phase transition at $\delta=1$ can be easily observed (see Figure 1). When higher moments exist ( $\delta\geq 1$ ), robustification leads to a sub-Gaussian-type deviation inequality in the sense of (1).

In the high dimensional regime, we consider the following regularized adaptive Huber regression with a different choice of the robustification parameter:

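$$\widehat{\bm{\beta}}_{\tau,\lambda}\in\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\big\{\mathcal{L}_{\tau}(\bm{\beta})+\lambda\|\bm{\beta}\|_{1}\big\},$$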

where $\tau\asymp\nu_{\delta}\{n/(\log d)\}^{\max\{1/(1+\delta),1/2\}}$ and $\lambda\asymp\nu_{\delta}\{(\log d)/n\}^{\min\{\delta/(1+\delta),1/2\}}$ with $\nu_{\delta}=\min\{v_{\delta}^{1/(1+\delta)},v_{1}^{1/2}\}.$ Let $s$ be the size of the true support ${\mathcal{S}}=\mathrm{supp}(\bm{\beta}^{*})$ . We will show that the regularized Huber estimator achieves an upper bound that is of the order $s^{1/2}\{(\log d)/{n}\}^{\min\{\delta/(1+\delta),1/2\}}$ for estimating $\bm{\beta}^{*}$ in $\ell_{2}$ -error with high probability.
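
As a rough illustration of how the $\ell_{1}$-regularized problem can be solved numerically, the sketch below runs proximal gradient iterations built on the soft-thresholding operator. It is a simplified stand-in under stated assumptions: the quadratic coefficient `phi` is held fixed at a conservative value rather than tuned per iteration, the function names are hypothetical, and the inputs are plain Python lists.

```python
def soft_threshold(z: float, thresh: float) -> float:
    """Soft-thresholding operator S(z, thresh) = sign(z) * max(|z| - thresh, 0)."""
    if z > thresh:
        return z - thresh
    if z < -thresh:
        return z + thresh
    return 0.0


def huber_gradient(X, y, beta, tau):
    """Gradient of the empirical Huber loss L_tau at beta."""
    n, d = len(X), len(beta)
    grad = [0.0] * d
    for x_i, y_i in zip(X, y):
        residual = y_i - sum(a * b for a, b in zip(x_i, beta))
        clipped = max(-tau, min(tau, residual))  # derivative of the Huber loss
        for j in range(d):
            grad[j] -= clipped * x_i[j] / n
    return grad


def regularized_huber(X, y, tau, lam, steps=500):
    """Proximal gradient iterations for L_tau(beta) + lam * ||beta||_1."""
    n, d = len(X), len(X[0])
    # The trace of the Gram matrix bounds its largest eigenvalue, and hence the
    # curvature of L_tau, so phi is a safe (conservative) quadratic coefficient.
    phi = sum(v * v for row in X for v in row) / n
    beta = [0.0] * d
    for _ in range(steps):
        grad = huber_gradient(X, y, beta, tau)
        beta = [soft_threshold(b - g / phi, lam / phi) for b, g in zip(beta, grad)]
    return beta


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    y = [2.0, -1.0, 1.0, 3.0]  # noise-free responses with beta* = (2, -1)
    print(regularized_huber(X, y, tau=10.0, lam=0.75))
```

In this toy design the Gram matrix is $0.75\,\mathbf{I}$ and no residual is clipped, so the minimizer is the coordinatewise soft-thresholding of $\bm{\beta}^{*}=(2,-1)$ at level $\lambda/0.75=1$: the smaller coefficient is set exactly to zero and the larger one is shrunk.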

To unify the nonasymptotic upper bounds in the two different regimes, we define the effective dimension, $d_{\textnormal{eff}}$ , to be $d$ in low dimensions and $s$ in high dimensions. In other words, $d_{\textnormal{eff}}$ denotes the number of nonzero parameters of the problem. The effective sample size, $n_{\textnormal{eff}}$ , is defined as $n_{\textnormal{eff}}=n$ and $n_{\textnormal{eff}}=n/\log d$ in low and high dimensions, respectively. We will establish a phase transition: when $\delta\geq 1$ , the proposed estimator enjoys a sub-Gaussian concentration, while it only achieves a slower concentration when $0<\delta<1$ . Specifically, we show that, for any $\delta\in(0,\infty)$ , the proposed estimators with $\tau\asymp\min\{v_{\delta}^{1/(1+\delta)},v_{1}^{1/2}\}\,n_{\textnormal{eff}}^{\max\{1/(1+\delta),1/2\}}$ achieve the following tight upper bound, up to logarithmic factors:

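$$\big\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\big\|_{2}\lesssim d_{\textnormal{eff}}^{1/2}\,n_{\textnormal{eff}}^{-\min\{\delta/(1+\delta),1/2\}}~~\textnormal{with high probability.}$$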

This finding is summarized in Figure 1.
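
The unified rate is easy to tabulate. The following sketch evaluates the exponent $\min\{\delta/(1+\delta),1/2\}$ and the resulting order $d_{\textnormal{eff}}^{1/2}n_{\textnormal{eff}}^{-\min\{\delta/(1+\delta),1/2\}}$, ignoring constants and moment factors; the helper names are hypothetical and chosen only for this illustration.

```python
import math


def rate_exponent(delta: float) -> float:
    """Exponent min{delta / (1 + delta), 1/2} of the effective sample size."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return min(delta / (1.0 + delta), 0.5)


def effective_sizes(n: int, d: int, s=None):
    """Return (d_eff, n_eff): (d, n) in low dimensions, (s, n / log d) when a sparsity s is given."""
    if s is None:
        return d, float(n)
    return s, n / math.log(d)


def error_order(n: int, d: int, delta: float, s=None) -> float:
    """Order of the l2-error bound, up to constants: d_eff^{1/2} * n_eff^{-exponent}."""
    d_eff, n_eff = effective_sizes(n, d, s)
    return math.sqrt(d_eff) * n_eff ** (-rate_exponent(delta))


if __name__ == "__main__":
    # The exponent grows with delta until delta = 1 and stays at 1/2 afterwards.
    for delta in (0.25, 0.5, 1.0, 2.0):
        print(delta, rate_exponent(delta), error_order(n=100, d=5, delta=delta))
```

The exponent is continuous at $\delta=1$, which reflects the smoothness of the transition: the rate improves steadily as more moments become available and saturates at the sub-Gaussian rate once the second moment is finite.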

3 Nonasymptotic Theory

3.1 Adaptive Huber Regression with Increasing Dimensions

We begin with the adaptive Huber regression in the low dimensional regime. First, we provide an upper bound for the estimation bias of Huber regression. We then establish the phase transition by establishing matching upper and lower bounds on the $\ell_{2}$ -error. The analysis is carried out under both fixed and random designs. The results under random designs are provided in the supplementary material. We start with the following regularity condition.

Condition 1.

The empirical Gram matrix $\mathbf{S}_{n}:=n^{-1}\sum_{i=1}^{n}\bm{x}_{i}\bm{x}_{i}^{\mathrm{\scriptstyle T}}$ is nonsingular. Moreover, there exist constants $c_{l}$ and $c_{u}$ such that $c_{l}\leq\lambda_{\min}(\mathbf{S}_{n})\leq\lambda_{\max}(\mathbf{S}_{n})\leq c_{u}$ .

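For any $\tau>0$ , $\widehat{\bm{\beta}}_{\tau}$ given in (3) is a natural $M$ -estimator of

$$\bm{\beta}^{*}_{\tau}:=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\mathbb{E}\{\mathcal{L}_{\tau}(\bm{\beta})\}=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\{\ell_{\tau}(y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle)\},$$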

where the expectation is taken over the regression errors. We call $\bm{\beta}^{*}_{\tau}$ the Huber regression coefficient, which is possibly different from the vector of true parameters $\bm{\beta}^{*}$ . The estimation bias, measured by $\|\bm{\beta}^{*}_{\tau}-\bm{\beta}^{*}\|_{2}$ , is a direct consequence of robustification and asymmetric error distributions. Heuristically, choosing a sufficiently large $\tau$ reduces bias at the cost of losing robustness (the extreme case of $\tau=\infty$ corresponds to the least squares estimator). Our first result shows how the magnitude of $\tau$ affects the bias $\|\bm{\beta}^{*}_{\tau}-\bm{\beta}^{*}\|_{2}$ . Recall that $v_{\delta}=n^{-1}\sum_{i=1}^{n}v_{i,\delta}$ with $v_{i,\delta}=\mathbb{E}(|\varepsilon_{i}|^{1+\delta})$ .

Proposition 1.

Assume Condition 1 holds and that $v_{\delta}$ is finite for some $\delta>0$ . Then, the vector $\bm{\beta}^{*}_{\tau}$ of Huber regression coefficients satisfies

[TABLE]

provided $\tau\geq(4v_{\delta}\widetilde{M}^{2})^{1/(1+\delta)}$ for $0<\delta<1$ or $\tau\geq(2v_{1})^{1/2}\widetilde{M}$ for $\delta\geq 1$ , where $\widetilde{M}=\max_{1\leq i\leq n}\|\mathbf{S}_{n}^{-1/2}\bm{x}_{i}\|_{2}.$

The total estimation error $\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\|_{2}$ can therefore be decomposed into two parts

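$$\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\|_{2}\leq\underbrace{\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}_{\tau}\|_{2}}_{\text{estimation error}}+\underbrace{\|\bm{\beta}^{*}_{\tau}-\bm{\beta}^{*}\|_{2}}_{\text{approximation bias}},$$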

where the approximation bias is of order $\tau^{-\delta}$ . A large $\tau$ reduces the bias but compromises the degree of robustness. Thus an optimal estimator is the one with $\tau$ diverging at a certain rate to achieve the optimal tradeoff between estimation error and approximation bias. Our next result presents nonasymptotic upper bounds on the $\ell_{2}$ -error with an exponential-type exception probability, when $\tau$ is properly tuned. Recall that $\nu_{\delta}=\min\{v_{\delta}^{1/(1+\delta)},v_{1}^{1/2}\}$ for any $\delta>0$ .

Theorem 1 (Upper Bound).

Assume Condition 1 holds and $v_{\delta}<\infty$ for some $\delta>0$ . Let $L=\max_{1\leq i\leq n}\|\bm{x}_{i}\|_{\infty}$ and assume $n\geq C(L,c_{l})d^{2}t$ for some $C(L,c_{l})>0$ depending only on $L$ and $c_{l}$ . Then, for any $t>0$ and $\tau_{0}\geq\nu_{\delta}$ , the estimator $\widehat{\bm{\beta}}_{\tau}$ with $\tau=\tau_{0}(n/t)^{\max\{1/(1+\delta),1/2\}}$ satisfies the bound

[TABLE]

with probability at least $1-(2d+1)e^{-t}$ .
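The tuning rule in Theorem 1 can be sketched in a few lines of Python. The sketch assumes the usual form of the Huber loss (quadratic on $[-\tau,\tau]$ , linear outside), and the function names `huber_loss` and `adaptive_tau` are hypothetical. The value of $\tau_{0}$ must be supplied by the user, since $\nu_{\delta}$ is unknown in practice.

```python
import math

def huber_loss(x, tau):
    """Huber loss: x^2/2 on [-tau, tau], tau*|x| - tau^2/2 outside (standard form, assumed)."""
    if abs(x) <= tau:
        return 0.5 * x * x
    return tau * abs(x) - 0.5 * tau * tau

def adaptive_tau(tau0, n, t, delta):
    """Robustification parameter tau = tau0 * (n/t)^{max{1/(1+delta), 1/2}}."""
    if min(tau0, n, t, delta) <= 0:
        raise ValueError("tau0, n, t and delta must be positive")
    return tau0 * (n / t) ** max(1.0 / (1.0 + delta), 0.5)

# Example: finite variance (delta = 1) and confidence level t = log(n d).
n, d = 1000, 5
tau = adaptive_tau(tau0=1.0, n=n, t=math.log(n * d), delta=1.0)
```

Because the exponent is at least $1/2$ , $\tau$ diverges with $n$ , so the loss approaches the quadratic loss as more data become available.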

Remark 1.

It is worth mentioning that the proposed robust estimator depends on the unknown parameter $v_{\delta}^{1/(1+\delta)}$ . Adaptation to the unknown moment is indeed another important problem. In Section 6, we suggest a simple cross-validation scheme for choosing $\tau$ with desirable numerical performance. A general adaptive construction of $\tau$ can be obtained via Lepski’s method (Lepski, 1991), which is more challenging due to unspecified constants. In the supplementary material, we discuss a variant of Lepski’s method and establish its theoretical guarantee.

Remark 2.

We do not assume $\mathbb{E}(|\varepsilon_{i}|^{1+\delta}|\bm{x}_{i})$ to be a constant, and hence the proposed method accommodates heteroscedastic regression models. For example, $\varepsilon_{i}$ can take the form of $\sigma(\bm{x}_{i})v_{i}$ , where $\sigma:\mathbb{R}^{d}\to(0,\infty)$ is a positive function, and $v_{i}$ are random variables satisfying $\mathbb{E}(v_{i})=0$ and $\mathbb{E}(|v_{i}|^{1+\delta})<\infty$ .

Remark 3.

We need the scaling condition to go roughly as $n\gtrsim d^{2}t$ under fixed designs. With random designs, we show that the scaling condition can be relaxed to $n\gtrsim d+t$ . Details are given in the supplementary material.

Theorem 1 indicates that, with only bounded $(1+\delta)$ -th moment, the adaptive Huber estimator achieves the upper bound $d^{1/2}n^{-\min\{\delta/(1+\delta),1/2\}}$ , up to a logarithmic factor, by setting $t=\log(nd)$ . A natural question is whether the upper bound in (8) is optimal. To address this, we provide a matching lower bound up to a logarithmic factor. Let $\mathcal{P}_{\delta}^{v_{\delta}}$ be the class of all distributions on $\mathbb{R}$ whose $(1+\delta)$ -th absolute central moment equals $v_{\delta}.$ Let $\mathbf{X}=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\mathrm{\scriptstyle T}}=(\bm{x}^{1},\ldots,\bm{x}^{d})\in\mathbb{R}^{n\times d}$ be the design matrix and $\mathcal{U}_{n}=\{\bm{u}:\bm{u}\in\{-1,1\}^{n}\}.$

Theorem 2 (Lower Bound).

Assume that the regression errors $\varepsilon_{i}$ are i.i.d. from a distribution in $\mathcal{P}_{\delta}^{v_{\delta}}$ with $\delta>0$ . Suppose there exists a $\bm{u}\in\mathcal{U}_{n}$ such that $\|n^{-1}\mathbf{X}^{\mathrm{\scriptstyle T}}\bm{u}\|_{\min}\geq\alpha$ for some $\alpha>0$ . Then, for any $t\in[0,n/2]$ and any estimator $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}(y_{1},\ldots,y_{n},t)$ possibly depending on $t$ , we have

[TABLE]

where $c_{u}\geq\lambda_{\max}(\mathbf{S}_{n})$ .

Theorem 2 reveals that root- $n$ consistency with exponential concentration is impossible when $\delta\in(0,1)$ . It extends the phenomenon observed in Theorem 3.1 in Devroye et al. (2016) for estimating a mean. In addition to the eigenvalue assumption, we need to assume that there exists a $\bm{u}\in\mathcal{U}_{n}\subseteq\mathbb{R}^{n}$ such that the minimum angle between $n^{-1}\bm{u}$ and $\bm{x}^{j}$ is non-vanishing. This assumption comes from the intuition that the linear subspace spanned by $\bm{x}^{j}$ is at most of rank $d$ and thus cannot span the whole space $\mathbb{R}^{n}.$ This assumption naturally holds in the univariate case where $\mathbf{X}=(1,\ldots,1)^{\mathrm{\scriptstyle T}}$ and we can take $\bm{u}=(1,\ldots,1)^{\mathrm{\scriptstyle T}}$ and $\alpha=1$ . More generally, $\|\mathbf{X}^{\mathrm{\scriptstyle T}}\bm{u}/n\|_{\min}=\min\{|\bm{u}^{\mathrm{\scriptstyle T}}\bm{x}^{1}|/n,\ldots,|\bm{u}^{\mathrm{\scriptstyle T}}\bm{x}^{d}|/n\}$ . Taking $|\bm{u}^{\mathrm{\scriptstyle T}}\bm{x}^{1}|/n$ as an example, since $\bm{u}\in\{-1,+1\}^{n}$ , we can assume that each coordinate of $\bm{x}^{1}$ is positive. In this case, $\bm{u}^{\mathrm{\scriptstyle T}}\bm{x}^{1}/n=\sum_{i=1}^{n}|x^{1}_{i}|/n\geq\min_{i}{|x_{i}^{1}|}$ , which is strictly positive with probability one, assuming $\bm{x}^{1}$ is drawn from a continuous distribution.
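The quantity $\|\mathbf{X}^{\mathrm{\scriptstyle T}}\bm{u}/n\|_{\min}$ is easy to evaluate for a given sign vector. The following Python sketch does so for a design stored as a list of rows; the helper name `min_abs_projection` is hypothetical, and the two examples mirror the univariate case and the sign-matching argument described above.

```python
def min_abs_projection(X, u):
    """Return min_j |u^T x^j| / n, where x^j is the j-th column of the n x d design X."""
    n, d = len(X), len(X[0])
    if len(u) != n or any(ui not in (-1, 1) for ui in u):
        raise ValueError("u must be a vector in {-1, +1}^n")
    return min(abs(sum(u[i] * X[i][j] for i in range(n))) / n for j in range(d))

# Univariate case: X = (1, ..., 1)^T and u = (1, ..., 1)^T give alpha = 1.
alpha_univariate = min_abs_projection([[1.0]] * 5, [1] * 5)

# Matching the signs of a column turns u^T x^1 / n into the mean of |x_i^1|.
column = [[2.0], [-1.0], [3.0]]
signs = [1 if row[0] > 0 else -1 for row in column]
alpha_signed = min_abs_projection(column, signs)
```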

Together, the upper and lower bounds show that the adaptive Huber estimator achieves near-optimal deviations. Moreover, they indicate that the Huber estimator with an adaptive $\tau$ exhibits a sharp phase transition: when $\delta\geq 1$ , $\widehat{\bm{\beta}}_{\tau}$ converges to $\bm{\beta}^{*}$ at the parametric rate $n^{-1/2}$ , while only a slower rate of order $n^{-\delta/(1+\delta)}$ is available when the second moment does not exist.

Remark 4.

We provide a parallel analysis under random designs in the supplementary material. Beyond the nonasymptotic deviation bounds, we also prove a nonasymptotic Bahadur representation, which establishes a linear approximation of the nonlinear robust estimator. This result paves the way for future research on conducting statistical inference and constructing confidence sets under heavy-tailedness. Additionally, the proposed estimator achieves full efficiency: it is as efficient as the ordinary least squares estimator asymptotically, while the robustness is characterized via nonasymptotic performance.

3.2 Adaptive Huber Regression in High Dimensions

In this section, we study the regularized adaptive Huber estimator in high dimensions where $d$ is allowed to grow with the sample size $n$ exponentially. The analysis is carried out under fixed designs, and results for random designs are again provided in the supplementary material. We start with a modified version of the localized restricted eigenvalue introduced by Fan et al. (2018). Let $\mathbf{H}_{\tau}(\bm{\beta})=\nabla^{2}\mathcal{L}_{\tau}(\bm{\beta})$ denote the Hessian matrix. Recall that ${\mathcal{S}}={\rm supp}(\bm{\beta}^{*})\subseteq\{1,\ldots,d\}$ is the true support set with $|{\mathcal{S}}|=s$ .

Definition 2 (Localized Restricted Eigenvalue, LRE).

The localized restricted eigenvalue of $\mathbf{H}_{\tau}$ is defined as

[TABLE]

where $\mathcal{C}(m,\gamma,r):=\{(\bm{u},\bm{\beta})\in\mathbb{S}^{d-1}\times\mathbb{R}^{d}:\forall J\subseteq\{1,\ldots,d\}~{}{\textnormal{satisfying}}~{}S\subseteq J,|J|\leq m,\|\bm{u}_{J^{c}}\|_{1}\leq\gamma\|\bm{u}_{J}\|_{1},\|\bm{\beta}-\bm{\beta}^{*}\|_{1}\leq r\}$ is a local $\ell_{1}$ -cone.

The LRE is defined in a local neighborhood of $\bm{\beta}^{*}$ under $\ell_{1}$ -norm. This facilitates our proof, while Fan et al. (2018) use the $\ell_{2}$ -norm.

Condition 2.

$\mathbf{H}_{\tau}$ satisfies the localized restricted eigenvalue condition $\textnormal{LRE}(k,\gamma,r)$ , that is, $\kappa_{l}\leq\kappa_{-}(k,\gamma,r)\leq\kappa_{+}(k,\gamma,r)\leq\kappa_{u}$ for some constants $\kappa_{u},\kappa_{l}>0$ .

The condition above is referred to as the LRE condition (Fan et al., 2018). It is a unified condition for studying generalized loss functions, whose Hessians may possibly depend on $\bm{\beta}$ . For Huber loss, Condition 2 also involves the observation noise. The following definition concerns the restricted eigenvalues of $\mathbf{S}_{n}$ instead of $\mathbf{H}_{\tau}$ .

Definition 3 (Restricted Eigenvalue, RE).

The restricted maximum and minimum eigenvalues of $\mathbf{S}_{n}$ are defined respectively as

[TABLE]

where $\mathcal{C}(m,\gamma):=\{\bm{u}\in\mathbb{S}^{d-1}:\forall J\subseteq\{1,\ldots,d\}~{}{\rm satisfying}~{}S\subseteq J,|J|\leq m,\|\bm{u}_{J^{c}}\|_{1}\leq\gamma\|\bm{u}_{J}\|_{1}\}$ .

Condition 3.

$\mathbf{S}_{n}$ satisfies the restricted eigenvalue condition $\textnormal{RE}(k,\gamma)$ , that is, $\kappa_{l}\leq\rho_{-}(k,\gamma)\leq\rho_{+}(k,\gamma)\leq\kappa_{u}$ for some constants $\kappa_{u},\kappa_{l}>0$ .

To make Condition 2 on $\mathbf{H}_{\tau}$ practically useful, in what follows, we show that Condition 3 implies Condition 2 with high probability. As before, we write $v_{\delta}=n^{-1}\sum_{i=1}^{n}v_{i,\delta}$ and $L=\max_{1\leq i\leq n}\|\bm{x}_{i}\|_{\infty}$ .

Lemma 1.

Condition 3 implies Condition 2 with high probability: if $0<\kappa_{l}\leq\rho_{-}(k,\gamma)\leq\rho_{+}(k,\gamma)\leq\kappa_{u}<\infty$ for some $k\geq 1$ and $\gamma>0$ , then it holds with probability at least $1-e^{-t}$ that, $0<\kappa_{l}/2\leq\kappa_{-}(k,\gamma,r)\leq\kappa_{+}(k,\gamma,r)\leq\kappa_{u}<\infty$ provided $\tau\geq\max\{8Lr,c_{1}(L^{2}kv_{\delta})^{1/(1+\delta)}\}$ and $n\geq c_{2}L^{4}k^{2}t$ , where $c_{1},c_{2}>0$ are constants depending only on $(\gamma,\kappa_{l})$ .

With the above preparations in place, we are now ready to present the main results on the adaptive Huber estimator in high dimensions.

Theorem 3 (Upper Bound in High Dimensions).

Assume Condition 3 holds with $(k,\gamma)=(2s,3)$ , $v_{\delta}<\infty$ for some $0<\delta\leq 1$ . For any $t>0$ and $\tau_{0}\geq\nu_{\delta}$ , let $\tau=\tau_{0}(n/t)^{\max\{1/(1+\delta),1/2\}}$ , $\lambda\geq 4L\tau_{0}(t/n)^{\min\{\delta/(1+\delta),1/2\}}$ , and $r>12\kappa_{l}^{-1}s\lambda$ . Then with probability at least $1-(2s+1)e^{-t}$ , the $\ell_{1}$ -regularized Huber estimator $\widehat{\bm{\beta}}_{\tau,\lambda}$ defined in (4) satisfies

[TABLE]

as long as $n\geq C(L,\kappa_{l})s^{2}t$ for some $C(L,\kappa_{l})$ depending only on $(L,\kappa_{l})$ . In particular, with $t=(1+c)\log d$ for $c>0$ we have

[TABLE]

with probability at least $1-d^{-c}$ .
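The parameter choices in Theorem 3 can be collected in a small helper. This is a minimal sketch under the theorem's assumptions ( $0<\delta\leq 1$ ); it returns the smallest admissible $\lambda$ , and the function name and default $c=1$ are illustrative assumptions. The quantities $\tau_{0}$ and $L$ must be supplied by the user.

```python
import math

def theorem3_tuning(tau0, L, n, d, delta, c=1.0):
    """Return (t, tau, lam) with t = (1+c) log d and lam at its lower bound in Theorem 3."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("Theorem 3 is stated for 0 < delta <= 1")
    t = (1.0 + c) * math.log(d)
    tau = tau0 * (n / t) ** max(1.0 / (1.0 + delta), 0.5)
    lam = 4.0 * L * tau0 * (t / n) ** min(delta / (1.0 + delta), 0.5)
    return t, tau, lam

# Example: finite variance, n = 500 observations, d = 1000 covariates.
t, tau, lam = theorem3_tuning(tau0=1.0, L=1.0, n=500, d=1000, delta=1.0)
```

For $\delta=1$ the two parameters move in opposite directions at the same rate, so that $\tau\lambda=4L\tau_{0}^{2}$ regardless of $n$ and $d$ .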

The above result demonstrates that the regularized Huber estimator with an adaptive robustification parameter converges at the rate $s^{1/2}\{(\log d)/n\}^{\min\{\delta/(1+\delta),1/2\}}$ with overwhelming probability. Provided the observation noise has finite variance, the proposed estimator performs as well as the Lasso with sub-Gaussian errors. We advocate the adaptive Huber regression method since the sub-Gaussian condition often fails in practice (Wang, Peng and Li, 2015; Eklund, Nichols and Knutsson, 2016).

Remark 5.

As pointed out by a reviewer, if one pursues a sparsity-adaptive approach, such as the SLOPE (Bogdan et al., 2015; Bellec et al., 2018), the upper bound on $\ell_{2}$ -error can be improved from $\sqrt{s\log(d)/n}$ to $\sqrt{s\log(ed/s)/n}$ . With heavy-tailed observation noise, it is interesting to investigate whether this sharper bound can be achieved by a Huber-type regularized estimator. We leave this to future work as a significant amount of additional work is still needed. On the other hand, since $\log(ed/s)=1+\log d-\log s$ and $s\leq n$ , $\log(ed/s)$ scales the same as $\log d$ so long as $\log d>a\log n$ for some $a>1$ .
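The last claim of Remark 5 follows from a two-sided bound, sketched here under the additional assumption $1\leq s\leq n$ . Since $\log s\leq\log n<a^{-1}\log d$ , both sides are of order $\log d$ :

```latex
\Big(1-\frac{1}{a}\Big)\log d
  \;<\; 1+\log d-\log n
  \;\leq\; 1+\log d-\log s
  \;=\; \log\Big(\frac{ed}{s}\Big)
  \;\leq\; 1+\log d .
```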

Remark 6.

Analogously to the low dimensional case, here we impose the sample size scaling $n\gtrsim s^{2}\log d$ under fixed designs. In the supplementary material, we obtain minimax optimal $\ell_{1}$ -, $\ell_{2}$ - and prediction error bounds for $\widehat{\bm{\beta}}_{\tau,\lambda}$ with random designs under the scaling $n\gtrsim s\log d$ .

Finally, we establish a matching lower bound for estimating $\bm{\beta}^{*}$ . Recall the definition of $\mathcal{U}_{n}$ in Theorem 2.

Theorem 4 (Lower Bound in High Dimensions).

Assume that $\varepsilon_{i}$ are independent from some distribution in $\mathcal{P}_{\delta}^{v_{\delta}}$ . Suppose that Condition 3 holds with $k=2s$ and $\gamma=0$ . Further assume that there exists a set $\mathcal{A}$ with $|\mathcal{A}|=s$ and $\mathbf{u}\in\mathcal{U}_{n}$ such that $\|\mathbf{X}_{\mathcal{A}}^{\mathrm{\scriptstyle T}}\mathbf{u}/n\|_{\min}\geq\alpha$ for some $\alpha>0$ . Then, for any $A>0$ and $s$ -sparse estimator $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}(y_{1},\ldots,y_{n},A)$ possibly depending on $A$ , we have

[TABLE]

as long as $n\geq 2(A\log d+\log 2)$ .

Together, Theorems 3 and 4 show that the regularized adaptive Huber estimator achieves the optimal rate of convergence in $\ell_{2}$ -error. The proof, which is given in the supplementary material, involves constructing a sub-class of binomial distributions for the regression errors. Unifying the results in low and high dimensions, we arrive at the claim (5) and thus the phase transition in Figure 1.

4 Extension to Heavy-tailed Designs

In this section, we extend the idea of adaptive Huber regression described in Section 2 to the case where both the covariate vector $\bm{x}$ and the regression error $\varepsilon$ exhibit heavy tails. We focus on the high dimensional regime $d\gg n$ , where $\bm{\beta}^{*}\in\mathbb{R}^{d}$ is sparse with $s=\|\bm{\beta}^{*}\|_{0}\ll n$ . Observe that, for Huber regression, the linear part of the Huber loss penalizes the residuals, and therefore robustifies the quadratic loss in the sense that outliers in the response space (caused by heavy-tailed observation noise) are downweighted or removed. Since no robustification is imposed on the covariates, intuitively, the adaptive Huber estimator may not be robust against heavy-tailed covariates. In what follows, we modify the adaptive Huber regression to robustify both the covariates and regression errors.

To begin with, suppose we observe independent data $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ from $(y,\bm{x})$ , which follows the linear model $y=\langle\bm{x},\bm{\beta}^{*}\rangle+\varepsilon$ . To robustify $\bm{x}_{i}$ , we define truncated covariates $\bm{x}^{\varpi}_{i}=(\psi_{\varpi}(x_{i1}),\ldots,\psi_{\varpi}(x_{id}))^{\mathrm{\scriptstyle T}}$ , where $\psi_{\varpi}(x):=\min\{\max(-\varpi,x),\varpi\}$ and $\varpi>0$ is a tuning parameter. Then we consider the modified adaptive Huber estimator (see Fan et al. (2016) for a general robustification principle)

[TABLE]

where $\mathcal{L}^{\varpi}_{\tau}(\bm{\beta})=n^{-1}\sum_{i=1}^{n}\ell_{\tau}(y_{i}-\langle\bm{x}^{\varpi}_{i},\bm{\beta}\rangle)$ and $\lambda>0$ is a regularization parameter.
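The truncation step is a componentwise clipping of the design matrix. A minimal Python sketch, with hypothetical helper names and the design stored as a list of rows, is:

```python
def truncate(x, varpi):
    """psi_varpi(x) = min{max(-varpi, x), varpi}: clip x to the interval [-varpi, varpi]."""
    return min(max(-varpi, x), varpi)

def truncate_covariates(X, varpi):
    """Apply psi_varpi to every entry of the design matrix X (list of rows)."""
    if varpi <= 0:
        raise ValueError("varpi must be positive")
    return [[truncate(xij, varpi) for xij in row] for row in X]

# Entries inside [-2, 2] are unchanged; entries outside are pulled to the boundary.
X_trunc = truncate_covariates([[5.0, -0.3], [-7.0, 2.0]], varpi=2.0)
```

The responses are left untouched; the Huber loss handles the heavy tails of the errors, while the truncation bounds the influence of extreme covariate values.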

Let ${\mathcal{S}}$ be the true support of $\bm{\beta}^{*}$ with sparsity $|{\mathcal{S}}|=s$ , and denote by $\mathbf{H}^{\varpi}_{\tau}(\bm{\beta})=\nabla^{2}\mathcal{L}^{\varpi}_{\tau}(\bm{\beta})$ the Hessian matrix of the modified Huber loss. To investigate the deviation property of $\widehat{\bm{\beta}}_{\tau,\varpi,\lambda}$ , we impose the following mild moment assumptions.

Condition 4.

(i) $\mathbb{E}(\varepsilon)=0$ , $\sigma^{2}=\mathbb{E}(\varepsilon^{2})>0$ and $v_{3}:=\mathbb{E}(\varepsilon^{4})<\infty$ ; (ii) The covariate vector $\bm{x}=(x_{1},\ldots,x_{d})^{\mathrm{\scriptstyle T}}\in\mathbb{R}^{d}$ is independent of $\varepsilon$ and satisfies $M_{4}:=\max_{1\leq j\leq d}\mathbb{E}(x_{j}^{4})<\infty$ .

We are now in a position to state the main result of this section. Theorem 5 below demonstrates that the modified adaptive Huber estimator admits exponentially fast concentration when the covariates only have finite fourth moments, although at the cost of stronger scaling conditions.

Theorem 5.

Assume Condition 4 holds and let $\mathbf{H}^{\varpi}_{\tau}(\cdot)$ satisfy Condition 2 with $k=2s$ , $\gamma=3$ and $r>12\kappa_{l}^{-1}\lambda s$ . Then, the modified adaptive Huber estimator $\widehat{\bm{\beta}}_{\tau,\varpi,\lambda}$ given in (11) satisfies, on the event $\mathcal{E}(\tau,\varpi,\lambda)=\big{\{}\|(\nabla\mathcal{L}^{\varpi}_{\tau}(\bm{\beta}^{*}))_{{\mathcal{S}}}\|_{\infty}\leq\lambda/2\big{\}}$ , that

[TABLE]

For any $t>0$ , let the triplet $(\tau,\varpi,\lambda)$ satisfy

[TABLE]

where $v_{2}=\mathbb{E}(|\varepsilon|^{3})$ and $M_{2}=\max_{1\leq j\leq d}\mathbb{E}(x_{j}^{2})$ . Then $\mathbb{P}\{\mathcal{E}(\tau,\varpi,\lambda)\}\geq 1-2se^{-t}$ .

Remark 7.

Assume that the quantities $v_{3}$ , $M_{4}$ and $\|\bm{\beta}^{*}\|_{2}$ are all bounded. Taking $t\asymp\log d$ in (12), we see that $\widehat{\bm{\beta}}_{\tau,\varpi,\lambda}$ achieves a near-optimal convergence rate of order $s\sqrt{(\log d)/n}$ when the parameters $(\tau,\varpi,\lambda)$ scale as

[TABLE]

We remark here that the theoretically optimal $\tau$ is different from that in the sub-Gaussian design case. See Theorem B.2 in the supplementary material.

5 Algorithm and Implementation

This section is devoted to the computational algorithm and its numerical implementation. We focus on the regularized adaptive Huber regression in (4), as (3) can be easily solved via the iteratively reweighted least squares method. To solve the convex optimization problem in (4), standard optimization algorithms, such as the cutting-plane or interior point method, are not scalable to large-scale problems.
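For the unregularized problem (3), one possible iteratively reweighted least squares scheme is sketched below in Python with NumPy. The paper does not spell out the iteration; the sketch assumes the standard Huber weights $w_{i}=\min\{1,\tau/|r_{i}|\}$ for residuals $r_{i}$ , an ordinary least squares start, and a hypothetical function name `huber_irls`.

```python
import numpy as np

def huber_irls(X, y, tau, max_iter=100, tol=1e-8):
    """Huber regression by iteratively reweighted least squares (a sketch, not the authors' code)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares starting value
    for _ in range(max_iter):
        r = y - X @ beta
        # Huber weights: 1 for |r_i| <= tau, tau / |r_i| otherwise.
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        XtW = X.T * w  # scales column i of X^T by w_i, i.e. X^T W
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.linalg.norm(beta_new - beta) <= tol:
            return beta_new
        beta = beta_new
    return beta

# Toy example: y = 2x exactly, except for one gross outlier in the last response.
X = np.arange(1.0, 11.0).reshape(-1, 1)
y = 2.0 * X[:, 0]
y[-1] += 100.0
beta_huber = huber_irls(X, y, tau=1.0)   # stays close to the true slope 2
beta_large = huber_irls(X, y, tau=1e9)   # all weights equal 1: least squares
```

With a very large $\tau$ every weight equals one and the iteration returns the least squares fit after a single step, which matches the remark that $\tau=\infty$ corresponds to least squares.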

In what follows, we describe a fast and easily implementable method using the local adaptive majorize-minimization (LAMM) principle (Fan et al., 2018). We say that a function $g(\bm{\beta}|\bm{\beta}^{(k)})$ majorizes $f(\bm{\beta})$ at the point $\bm{\beta}^{(k)}$ if

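$$g(\bm{\beta}|\bm{\beta}^{(k)})\geq f(\bm{\beta})~~\text{for all}~\bm{\beta}\quad\text{and}\quad g(\bm{\beta}^{(k)}|\bm{\beta}^{(k)})=f(\bm{\beta}^{(k)}).$$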

To minimize a general function $f(\bm{\beta})$ , a majorize-minimization (MM) algorithm initializes at $\bm{\beta}^{(0)}$ , and then iteratively computes $\bm{\beta}^{(k+1)}=\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}g(\bm{\beta}|\bm{\beta}^{(k)})$ for $k=0,1,\ldots$ . The objective value of such an algorithm decreases in each step, since

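$$f(\bm{\beta}^{(k+1)})\leq g(\bm{\beta}^{(k+1)}|\bm{\beta}^{(k)})\leq g(\bm{\beta}^{(k)}|\bm{\beta}^{(k)})=f(\bm{\beta}^{(k)}).$$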

As pointed out by Fan et al. (2018), the majorization requirement only needs to hold locally at $\bm{\beta}^{(k+1)}$ when starting from $\bm{\beta}^{(k)}$ . We therefore locally majorize $\mathcal{L}_{\tau}(\bm{\beta})$ in (4) at $\bm{\beta}^{(k)}$ by an isotropic quadratic function

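$$g_{k}(\bm{\beta}|\bm{\beta}^{(k)})=\mathcal{L}_{\tau}(\bm{\beta}^{(k)})+\langle\nabla\mathcal{L}_{\tau}(\bm{\beta}^{(k)}),\bm{\beta}-\bm{\beta}^{(k)}\rangle+\frac{\phi_{k}}{2}\|\bm{\beta}-\bm{\beta}^{(k)}\|_{2}^{2},$$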

where $\phi_{k}$ is a quadratic parameter such that $g_{k}(\bm{\beta}^{(k+1)}|\bm{\beta}^{(k)})\geq\mathcal{L}_{\tau}(\bm{\beta}^{(k+1)})$ . The isotropic form also allows a simple analytic solution to the subsequent majorized optimization problem:

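$$\min_{\bm{\beta}\in\mathbb{R}^{d}}\Big\{\langle\nabla\mathcal{L}_{\tau}(\bm{\beta}^{(k)}),\bm{\beta}-\bm{\beta}^{(k)}\rangle+\frac{\phi_{k}}{2}\|\bm{\beta}-\bm{\beta}^{(k)}\|_{2}^{2}+\lambda\|\bm{\beta}\|_{1}\Big\}.\tag{14}$$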

It can be shown that (14) is minimized at

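$$\bm{\beta}^{(k+1)}=S\big(\bm{\beta}^{(k)}-\phi_{k}^{-1}\nabla\mathcal{L}_{\tau}(\bm{\beta}^{(k)}),\,\phi_{k}^{-1}\lambda\big),$$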

where $S(\mathbf{x},\lambda)$ is the soft-thresholding operator, applied componentwise as $S(\mathbf{x},\lambda)=\big(\text{sign}(x_{j})\max(|x_{j}|-\lambda,0)\big)_{j=1}^{d}$ . The simplicity of this updating rule is due to the fact that (14) is an unconstrained optimization problem.

To find the smallest $\phi_{k}$ such that $g_{k}(\bm{\beta}^{(k+1)}|\bm{\beta}^{(k)})\geq\mathcal{L}_{\tau}(\bm{\beta}^{(k+1)})$ , the basic idea of LAMM is to start from a relatively small isotropic parameter $\phi_{k}=\phi_{k}^{0}$ and then successively inflate $\phi_{k}$ by a factor $\gamma_{u}>1$ , say $\gamma_{u}=2$ . If the solution satisfies $g_{k}(\bm{\beta}^{(k+1)}|\bm{\beta}^{(k)})\geq\mathcal{L}_{\tau}(\bm{\beta}^{(k+1)})$ , we stop and obtain $\bm{\beta}^{(k+1)}$ , which makes the target value non-increasing. We then continue with the iteration to produce the next solution until the solution sequence $\{\bm{\beta}^{(k)}\}_{k=1}^{\infty}$ converges. A simple stopping criterion is $\|\bm{\beta}^{(k+1)}-\bm{\beta}^{(k)}\|_{2}\leq\epsilon$ for a sufficiently small $\epsilon$ , say $10^{-4}$ . We refer to Fan et al. (2018) for a detailed complexity analysis of the LAMM algorithm.
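The LAMM iteration described above can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the standard Huber loss averaged over the sample, a zero starting value, and that the search for $\phi_{k}$ restarts from the same `phi0` in every outer iteration. All function names and default values are hypothetical.

```python
import numpy as np

def huber_loss_grad(X, y, beta, tau):
    """Average Huber loss L_tau(beta) and its gradient."""
    r = y - X @ beta
    loss = np.where(np.abs(r) <= tau, 0.5 * r**2, tau * np.abs(r) - 0.5 * tau**2).mean()
    grad = -X.T @ np.clip(r, -tau, tau) / len(y)
    return loss, grad

def soft_threshold(x, lam):
    """Componentwise soft-thresholding sign(x_j) * max(|x_j| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lamm_huber_lasso(X, y, tau, lam, phi0=1e-3, gamma_u=2.0, eps=1e-4, max_iter=1000):
    """l1-regularized Huber regression via local adaptive majorize-minimization."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        loss, grad = huber_loss_grad(X, y, beta, tau)
        phi = phi0
        while True:
            # Closed-form minimizer of the isotropic quadratic majorizer plus l1 penalty.
            beta_new = soft_threshold(beta - grad / phi, lam / phi)
            diff = beta_new - beta
            g_val = loss + grad @ diff + 0.5 * phi * diff @ diff
            # Accept once the quadratic majorizes the loss at beta_new
            # (the tiny slack only guards against rounding error).
            if g_val >= huber_loss_grad(X, y, beta_new, tau)[0] - 1e-12:
                break
            phi *= gamma_u  # inflate the quadratic parameter and retry
        if np.linalg.norm(diff) <= eps:
            return beta_new
        beta = beta_new
    return beta
```

A typical call is `lamm_huber_lasso(X, y, tau=5.0, lam=0.1)`. Accepting the first $\phi_{k}$ that passes the local check keeps the steps as long as the majorization allows, which is the reason for starting the search from a small value.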

6 Numerical Studies

6.1 Tuning Parameter and Finite Sample Performance

For numerical studies and real data analysis, in the case where the actual order of moments is unspecified, we presume the variance is finite and therefore choose robustification and regularization parameters as follows:

[TABLE]

where $\widehat{\sigma}^{2}=n^{-1}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}$ with $\bar{y}=n^{-1}\sum_{i=1}^{n}y_{i}$ serves as a crude preliminary estimate of $\sigma^{2}$ , and the parameter $t$ controls the confidence level. We set $t=\log n$ for simplicity except for the phase transition plot. The constants $c_{\tau}$ and $c_{\lambda}$ are chosen via 3-fold cross-validation from a small set of constants, say $\{0.5,1,1.5\}$ .

We generate data from the linear model

$$y_{i}=\langle\bm{x}_{i},\bm{\beta}^{*}\rangle+\varepsilon_{i},\quad i=1,\ldots,n,\tag{15}$$

where $\varepsilon_{i}$ are i.i.d. regression errors and $\bm{\beta}^{*}=(5,-2,0,0,3,\underbrace{0,\ldots,0}_{d-5})^{\mathrm{\scriptstyle T}}\in\mathbb{R}^{d}.$ Independent of $\varepsilon_{i}$ , we generate $\bm{x}_{i}$ from standard multivariate normal distribution $\mathcal{N}({\bf 0},\mathbf{I}_{d})$ . In this section, we set $(n,d)=(100,5)$ , and generate regression errors from three different distributions: the normal distribution $\mathcal{N}(0,4)$ , the $t$ -distribution with degrees of freedom 1.5, and the log-normal distribution $\log\mathcal{N}(0,4)$ . Both $t$ and log-normal distributions are heavy-tailed, and produce outliers with high probability.
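The simulation design can be reproduced along the following lines in Python with NumPy. The sketch assumes that the second argument of $\mathcal{N}(0,4)$ and $\log\mathcal{N}(0,4)$ is a variance (standard deviation 2 on the normal scale); the function name `simulate` and the seeding scheme are illustrative assumptions.

```python
import numpy as np

def simulate(n=100, d=5, noise="normal", seed=0):
    """Draw (X, y, beta_star) from the linear model with the stated error distributions."""
    if d < 5:
        raise ValueError("d must be at least 5")
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(d)
    beta_star[:5] = [5.0, -2.0, 0.0, 0.0, 3.0]
    X = rng.standard_normal((n, d))  # rows drawn from N(0, I_d)
    if noise == "normal":
        eps = rng.normal(0.0, 2.0, size=n)     # N(0, 4), reading 4 as the variance
    elif noise == "t":
        eps = rng.standard_t(1.5, size=n)      # Student's t with 1.5 degrees of freedom
    elif noise == "lognormal":
        eps = rng.lognormal(0.0, 2.0, size=n)  # log N(0, 4), reading 4 as the log-scale variance
    else:
        raise ValueError("noise must be 'normal', 't' or 'lognormal'")
    y = X @ beta_star + eps
    return X, y, beta_star

X, y, beta_star = simulate(n=100, d=5, noise="t", seed=1)
```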

The results on $\ell_{2}$ -error for adaptive Huber regression and the least squares estimator, averaged over 100 simulations, are summarized in Table 1. In the case of normally distributed noise, the adaptive Huber estimator performs as well as the least squares. With heavy-tailed regression errors following Student’s $t$ or log-normal distribution, the adaptive Huber regression significantly outperforms the least squares. These empirical results reveal that adaptive Huber regression prevails across various scenarios: not only does it provide more reliable estimators in the presence of heavy-tailed and/or asymmetric errors, but it also loses almost no efficiency at the normal model.

6.2 Phase Transition

In this section, we validate the phase transition behavior of $\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\|_{2}$ empirically. We generate continuous responses according to (15), where $\bm{\beta}^{*}$ and $\bm{x}_{i}$ are set the same way as before. We sample independent errors as $\varepsilon_{i}\sim t_{\textnormal{df}}$ , Student’s $t$ -distribution with df degrees of freedom. Note that $t_{\textnormal{df}}$ has finite $(1+\delta)$ -th moments provided $\delta<\textnormal{df}-1$ and infinite df-th moment. Therefore, we take $\delta=\textnormal{df}-1-0.05$ throughout.

In low dimensions, we take $(n,d)=(500,5)$ and a sequence of degrees of freedom (df’s): $\textnormal{df}\!\in\!\{1.1,1.2,\ldots,3.0\}$ ; in high dimensions, we take $(n,d)=(500,1000)$ , with the same choice of df’s. Tuning parameters $(\tau,\lambda)$ are calibrated similarly as before. As indicated by the main theorems, it holds that

(Low dimension):

[TABLE]

(High dimension):

[TABLE]

which are approximately $\log(n)\times\delta/(1+\delta)$ and $\log(n/\log d)\times\delta/(1+\delta)$ , respectively, when $n$ is sufficiently large.
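The predicted leading term of the negative log-error can be tabulated directly. The Python sketch below uses $\delta=\textnormal{df}-1-0.05$ as in the text and, as an assumption based on the rates in the main theorems, caps the exponent at $1/2$ for $\delta\geq 1$ ; the function names are hypothetical.

```python
import math

def delta_from_df(df):
    """delta = df - 1 - 0.05, so that the (1 + delta)-th moment of t_df is finite."""
    return df - 1.0 - 0.05

def predicted_neg_log_error(n, delta, d=None):
    """Leading term log(n_eff) * min{delta/(1+delta), 1/2}; pass d for the high dimensional case."""
    n_eff = n if d is None else n / math.log(d)
    return math.log(n_eff) * min(delta / (1.0 + delta), 0.5)

# Degrees of freedom 1.1, 1.2, ..., 3.0 as in the experiment.
dfs = [round(1.0 + 0.1 * k, 1) for k in range(1, 21)]
low_dim = [predicted_neg_log_error(500, delta_from_df(df)) for df in dfs]
high_dim = [predicted_neg_log_error(500, delta_from_df(df), d=1000) for df in dfs]
```

Both sequences increase with df until $\delta$ reaches one and are flat afterwards, which is the shape the empirical curves are compared against.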

Figure 2 displays the negative $\log$ $\ell_{2}$ -error versus $\delta$ in both low and high dimensions over 200 repetitions for each $(n,d)$ combination. The empirically fitted curve closely resembles the theoretical curve displayed in Figure 1. These numerical results are in line with the theoretical findings, and empirically validate the phase transition of the adaptive Huber estimator.

We also compared the $\ell_{2}$ -error of the adaptive Huber estimator with that of the OLS estimator for $t$ -distributed errors with varying degrees of freedom. As shown in Figure 3, adaptive Huber exhibits a significant advantage especially when $\delta$ is small. The OLS slowly catches up as $\delta$ increases.

6.3 Effective Sample Size

In this section, we verify the scaling behavior of $\|\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}\|_{2}$ with respect to the effective sample size. The data are generated in the same way as before except that the errors are drawn from $t_{1.5}$ . As discussed in the previous subsection, we take $\delta=0.45$ and then choose the robustification parameter as $\tau=c_{\tau}\widehat{v}_{\delta}({n}/{\log d})^{1/(1+\delta)},$ where $\widehat{v}_{\delta}$ is the $(1+\delta)$ -th sample absolute central moment. For simplicity, we take $c_{\tau}=0.5$ here since our goal is to demonstrate the scaling behavior as $n$ grows, instead of to achieve the best finite-sample performance.

The left panel of Figure 4 plots the $\ell_{2}$ -error $\|\widehat{\bm{\beta}}_{\tau,\lambda}-\bm{\beta}^{*}\|_{2}$ versus sample size over 200 repetitions when the dimension $d\in\{100,500,5000\}$ . In all three settings, the $\ell_{2}$ -error decays as the sample size grows. As expected, the curves shift to the right when the dimension increases. Theorem 3 provides a specific prediction about this scaling behavior: if we plot the $\ell_{2}$ -error versus effective sample size ( $n/\log d$ ), the curves should align roughly with the theoretical curve

[TABLE]

for different values of $d$ . This is validated empirically by the right panel of Figure 4. This near-perfect alignment in Figure 4 is also observed by Wainwright (2009) for Lasso with sub-Gaussian errors.

6.4 A Real Data Example: NCI-60 Cancer Cell Lines

We apply the proposed methodologies to the NCI-60, a panel of 60 diverse human cancer cell lines. The NCI-60 consists of data on 60 human cancer cell lines and can be downloaded from http://discover.nci.nih.gov/cellminer/. More details on data acquisition can be found in Shankavaram et al. (2007). Our aim is to investigate the effects of genes on protein expressions. The gene expression data were obtained with an Affymetrix HG-U133A/B chip, $\log_{2}$ transformed and normalized with the guanine cytosine robust multi-array analysis. We then combined the same gene expression variables measured by multiple different probes into one by taking their median, resulting in a set of $p=17,924$ predictors. The protein expressions based on 162 antibodies were acquired via reverse-phase protein lysate arrays in their original scale. One observation had to be removed since all values were missing in the gene expression data, reducing the number of observations to $n=59$ .

We first center all the protein and gene expression variables to have mean zero, and then plot the histograms of the kurtoses of all expressions in Figure 5. The left panel in the figure shows that 145 out of 162 protein expressions have kurtoses larger than 3, and 49 have kurtoses larger than 9. In other words, more than 89.5% of the protein expression variables have tails heavier than the normal distribution, and about 30.2% are severely heavy-tailed with tails flatter than $t_{5}$ , the $t$ -distribution with 5 degrees of freedom. Similarly, about 36.5% of the gene expression variables, even after the $\log_{2}$ -transformation, still exhibit empirical kurtoses larger than that of $t_{5}$ . This suggests that, regardless of the normalization methods used, genomic data can still exhibit heavy-tailedness, which was also pointed out by Purdom and Holmes (2005).
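The thresholds 3 and 9 are the kurtoses of the normal distribution and of $t_{5}$ , respectively. A short Python sketch of the computation follows; it uses the moment-based sample kurtosis $m_{4}/m_{2}^{2}$ , which is an assumption since the exact estimator used for Figure 5 is not stated, and the function names are hypothetical.

```python
def sample_kurtosis(x):
    """Moment-based sample kurtosis m4 / m2^2 (about 3 for normal data)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2

def t_kurtosis(df):
    """Population kurtosis 3 + 6 / (df - 4) of Student's t, finite only for df > 4."""
    if df <= 4:
        raise ValueError("the kurtosis of t_df is infinite or undefined for df <= 4")
    return 3.0 + 6.0 / (df - 4.0)

threshold = t_kurtosis(5)                       # kurtosis of t_5
share_heavier_than_normal = 100 * 145 / 162     # protein expressions with kurtosis > 3
share_heavier_than_t5 = 100 * 49 / 162          # protein expressions with kurtosis > 9
```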

We order the protein expression variables according to their scales, measured by the standard deviation. We show the results for the protein expression based on the KRT19 antibody (the protein keratin 19), which has the largest standard deviation, as the dependent variable. KRT19, a type I keratin, also known as Cyfra 21-1, is encoded by the KRT19 gene. Due to its high sensitivity, the KRT19 antibody is the most used marker for the tumor cells disseminated in lymph nodes, peripheral blood, and bone marrow of breast cancer patients (Nakata et al., 2004). We denote the adaptive Huber regression as AHuber, and that with truncated covariates as TAHuber. We then compare AHuber and TAHuber with Lasso. Both regularization and robustification parameters are chosen by ten-fold cross-validation.

To measure the predictive performance, we consider a robust prediction loss: the mean absolute error (MAE) defined as

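$$\textnormal{MAE}(\widehat{\bm{\beta}})=\frac{1}{n_{\textnormal{test}}}\sum_{i=1}^{n_{\textnormal{test}}}\big|y^{\textnormal{test}}_{i}-\langle\bm{x}^{\textnormal{test}}_{i},\widehat{\bm{\beta}}\rangle\big|,$$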

where $y^{\textnormal{test}}_{i}$ and $\bm{x}^{\textnormal{test}}_{i}$ , $i=1,\ldots,n_{\textnormal{test}}$ , denote the observations of the response and predictor variables in the test data, respectively. We report the MAE via leave-one-out cross-validation. Table 2 reports the MAE, model size and selected genes for the considered methods. TAHuber clearly shows the smallest MAE, followed by AHuber and Lasso. The Lasso produces a fairly large model despite the small sample. It is now recognized that Lasso tends to select many noise variables along with the significant ones, especially when data exhibit heavy tails.
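The evaluation protocol can be sketched in Python as follows, taking the MAE to be the average absolute prediction error on the held-out data. The data are stored as plain lists, and `fit` is a hypothetical user-supplied callable that returns a coefficient vector for a training set; it stands in for any of the three methods compared.

```python
def mean_absolute_error(y_test, X_test, beta_hat):
    """Average of |y_i - <x_i, beta_hat>| over the test observations."""
    preds = [sum(xj * bj for xj, bj in zip(x, beta_hat)) for x in X_test]
    return sum(abs(y - p) for y, p in zip(y_test, preds)) / len(y_test)

def loo_mae(X, y, fit):
    """Leave-one-out MAE: refit without observation i, then predict observation i."""
    errors = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        beta_hat = fit(X_train, y_train)
        errors.append(mean_absolute_error([y[i]], [X[i]], beta_hat))
    return sum(errors) / len(errors)

# Toy check with a "fit" that ignores the data and always returns slope 1.
toy_mae = loo_mae([[1.0], [2.0], [3.0]], [1.0, 2.0, 5.0], lambda X_tr, y_tr: [1.0])
```

With $n=59$ observations, each method is refitted 59 times, once per held-out sample.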

The Lasso selects a model with 42 genes but excludes the KRT19 gene, which encodes the protein keratin 19. AHuber finds 11 genes including KRT19. TAHuber results in a model with 7 genes: KRT19, MT1E, ARHGAP29, MALL, ANXA3, MAL2, BAMBI. First, KRT19 encodes the keratin 19 protein. It has been reported in Wu et al. (2008) that the MT1E expression is positively correlated with cancer cell migration and tumor stage, and the MT1E isoform was found to be present in estrogen receptor-negative breast cancer cell lines (Friedline et al., 1998). ANXA3 is highly expressed in all colon cell lines and all breast-derived cell lines positive for the oestrogen receptor (Ross et al., 2000). A very recent study in Zhou et al. (2017) suggested that silencing the ANXA3 expression by RNA interference inhibits the proliferation and invasion of breast cancer cells. Moreover, studies in Shangguan et al. (2012) and Kretzschmar (2000) showed that the BAMBI transduction significantly inhibited TGF- $\beta$ /Smad signaling and expression of carcinoma-associated fibroblasts in human bone marrow mesenchymal stem cells (BM-MSCs), and disrupted the cytokine network mediating the interaction between MSCs and breast cancer cells. Consequently, the BAMBI transduction abolished protumor effects of BM-MSCs in vitro and in an orthotopic breast cancer xenograft model, and instead significantly inhibited growth and metastasis of coinoculated cancer. MAL2 expressions were shown to be elevated at both RNA and protein levels in breast cancer (Shehata et al., 2008). It has also been shown that MALL is associated with various forms of cancer (Oh et al., 2005; Landi et al., 2014). However, the effect of ARHGAP29 and MALL on breast cancer remains unclear and is worth further investigation.

Supplementary Materials

In the supplementary materials, we provide theoretical analysis under random designs, and proofs of all the theoretical results in this paper.

Acknowledgments

The authors thank the Editor, Associate Editor, and two anonymous referees for their valuable comments. This work is supported by a Connaught Award, NSERC Grant RGPIN-2018-06484, NSF Grants DMS-1662139, DMS-1712591, and DMS-1811376, NIH Grant 2R01-GM072611-14, and NSFC Grant 11690014.

Appendix

Appendix A A Lepski-type method

The ideal choice of the robustification parameter depends on the variance of the errors, provided it exists. Lepski’s adaptation method (Lepski, 1991) makes it possible to select $\tau$ without knowing the variance in advance. Assume that $v_{1}=n^{-1}\sum_{i=1}^{n}\mathbb{E}(\varepsilon_{i}^{2})<\infty$ and let $\sigma_{\max},\sigma_{\min}>0$ be such that $\sigma_{\min}\leq v_{1}^{1/2}\leq\sigma_{\max}.$ Here, parameters $\sigma_{\max}$ and $\sigma_{\min}$ serve as crude preliminary upper and lower bounds for $v_{1}^{1/2}$ , respectively.

For a prespecified $a>1$ , let $\sigma_{j}=\sigma_{\min}a^{j}$ and define the set

[TABLE]

with its cardinality satisfying ${\rm card}(\mathcal{J})\leq 1+\log_{a}(\sigma_{\max}/\sigma_{\min})$ . For every predetermined $t>0$ , compute a collection of Huber estimators $\{\widehat{\bm{\beta}}_{\tau_{j}}\}_{j\in\mathcal{J}}$ , where $\tau_{j}=\sigma_{j}(n/t)^{1/2}$ for $j\in\mathcal{J}$ . Set

[TABLE]

where $\widetilde{L}:=\max_{1\leq i\leq n}\|\mathbf{S}_{n}^{-1/2}\bm{x}_{i}\|_{\infty}$ assuming $\mathbf{S}_{n}=n^{-1}\sum_{i=1}^{n}\bm{x}_{i}\bm{x}_{i}^{\mathrm{\scriptstyle T}}$ is positive definite. The final data-driven estimator is then defined as $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}_{\tau_{\widehat{j}}}$ .

Theorem 6.

For any $t>0$ , the data-dependent estimator $\widehat{\bm{\beta}}$ satisfies the bound

[TABLE]

with probability at least $1-(2d+1)\log_{a}(a\sigma_{\max}/\sigma_{\min})e^{-t}$ , provided the sample size satisfies $n\geq 8\max(4\widetilde{L}^{2}d,\widetilde{L}^{4}d^{2})t$ .

The Lepski-type construction relies on preliminary crude upper and lower bounds for $v_{1}^{1/2}$ , which are usually unknown in advance. In practice, one can take $\sigma_{\min}=\widehat{\sigma}/K$ and $\sigma_{\max}=K\widehat{\sigma}$ for some $K>1$ , where $\widehat{\sigma}^{2}:=(n-d)^{-1}\sum_{i=1}^{n}(y_{i}-\langle\bm{x}_{i},\widehat{\bm{\beta}}^{{\rm ols}}\rangle)^{2}$ and $\widehat{\bm{\beta}}^{{\rm ols}}$ is the least squares estimator. Moreover, one may choose $a=1.5$ and $t=\log n$ or $\log(nd)$ . However, the effectiveness of this method depends on how sharp the constants are in the theoretical bounds. We note that all constants in Theorems 1 and 6 are explicit, although they might not be sharp. Finding sharp constants remains open. Since the arguments in this paper are already long and technical, we do not pursue this goal here.
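As a minimal sketch of the practical recipe above, the following Python code builds the candidate scales $\sigma_{j}=\sigma_{\min}a^{j}$ and the associated $\tau_{j}=\sigma_{j}(n/t)^{1/2}$ from an OLS fit. The function name `lepski_grid`, the default $K=2$ , the simulated data, and the choice of index set (all $j\geq 0$ with $\sigma_{j}<a\sigma_{\max}$ ) are assumptions made for illustration; the selection rule for $\widehat{j}$ is not implemented here.

```python
import numpy as np

def lepski_grid(X, y, K=2.0, a=1.5, t=None):
    """Candidate scales sigma_j and robustification parameters tau_j.

    Assumptions of this sketch: the index set contains every j >= 0 with
    sigma_j < a * sigma_max, and t defaults to log(n).
    """
    n, d = X.shape
    t = np.log(n) if t is None else t
    # OLS residual variance estimate with n - d degrees of freedom.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    sigma_hat = np.sqrt(resid @ resid / (n - d))
    sigma_min, sigma_max = sigma_hat / K, K * sigma_hat
    # Geometric grid sigma_j = sigma_min * a^j.
    sigmas = []
    s = sigma_min
    while s < a * sigma_max:
        sigmas.append(s)
        s *= a
    sigmas = np.array(sigmas)
    taus = sigmas * np.sqrt(n / t)
    return sigmas, taus

# Simulated heavy-tailed example (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5) + rng.standard_t(df=3, size=100)
sigmas, taus = lepski_grid(X, y)
```

One Huber estimator would then be computed for each entry of `taus` before applying the data-driven selection of $\widehat{j}$ .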

Proof of Theorem 6.

Following the proof of Theorem 1 which is given in Appendix C, it can be similarly proved that, for any $\tau=\tau_{0}(n/t)^{1/2}$ with $\tau_{0}\geq v_{1}^{1/2}$ ,

[TABLE]

with probability at least $1-(2d+1)e^{-t}$ as long as $n\geq 8\max(4\widetilde{L}^{2}d,\widetilde{L}^{4}d^{2})t$ .

Let $j^{*}=\min\{j\in\mathcal{J}:\sigma_{j}\geq v_{1}^{1/2}\}$ and note that $v_{1}^{1/2}\leq\sigma_{j^{*}}\leq av_{1}^{1/2}$ . By the definition of $\widehat{j}$ ,

[TABLE]

Define the event

[TABLE]

such that $\mathcal{E}\subseteq\{\widehat{j}\leq j^{*}\}$ . From (17) we see that for each $j\geq j^{*}$ ,

[TABLE]

with probability at least $1-(2d+1)e^{-t}$ under the prescribed sample size scaling. By the union bound, we obtain that

[TABLE]

On the event $\mathcal{E}$ , $\widehat{j}\leq j^{*}$ and thus

[TABLE]

Together, the last two displays yield (16). ∎

Appendix B Random Design Analysis

In this section, we derive counterparts of the results in Section 3 under random designs. First we impose the following moment conditions on the covariates and regression errors.

Condition 5.

In linear model (2), the covariate vectors $\bm{x}_{i}\in\mathbb{R}^{d}$ are i.i.d. from a sub-Gaussian random vector $\bm{x}$ , i.e. $\mathbb{P}(|\langle\bm{u},\widetilde{\bm{x}}\rangle|\geq y)\leq 2\exp(-y^{2}\|\bm{u}\|_{2}^{2}/A_{0}^{2})$ for all $y\in\mathbb{R}$ and $\bm{u}\in\mathbb{R}^{d}$ , where $\widetilde{\bm{x}}=\bm{\Sigma}^{-1/2}\bm{x}$ with $\bm{\Sigma}=(\sigma_{jk})_{1\leq j,k\leq d}=\mathbb{E}(\bm{x}\bm{x}^{\mathrm{\scriptstyle T}})$ being positive definite and $A_{0}>0$ is a constant. The regression errors $\varepsilon_{i}$ are independent and satisfy $\mathbb{E}(\varepsilon_{i}|\bm{x}_{i})=0$ and $v_{i,\delta}=\mathbb{E}(|\varepsilon_{i}|^{1+\delta}|\bm{x}_{i})<\infty$ almost surely for some $\delta>0$ .

Throughout this section, for simplicity, we assume the independent regression errors $\varepsilon_{i}$ in model (2) are homoscedastic in the sense that $v_{i,\delta}$ does not depend on $\bm{x}_{i}$ . The conditional heteroscedastic model can be allowed with slight modifications as before. With this setup, we write

[TABLE]

Assuming the $d\times d$ matrix $\bm{\Sigma}=\mathbb{E}(\bm{x}\bm{x}^{\mathrm{\scriptstyle T}})$ is positive definite, we use $\|\cdot\|_{\bm{\Sigma},2}$ to denote the rescaled $\ell_{2}$ -norm on $\mathbb{R}^{d}$ :

$$\|\bm{u}\|_{\bm{\Sigma},2}=\|\bm{\Sigma}^{1/2}\bm{u}\|_{2},\quad\bm{u}\in\mathbb{R}^{d}.$$

Moreover, we use $\psi_{\tau}$ to denote the derivative of Huber loss, that is,

$$\psi_{\tau}(x)=\ell_{\tau}^{\prime}(x)=\min\{\max(-\tau,x),\tau\},\quad x\in\mathbb{R}.\tag{18}$$
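For concreteness, the following sketch implements the Huber loss and its derivative $\psi_{\tau}$ , assuming the usual definition $\ell_{\tau}(x)=x^{2}/2$ for $|x|\leq\tau$ and $\ell_{\tau}(x)=\tau|x|-\tau^{2}/2$ otherwise. Under that assumption $\psi_{\tau}$ is simply clipping at $\pm\tau$ . The function names are ours.

```python
import numpy as np

def huber_loss(x, tau):
    """Huber loss: quadratic on [-tau, tau], linear outside."""
    x = np.asarray(x, dtype=float)
    quad = 0.5 * x**2
    lin = tau * np.abs(x) - 0.5 * tau**2
    return np.where(np.abs(x) <= tau, quad, lin)

def psi(x, tau):
    """Derivative of the Huber loss: the identity clipped to [-tau, tau]."""
    return np.clip(np.asarray(x, dtype=float), -tau, tau)

# Compare psi with a central finite difference of the loss.
xs = np.linspace(-5.0, 5.0, 101)
h = 1e-6
finite_diff = (huber_loss(xs + h, 2.0) - huber_loss(xs - h, 2.0)) / (2 * h)
```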

B.1 Huber regression in low dimensions

In the low dimensional regime “ $d\ll n$ ”, we consider the Huber estimator

$$\widehat{\bm{\beta}}_{\tau}\in\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\mathcal{L}_{\tau}(\bm{\beta}),$$

where $\mathcal{L}_{\tau}(\bm{\beta})=n^{-1}\sum_{i=1}^{n}\ell_{\tau}(y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle)$ is the empirical Huber loss function and $\tau>0$ is the robustification parameter. Under Condition 5, the following theorem provides (i) exponential-type concentration inequalities for $\widehat{\bm{\beta}}_{\tau}$ when $\tau$ is properly calibrated, and (ii) a nonasymptotic Bahadur representation result under the finite variance condition on regression errors, i.e. $\delta=1$ .
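The minimizer of $\mathcal{L}_{\tau}$ has no closed form, but it is easy to compute numerically. The sketch below uses iteratively reweighted least squares, which is one standard solver and not necessarily the algorithm used for the experiments in this paper. The value $\tau_{0}=2.5$ , the simulated data, and the function name `huber_irls` are illustrative assumptions.

```python
import numpy as np

def huber_irls(X, y, tau, max_iter=200, tol=1e-10):
    """Minimize the empirical Huber loss by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
    for _ in range(max_iter):
        r = y - X @ beta
        # Weight 1 for |r| <= tau and tau/|r| otherwise, so that w * r = psi_tau(r).
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.linalg.norm(beta_new - beta) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated example with t-distributed errors (illustrative only).
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + rng.standard_t(df=2.5, size=n)
t = np.log(n)
tau = 2.5 * np.sqrt(n / (d + t))  # tau_0 = 2.5 is a guess, not a tuned value
beta_hat = huber_irls(X, y, tau)
# First-order condition: the averaged score should vanish at the minimizer.
score = X.T @ np.clip(y - X @ beta_hat, -tau, tau) / n
```

At a fixed point of the iteration the weighted normal equations coincide with the first-order condition $n^{-1}\sum_{i=1}^{n}\psi_{\tau}(y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle)\bm{x}_{i}=\mathbf{0}$ , which is what `score` checks.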

Theorem 7.

Suppose Condition 5 holds.

(I)

For any $t>0$ and $\tau_{0}\geq\nu_{\delta}$ , the estimator $\widehat{\bm{\beta}}_{\tau}$ with $\tau=\tau_{0}\{n/(d+t)\}^{\max\{1/(1+\delta),1/2\}}$ satisfies

[TABLE]

as long as $n\geq C_{2}(d+t)$ , where $C_{1},C_{2}>0$ depend only on $A_{0}$ .

(II)

Assume that $v_{1}<\infty$ . For any $t>0$ and $\tau_{0}\geq v_{1}^{1/2}$ , the estimator $\widehat{\bm{\beta}}_{\tau}$ with $\tau=\tau_{0}\sqrt{n/(d+t)}$ satisfies

[TABLE]

provided $n\geq C_{2}(d+t)$ , where $C_{3}>0$ depends only on $A_{0}$ .
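The calibration of $\tau$ in part (I) can be written as a one-line function. This is a small sketch; the function name and the default $\tau_{0}=1$ are our own choices. It makes the phase transition visible: the exponent equals $1/(1+\delta)$ for $0<\delta<1$ and stays at $1/2$ once $\delta\geq 1$ .

```python
import math

def tau_low_dim(n, d, t, delta, tau0=1.0):
    """tau = tau0 * {n / (d + t)}^{max(1/(1+delta), 1/2)}, as in Theorem 7(I)."""
    exponent = max(1.0 / (1.0 + delta), 0.5)
    return tau0 * (n / (d + t)) ** exponent

# With n / (d + t) = 10: finite variance (delta = 1) gives sqrt(10),
# and higher moments (delta = 3) do not change the exponent.
tau_var = tau_low_dim(100, 5, 5.0, delta=1.0)
tau_more = tau_low_dim(100, 5, 5.0, delta=3.0)
# With n / (d + t) = 8 and delta = 0.5 the exponent is 2/3.
tau_heavy = tau_low_dim(80, 5, 5.0, delta=0.5)
```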

With random designs, the first part of Theorem 7 provides concentration inequalities for the $\ell_{2}$ -error under finite $(1+\delta)$ -th moment conditions with $\delta>0$ ; when the second moments are finite, the second part gives a finite-sample approximation of $\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*}$ by a sum of independent random vectors. The remainder of such an approximation exhibits sub-exponential tails. Unlike the least squares estimator, the adaptive Huber estimator does not admit an explicit closed-form representation, which causes the main difficulty for analyzing its asymptotic and nonasymptotic properties. Theorem 7 reveals that, up to a higher-order remainder, the distributional property of $\widehat{\bm{\beta}}_{\tau}$ mainly depends on a linear stochastic term that is much easier to deal with.

Regarding the truncated random variable $\psi_{\tau}(\varepsilon_{i})$ , the following result shows that the differences between the first two moments of $\psi_{\tau}(\varepsilon_{i})$ and $\varepsilon_{i}$ depend on both $\tau$ and the moments of $\varepsilon_{i}$ . The more moments $\varepsilon_{i}$ has, the faster these differences decay as a function of $\tau$ . We summarize this observation in the following proposition. We drop $i$ for ease of presentation.

Proposition 2.

Assume that $\mathbb{E}(\varepsilon)=0$ , $\sigma^{2}=\mathbb{E}(\varepsilon^{2})>0$ and $\mathbb{E}(|\varepsilon|^{2+\kappa})<\infty$ for some $\kappa\geq 0$ . Then we have

[TABLE]

Moreover, if $\kappa>0$ ,

[TABLE]

Proposition 2, along with Theorem 7, shows that the adaptive Huber estimator achieves nonasymptotic robustness against heavy-tailed errors, while enjoying high efficiency when $\tau$ diverges to $\infty$ . In particular, taking $t=\log n$ , we see that under the scaling $n\gtrsim d$ , the robust estimator $\widehat{\bm{\beta}}_{\tau}$ with $\tau\asymp\sqrt{n/(d+\log n)}$ satisfies

[TABLE]

with probability at least $1-O(n^{-1})$ . From an asymptotic point of view, this implies that if the dimension $d$ , as a function of $n$ , satisfies

[TABLE]

then for any deterministic vector $\bm{a}\in\mathbb{R}^{d}$ , the distribution of $\langle\bm{a},\widehat{\bm{\beta}}_{\tau}-{\bm{\beta}}^{*}\rangle$ is close to that of $n^{-1}\sum_{i=1}^{n}\psi_{\tau}(\varepsilon_{i})\langle\bm{a},\bm{\Sigma}^{-1}\bm{x}_{i}\rangle$ . If $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent copies of $\varepsilon$ with variance $\sigma^{2}$ and $\mathbb{E}(|\varepsilon|^{2+\kappa})<\infty$ for some $\kappa>0$ , taking $\tau\asymp\sqrt{n/(d+\log n)}$ in Proposition 2 implies that $n^{-1/2}\sum_{i=1}^{n}\psi_{\tau}(\varepsilon_{i})\langle\bm{a},\bm{\Sigma}^{-1}\bm{x}_{i}\rangle$ is asymptotically normal with mean zero and variance $\sigma^{2}\|\bm{\Sigma}^{-1/2}\bm{a}\|_{2}^{2}$ .

B.2 Huber regression in high dimensions

In the high dimensional setting where $d\gg n$ and $s=\|\bm{\beta}^{*}\|_{0}\ll n$ , we investigate the $\ell_{1}$ -regularized Huber estimator

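$$\widehat{\bm{\beta}}_{\tau,\lambda}\in\arg\min_{\bm{\beta}\in\mathbb{R}^{d}}\big\{\mathcal{L}_{\tau}(\bm{\beta})+\lambda\|\bm{\beta}\|_{1}\big\}\tag{21}$$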

under Condition 5, where $\tau$ and $\lambda$ represent, respectively, the robustification and regularization parameters.
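One simple way to compute an $\ell_{1}$ -regularized Huber estimator is proximal gradient descent (ISTA) with soft-thresholding, sketched below. This is an illustrative solver and not necessarily the one used in the paper; the values of $\tau$ and $\lambda$ , the simulated data, and all function names are assumptions. The step size uses the fact that $\psi_{\tau}$ is 1-Lipschitz, so the gradient of the Huber loss is Lipschitz with constant at most $\|\mathbf{X}\|_{2}^{2}/n$ .

```python
import numpy as np

def soft_threshold(z, thr):
    """Proximal map of thr * ||.||_1 (coordinatewise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def objective(beta, X, y, tau, lam):
    """Empirical Huber loss plus the l1 penalty."""
    r = y - X @ beta
    a = np.abs(r)
    loss = np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2)
    return loss.mean() + lam * np.abs(beta).sum()

def huber_lasso_ista(X, y, tau, lam, n_iter=500):
    """Proximal gradient descent for the l1-penalized Huber loss."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz bound of the gradient
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = -X.T @ np.clip(y - X @ beta, -tau, tau) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Simulated sparse example with d > n (illustrative only).
rng = np.random.default_rng(2)
n, d = 60, 120
X = rng.normal(size=(n, d))
beta_star = np.zeros(d)
beta_star[:3] = [2.0, -1.5, 1.0]
y = X @ beta_star + rng.standard_t(df=2.5, size=n)
tau, lam = 5.0, 0.4  # illustrative values, not the theoretical calibration
beta_hat = huber_lasso_ista(X, y, tau, lam)
```

With this step size each iteration does not increase the objective, and any $\lambda$ at least as large as $\|\nabla\mathcal{L}_{\tau}(\mathbf{0})\|_{\infty}$ keeps the iterate at zero.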

Theorem 8.

Assume Condition 5 holds and that the unknown $\bm{\beta}^{*}$ is sparse with $s=\|\bm{\beta}^{*}\|_{0}$ . Then any optimal solution $\widehat{\bm{\beta}}_{\tau,\lambda}$ to the convex program (21) with

[TABLE]

and $\lambda$ scaling as $A_{0}\sigma_{\max}\tau_{0}\{(\log d)/n\}^{\min\{\delta/(1+\delta),1/2\}}$ satisfies the bounds

[TABLE]

with probability at least $1-3d^{-1}$ as long as $n\geq C\kappa_{l}^{-1}\sigma_{\max}^{2}s\log d$ , where $C>0$ is a constant only depending on $A_{0}$ , $\sigma_{\max}=\max_{1\leq j\leq d}\sigma_{jj}^{1/2}$ and $\kappa_{l}=\lambda_{\min}(\bm{\Sigma})$ .

Provided the distribution of $\varepsilon_{i}$ has finite variance, i.e. $\delta=1$ , Theorem 8 asserts that the $\ell_{1}$ -regularized Huber regression with properly tuned $(\tau,\lambda)$ gives rise to statistically consistent estimators with $\ell_{1}$ - and $\ell_{2}$ -errors scaling as $s\sqrt{(\log d)/n}$ and $\sqrt{s(\log d)/n}$ , respectively, under the sample size scaling $n\gtrsim s\log d$ . These rates are the minimax rates enjoyed by the standard Lasso with Gaussian/sub-Gaussian errors (Bickel, Ritov and Tsybakov, 2009; Wainwright, 2009).

The results of Theorem 8 are useful complements to those in Theorem 4 under fixed designs. Taking $t=\log d$ therein, we see that the $\ell_{2}$ -error bound in (10) almost coincides with that in (23) up to constant factors. The sample size scaling under random designs is optimal and better than the scaling under fixed designs: the former is of order $O(s\log d)$ , while the latter is of order $O(s^{2}\log d)$ . Technically, the sample size scaling is required to ensure the restricted strong convexity of Huber loss in a neighborhood of $\bm{\beta}^{*}$ ; see Lemma 1 in the main text and Lemma 5 below. Since most existing works on analyzing high dimensional $M$ -estimators beyond the least squares have focused on random designs (see, e.g. Belloni and Chernozhukov (2011), Negahban et al. (2012) and the references therein), it is not clear what the optimal sample size scaling is under fixed designs, although it is possible that the additional $s$ factor in Theorem 4 is purely an artifact of the proof technique. We refer to van de Geer (2008) for a study of generalized linear models in high dimensions. To achieve the oracle rate for the excess risk, the sparsity $s$ is required to be of order $O(\sqrt{n/\log n})$ , or equivalently, the required sample size scales as $s^{2}\log n$ .

We conclude this section with a prediction error bound for $\widehat{\bm{\beta}}_{\tau,\lambda}$ , which is a direct consequence of Theorem 8.

Corollary 1.

Under the conditions of Theorem 8, it holds that

[TABLE]

with probability at least $1-5d^{-1}$ , where $\mathbf{X}=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\mathrm{\scriptstyle T}}$ is the $n\times d$ design matrix.

Appendix C Proofs of Main Theorems

Throughout the proofs, we use $\psi_{\tau}=\ell^{\prime}_{\tau}$ as in definition (18) and let $\|\cdot\|_{\bm{\Sigma},2}$ be the rescaled $\ell_{2}$ -norm on $\mathbb{R}^{d}$ given by $\|\bm{u}\|_{\bm{\Sigma},2}=\|\bm{\Sigma}^{1/2}\bm{u}\|_{2}$ for $\bm{u}\in\mathbb{R}^{d}$ .

C.1 Auxiliary Lemmas

First we collect several auxiliary lemmas. Our first lemma concerns the localized analysis that can be utilized to remove the parameter constraint in previous works. It is established in Fan et al. (2018) and we reproduce it here for completeness.

Lemma 2.

Let $D_{\mathcal{L}}(\bm{\beta}_{1},\bm{\beta}_{2})=\mathcal{L}(\bm{\beta}_{1})-\mathcal{L}(\bm{\beta}_{2})-\langle\nabla\mathcal{L}(\bm{\beta}_{2}),\bm{\beta}_{1}-\bm{\beta}_{2}\rangle$ and $D_{\mathcal{L}}^{s}(\bm{\beta}_{1},\bm{\beta}_{2})=D_{\mathcal{L}}(\bm{\beta}_{1},\bm{\beta}_{2})+D_{\mathcal{L}}(\bm{\beta}_{2},\bm{\beta}_{1})$ . For $\bm{\beta}_{\eta}=\bm{\beta}^{*}+\eta(\bm{\beta}-\bm{\beta}^{*})$ with $\eta\in(0,1]$ and any convex loss functions $\mathcal{L}$ , we have

[TABLE]

Proof of Lemma 2.

Let $Q(\eta)=D_{\mathcal{L}}(\bm{\beta}_{\eta},\bm{\beta}^{*})=\mathcal{L}(\bm{\beta}_{\eta})-\mathcal{L}(\bm{\beta}^{*})-\langle\nabla\mathcal{L}(\bm{\beta}^{*}),\bm{\beta}_{\eta}-\bm{\beta}^{*}\rangle$ . Noting that the derivative of $\mathcal{L}(\bm{\beta}_{\eta})$ with respect to $\eta$ is $\frac{d}{d\eta}\mathcal{L}(\bm{\beta}_{\eta})=\langle\nabla\mathcal{L}(\bm{\beta}_{\eta}),\bm{\beta}-\bm{\beta}^{*}\rangle$ , we have

[TABLE]

Then, the symmetric Bregman divergence $D_{\mathcal{L}}^{s}(\bm{\beta}_{\eta},\bm{\beta}^{*})$ can be written as

[TABLE]

Taking $\eta=1$ in the above equation, we have $Q^{\prime}(1)=D_{\mathcal{L}}^{s}(\bm{\beta},\bm{\beta}^{*})$ as a special case. If $Q(\eta)$ is convex, then $Q^{\prime}(\eta)$ is non-decreasing and thus

[TABLE]

It remains to show the convexity of $\eta\in[0,1]\mapsto Q(\eta)$ ; or equivalently, the convexity of $\mathcal{L}(\bm{\beta}_{\eta})$ and $\langle\nabla\mathcal{L}(\bm{\beta}^{*}),\bm{\beta}^{*}-\bm{\beta}_{\eta}\rangle$ , respectively. First, note that $\bm{\beta}_{\eta}$ , as a function of $\eta$ , is linear in $\eta$ , that is, $\bm{\beta}_{\alpha_{1}\eta_{1}+\alpha_{2}\eta_{2}}=\alpha_{1}\bm{\beta}_{\eta_{1}}+\alpha_{2}\bm{\beta}_{\eta_{2}}$ for all $\eta_{1},\eta_{2}\in[0,1]$ and $\alpha_{1},\alpha_{2}\geq 0$ satisfying $\alpha_{1}+\alpha_{2}=1$ . Then, the convexity of $\eta\mapsto\mathcal{L}(\bm{\beta}_{\eta})$ follows from this linearity and the convexity of the Huber loss. The convexity of the second term follows directly from the bi-linearity of the inner product. ∎
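As a numerical sanity check of this argument, the sketch below evaluates the symmetric Bregman divergence of the empirical Huber loss, using the identity $D_{\mathcal{L}}^{s}(\bm{\beta}_{1},\bm{\beta}_{2})=\langle\nabla\mathcal{L}(\bm{\beta}_{1})-\nabla\mathcal{L}(\bm{\beta}_{2}),\bm{\beta}_{1}-\bm{\beta}_{2}\rangle$ , which follows from the definition in Lemma 2. It then verifies the inequality $D_{\mathcal{L}}^{s}(\bm{\beta}_{\eta},\bm{\beta}^{*})\leq\eta D_{\mathcal{L}}^{s}(\bm{\beta},\bm{\beta}^{*})$ , which is our reading of what the proof delivers. Data, $\tau$ , and function names are illustrative assumptions.

```python
import numpy as np

def grad_huber(beta, X, y, tau):
    """Gradient of the empirical Huber loss at beta."""
    return -X.T @ np.clip(y - X @ beta, -tau, tau) / len(y)

def sym_bregman(b1, b2, X, y, tau):
    """Symmetric Bregman divergence <grad L(b1) - grad L(b2), b1 - b2>."""
    return (grad_huber(b1, X, y, tau) - grad_huber(b2, X, y, tau)) @ (b1 - b2)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
beta_star = rng.normal(size=4)
y = X @ beta_star + rng.standard_t(df=3, size=50)
tau = 1.5
beta = beta_star + 3.0 * rng.normal(size=4)

# Compare D^s(beta_eta, beta*) with eta * D^s(beta, beta*) along the segment.
etas = np.linspace(0.05, 1.0, 20)
lhs = np.array([
    sym_bregman(beta_star + e * (beta - beta_star), beta_star, X, y, tau)
    for e in etas
])
rhs = etas * sym_bregman(beta, beta_star, X, y, tau)
```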

The following two lemmas provide restricted strong convexity properties for the Huber loss in a local vicinity of the true parameter under both fixed and random designs.

Lemma 3.

Assume that Condition 1 holds and that $v_{\delta}=n^{-1}\sum_{i=1}^{n}\mathbb{E}(|\varepsilon_{i}|^{1+\delta})<\infty$ for some $0<\delta\leq 1$ . Then for any $t,r>0$ , the Hessian matrix $\nabla^{2}\mathcal{L}_{\tau}(\bm{\beta})$ with $\tau>2Mr$ satisfies that, with probability greater than $1-e^{-t}$ ,

[TABLE]

where $M=\max_{1\leq i\leq n}\|\bm{x}_{i}\|_{2}$ .

Proof of Lemma 3.

To begin with, note that

[TABLE]

where $\mathbf{S}_{n}$ is given in Condition 1. For each $\bm{\beta}\in\mathbb{R}^{d}$ , define its centered and rescaled version $\bm{\beta}_{0}=\bm{\beta}-\bm{\beta}^{*}$ such that $y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle=\varepsilon_{i}-\langle\bm{x}_{i},\bm{\beta}_{0}\rangle$ . Using the inequality that

[TABLE]

we have, for any $\bm{u}\in\mathbb{S}^{d-1}$ and $\bm{\beta}\in\mathbb{R}^{d}$ satisfying $\|\bm{\beta}_{0}\|_{2}\leq r$ ,

[TABLE]

provided that $\tau>2Mr$ . For any $z\geq 0$ , it follows from Hoeffding’s inequality that, with probability at least $1-e^{-2nz^{2}}$ ,

[TABLE]

This, together with the inequality $\mathbb{P}(|\varepsilon_{i}|>\tau/2)\leq(2/\tau)^{1+\delta}v_{i,\delta}$ and Condition 1, implies that, with probability at least $1-e^{-2nz^{2}}$ ,

[TABLE]

This proves (25) immediately by taking $z=\sqrt{t/(2n)}$ . ∎
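To illustrate the quantity controlled in Lemma 3, the sketch below computes the (generalized) Hessian of the empirical Huber loss, assuming the standard form $\nabla^{2}\mathcal{L}_{\tau}(\bm{\beta})=n^{-1}\sum_{i=1}^{n}\bm{x}_{i}\bm{x}_{i}^{\mathrm{\scriptstyle T}}1(|y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle|\leq\tau)$ , and compares its smallest eigenvalue with that of $\mathbf{S}_{n}$ . The simulated data and the value $\tau=3$ are illustrative.

```python
import numpy as np

def huber_hessian(beta, X, y, tau):
    """Average of x_i x_i^T over observations whose residual lies in [-tau, tau]."""
    inlier = np.abs(y - X @ beta) <= tau
    X_in = X[inlier]
    return X_in.T @ X_in / X.shape[0]

rng = np.random.default_rng(4)
n, d = 300, 4
X = rng.normal(size=(n, d))
beta_star = np.ones(d)
y = X @ beta_star + rng.standard_t(df=2.5, size=n)

S_n = X.T @ X / n
H = huber_hessian(beta_star, X, y, tau=3.0)
# eigvalsh returns eigenvalues in ascending order.
lam_min_H = np.linalg.eigvalsh(H)[0]
lam_min_S = np.linalg.eigvalsh(S_n)[0]
```

Since $\mathbf{S}_{n}$ minus the Hessian is positive semidefinite, `lam_min_H` can never exceed `lam_min_S`; the lemma quantifies how little is lost when $\tau$ is large relative to the residuals.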

Lemma 4.

Assume $v_{\delta}<\infty$ for some $0<\delta\leq 1$ and $(\mathbb{E}\langle\bm{u},\widetilde{\bm{x}}\rangle^{4})^{1/4}\leq A_{1}\|\bm{u}\|_{2}$ for all $\bm{u}\in\mathbb{R}^{d}$ and some constant $A_{1}>0$ . Moreover, let $\tau,r>0$ satisfy

[TABLE]

Then with probability at least $1-e^{-t}$ ,

[TABLE]

uniformly over $\bm{\beta}\in\Theta_{0}(r)=\{\bm{\beta}\in\mathbb{R}^{d}:\|\bm{\beta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}\leq r\}$ .

Proof of Lemma 4.

To begin with, note that

[TABLE]

where $1\{\mathcal{E}_{i}\}$ denotes the indicator function of the event

[TABLE]

On $\mathcal{E}_{i}$ , it holds that $|y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle|\leq|\varepsilon_{i}|+|\langle\bm{x}_{i},\bm{\beta}-\bm{\beta}^{*}\rangle|\leq\tau/2+\tau/2=\tau$ for all $\bm{\beta}\in\Theta_{0}(r)$ . Since $\psi_{\tau}^{\prime}(x)=1$ for $|x|\leq\tau$ , the right-hand side of (28) can be bounded from below by

[TABLE]

To bound the right-hand side of (29), the main difficulty is that the indicator function is non-smooth. To deal with this issue, we define the following “smoothed” functions: for any $R>0$ , write

[TABLE]

It is easy to see that the function $\phi_{R}$ is $R$ -Lipschitz and satisfies

[TABLE]
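For intuition, the following sketch implements one smoothed truncation function that has the properties of $\phi_{R}$ used in this proof: it is $R$ -Lipschitz, satisfies $0\leq\phi_{R}(x)\leq\min(x^{2},R^{2}/4)$ , dominates $x^{2}1(|x|\leq R/2)$ , and obeys the scaling $\phi_{cR}(cx)=c^{2}\phi_{R}(x)$ . The specific piecewise form below is an assumption for illustration and may differ from the definition in the display above.

```python
import numpy as np

def phi(x, R):
    """Piecewise-quadratic surrogate for x^2 * 1(|x| <= R).

    Equals x^2 on [-R/2, R/2], (|x| - R)^2 on R/2 < |x| <= R, and 0 beyond R.
    """
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    return np.where(ax <= R / 2, x**2, np.where(ax <= R, (ax - R) ** 2, 0.0))

R = 2.0
xs = np.linspace(-3.0, 3.0, 601)
vals = phi(xs, R)
# Largest slope between neighbouring grid points, to compare with R.
max_slope = np.max(np.abs(np.diff(vals)) / np.diff(xs))
```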

Together, (28), (29) and (30) imply

[TABLE]

For $r>0$ , define $\Delta(r)=\sup_{\bm{\beta}\in\Theta_{0}(r)}|g(\bm{\beta})-\mathbb{E}g(\bm{\beta})|/\|\bm{\beta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}^{2}$ , such that

[TABLE]

for all $\bm{\beta}\in\Theta_{0}(r)$ . In the following, we establish lower and upper bounds for $\mathbb{E}g(\bm{\beta})$ and $\Delta(r)$ , respectively, starting with the former.

For $\bm{\beta}\in\mathbb{R}^{d}$ , write $\bm{\delta}=\bm{\beta}-\bm{\beta}^{*}$ . By (31) and Markov’s inequality,

[TABLE]

Provided $\tau\geq 2\max\{(4v_{\delta})^{1/(1+\delta)},4A_{1}^{2}r\}$ ,

[TABLE]

Next we bound the supremum $\Delta(r)$ . Write $g(\bm{\beta})=n^{-1}\sum_{i=1}^{n}g_{i}(\bm{\beta})$ . Noting that $0\leq\phi_{R}(x)\leq R^{2}/4$ and $0\leq\varphi(y)\leq 1$ , we have

[TABLE]

By Theorem 7.3 in Bousquet (2003), for any $x>0$ , $\Delta(r)$ satisfies the bound

[TABLE]

with probability at least $1-e^{-x}$ , where by (30),

[TABLE]

For the expected value $\mathbb{E}\Delta(r)$ , using the symmetrization inequality and the connection between Gaussian complexity and Rademacher complexity, we obtain that $\mathbb{E}\Delta(r)\leq\sqrt{2\pi}\,\mathbb{E}\{\sup_{\bm{\beta}\in\Theta_{0}(r)}|\mathbb{G}_{\bm{\beta}}|\}$ , where

[TABLE]

and $G_{i}$ are i.i.d. standard normal random variables that are independent of $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ . Let $\mathbb{E}^{*}$ be the conditional expectation given $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ . Since $\{\mathbb{G}_{\bm{\beta}}:\bm{\beta}\in\Theta_{0}(r)\}$ is a conditional Gaussian process, for any $\bm{\beta}_{0}\in\Theta_{0}(r)$ we have

[TABLE]

Further, taking the expectation with respect to $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ on both sides, (35) remains valid with $\mathbb{E}^{*}$ replaced by $\mathbb{E}$ . We write $\bm{\beta}^{*}$ as $(\beta^{*}_{1},\widetilde{\bm{\beta}}^{*\mathrm{\scriptstyle T}})^{\mathrm{\scriptstyle T}}$ with $\beta^{*}_{1}$ denoting the first coordinate of $\bm{\beta}^{*}$ and $\widetilde{\bm{\beta}}^{*}\in\mathbb{R}^{d-1}$ . Recalling $\phi_{R}(u)\leq\min(u^{2},R^{2}/4)$ , we take $\bm{\beta}_{0}=(\beta^{*}_{1}+(\mathbb{E}x_{1}^{2})^{-1/2}r,\widetilde{\bm{\beta}}^{*\mathrm{\scriptstyle T}})^{\mathrm{\scriptstyle T}}$ so that $\|\bm{\beta}_{0}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}=r$ and $\mathbb{E}|\mathbb{G}_{\bm{\beta}_{0}}|\leq(\mathbb{E}\mathbb{G}_{\bm{\beta}_{0}}^{2})^{1/2}\leq(4r)^{-1}\tau n^{-1/2}$ . To bound the conditional expectation $\mathbb{E}^{*}\{\sup_{\bm{\beta}\in\Theta_{0}(r)}\mathbb{G}_{\bm{\beta}}\}$ in (35), we employ the Gaussian comparison theorem as in the proof of Lemma 11 in Loh and Wainwright (2015).

Denote by $\textnormal{var}^{*}$ the conditional variance given $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ . For $\bm{\beta},\bm{\beta}^{\prime}\in\Theta_{0}(r)$ , write $\bm{\delta}=\bm{\beta}-\bm{\beta}^{*}$ and $\bm{\delta}^{\prime}=\bm{\beta}^{\prime}-\bm{\beta}^{*}$ . By conditional normality, we quickly compute and bound the variance of $\mathbb{G}_{\bm{\beta}}-\mathbb{G}_{\bm{\beta}^{\prime}}$ :

[TABLE]

Using the property $\phi_{cR}(cx)=c^{2}\phi_{R}(x)$ for any $c>0$ , we find that

[TABLE]

It follows from the above calculations and the Lipschitz property of $\phi_{R}$ that

[TABLE]

Let $G_{1}^{\prime},\ldots,G_{n}^{\prime}$ be i.i.d. standard normal random variables that are independent of all the previous variables, and define a new process

[TABLE]

As an immediate consequence of (36), we have $\textnormal{var}^{*}(\mathbb{G}_{\bm{\beta}}-\mathbb{G}_{\bm{\beta}^{\prime}})\leq\textnormal{var}^{*}(\mathbb{Z}_{\bm{\beta}}-\mathbb{Z}_{\bm{\beta}^{\prime}})$ . Therefore, by the Gaussian comparison inequality (Ledoux and Talagrand, 1991),

[TABLE]

where $\widetilde{\bm{x}}_{i}=\bm{\Sigma}^{-1/2}\bm{x}_{i}$ . Taking the expectation with respect to $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ on both sides gives $\mathbb{E}\{\sup_{\bm{\beta}\in\Theta_{0}(r)}\mathbb{G}_{\bm{\beta}}\}\leq(\tau/r)\mathbb{E}\|n^{-1}\sum_{i=1}^{n}G_{i}^{\prime}\widetilde{\bm{x}}_{i}\|_{2}\leq(\tau/r)\sqrt{d/n}$ . From this and the unconditional version of (35), we obtain

[TABLE]

Together, (34) with $x=t$ and (37) imply that as long as $n\gtrsim(\tau/r)^{2}(d+t)$ , $\Delta(r)\leq 1/4$ with probability at least $1-e^{-t}$ . Combining this with (29) and (33) proves the stated result. ∎

Recall that $\Theta_{0}(r)=\{\bm{\beta}\in\mathbb{R}^{d}:\|\bm{\beta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}\leq r\}$ . Let $\mathcal{C}=\{\bm{\beta}\in\mathbb{R}^{d}:\|(\bm{\beta}-\bm{\beta}^{*})_{{\mathcal{S}}^{{\rm c}}}\|_{1}\leq 3\|(\bm{\beta}-\bm{\beta}^{*})_{{\mathcal{S}}}\|_{1}\}$ be an $\ell_{1}$ -cone in $\mathbb{R}^{d}$ , where ${\mathcal{S}}\subseteq\{1,\ldots,d\}$ denotes the support of $\bm{\beta}^{*}$ . As a counterpart of Lemma 4 in high dimensions, Lemma 5 below shows that the adaptive Huber loss satisfies the restricted strong convexity condition over $\Theta_{0}(r)\cap\mathcal{C}$ with high probability.
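The cone $\mathcal{C}$ is straightforward to check numerically. The helper below, a small sketch with a function name of our choosing, tests whether a candidate $\bm{\beta}$ satisfies $\|(\bm{\beta}-\bm{\beta}^{*})_{\mathcal{S}^{\rm c}}\|_{1}\leq 3\|(\bm{\beta}-\bm{\beta}^{*})_{\mathcal{S}}\|_{1}$ .

```python
import numpy as np

def in_l1_cone(beta, beta_star, support, c=3.0):
    """Return True if the error beta - beta_star lies in the l1-cone with constant c."""
    delta = np.asarray(beta, dtype=float) - np.asarray(beta_star, dtype=float)
    on_support = np.zeros(delta.size, dtype=bool)
    on_support[list(support)] = True
    off_mass = np.abs(delta[~on_support]).sum()
    on_mass = np.abs(delta[on_support]).sum()
    return bool(off_mass <= c * on_mass)

beta_star = np.array([1.0, 0.0, 0.0])
support = [0]
# Off-support error 1.0 versus on-support error 1.0: inside the cone.
inside = in_l1_cone([2.0, 0.5, 0.5], beta_star, support)
# Off-support error 1.0 versus on-support error about 0.1: outside the cone.
outside = in_l1_cone([1.1, 0.5, 0.5], beta_star, support)
```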

Lemma 5.

Assume $v_{\delta}<\infty$ for some $0<\delta\leq 1$ and $(\mathbb{E}\langle\bm{u},\widetilde{\bm{x}}\rangle^{4})^{1/4}\leq A_{1}\|\bm{u}\|_{2}$ for all $\bm{u}\in\mathbb{R}^{d}$ and some constant $A_{1}>0$ . Let $(n,d,\tau,r)$ satisfy

[TABLE]

Then with probability at least $1-d^{-1}$ ,

[TABLE]

uniformly over $\bm{\beta}\in\Theta_{0}(r)\cap\mathcal{C}$ .

Proof of Lemma 5.

The proof is based on an argument similar to that in the proof of Lemma 4. With slight abuse of notation, we keep using $\Delta(r)$ as the supremum of a random process:

[TABLE]

Provided $\tau\geq 2\max\{(4v_{\delta})^{1/(1+\delta)},4A_{1}^{2}r\}$ , it can be shown that

[TABLE]

According to (34), it remains to bound $\mathbb{E}\Delta(r)$ . Following the proof of Lemma 4, it suffices to focus on the (conditional) Gaussian process

[TABLE]

where $G^{\prime}_{i}$ are i.i.d. standard normal random variables that are independent of all other random variables. For every $\bm{\beta}\in\Theta_{0}(r)\cap\mathcal{C}$ , it is easy to see that

[TABLE]

implying

[TABLE]

Keeping all other statements the same, we obtain

[TABLE]

With $\bm{x}_{i}=(x_{i1},\ldots,x_{id})^{\mathrm{\scriptstyle T}}\in\mathbb{R}^{d}$ , note that

[TABLE]

Since $G^{\prime}_{i}x_{ij}$ are sub-exponential/sub-gamma random variables, from Corollary 2.6 in Boucheron, Lugosi and Massart (2013) we find that

[TABLE]

Substituting this into (34) and taking $x=\log d$ , we obtain that with probability at least $1-d^{-1}$ ,

[TABLE]

for all sufficiently large $n$ that scales as $\kappa_{l}^{-1}(A_{0}\tau/r)^{2}\max_{1\leq j\leq d}\sigma_{jj}\,s\log d$ up to an absolute constant. This proves (39). ∎

Lemmas 6 and 7 provide concentration inequalities for $\|\bm{\Sigma}^{-1/2}\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{2}$ and $\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}$ , respectively.

Lemma 6.

Assume Condition 5 holds with $0<\delta\leq 1$ . Then with probability at least $1-2e^{-t}$ ,

[TABLE]

Proof of Lemma 6.

Assume without loss of generality that $t\geq\log 2$ , or equivalently, $2e^{-t}\leq 1$ ; otherwise $2e^{-t}>1$ so that the bound is trivial. To bound $\|\bm{\Sigma}^{-1/2}\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{2}$ , first define the centered random vector

[TABLE]

where $\xi_{i}=\psi_{\tau}(\varepsilon_{i})$ . To evaluate the $\ell_{2}$ -norm, there exists a $1/2$ -net $\mathcal{N}_{1/2}$ of the unit sphere $\mathbb{S}^{d-1}$ in $\mathbb{R}^{d}$ with $|\mathcal{N}_{1/2}|\leq 5^{d}$ such that $\|\bm{\xi}^{*}\|_{2}\leq 2\max_{\bm{u}\in\mathcal{N}_{1/2}}|\langle\bm{u},\bm{\xi}^{*}\rangle|$ . Under Condition 5, it holds for every $\bm{u}\in\mathbb{S}^{d-1}$ that $\mathbb{E}|\langle\bm{u},\widetilde{\bm{x}}\rangle|^{k}\leq A_{0}^{k}\,k\Gamma(k/2)$ for all $k\geq 1$ . By direct calculations,

[TABLE]

It then follows from Bernstein’s inequality that

[TABLE]

Taking the union bound over $\bm{u}\in\mathcal{N}_{1/2}$ , we obtain that with probability at least $1-5^{d}\cdot 2e^{-x}$ ,

[TABLE]

Next, for the deterministic part $\|\bm{\Sigma}^{-1/2}\nabla\mathbb{E}\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{2}$ , it is easy to see that

[TABLE]

Combining this and (42) with $x=2(d+t)$ , we reach the bound (41) which holds with probability at least $1-2e^{-2t}\geq 1-e^{-t}$ . ∎

Lemma 7.

Assume Condition 5 holds with $0<\delta\leq 1$ . Then with probability at least $1-2d^{-1}$ ,

[TABLE]

Proof of Lemma 7.

The proof is based on Bernstein’s inequality and the union bound. Define $\xi_{i}=\psi_{\tau}(\varepsilon_{i})$ for $i=1,\ldots,n$ such that $\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})=-n^{-1}\sum_{i=1}^{n}\xi_{i}\bm{x}_{i}$ . For every $1\leq j\leq d$ , note that $|\mathbb{E}(\xi_{i}x_{ij})|=|\mathbb{E}\{\mathbb{E}(\xi_{i}|x_{ij})x_{ij}\}|\leq\sigma_{jj}^{1/2}v_{\delta}\tau^{-\delta}$ . Moreover, from the proof of Lemma 6 we see that

[TABLE]

By Bernstein’s inequality, for any $x>0$ it holds

[TABLE]

with probability at least $1-2e^{-x}$ . By the union bound and taking $x=2\log d$ in the last display, we arrive at the stated result. ∎

C.2 Proof of Proposition 1

Define the error vector $\bm{\Delta}=\bm{\beta}^{*}-\bm{\beta}^{*}_{\tau}$ and function $h(\bm{\beta})=n^{-1}\sum_{i=1}^{n}\mathbb{E}\{\ell_{\tau}(y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle)\}$ , $\bm{\beta}\in\mathbb{R}^{d}$ . By the optimality of $\bm{\beta}^{*}_{\tau}$ and the mean value theorem, we have $\nabla h(\bm{\beta}^{*}_{\tau})=\textbf{0}$ and thus

[TABLE]

where $\widetilde{\bm{\beta}}_{1}=\lambda\bm{\beta}^{*}+(1-\lambda)\bm{\beta}^{*}_{\tau}$ for some $0\leq\lambda\leq 1$ .

Case 1. First we consider the case of $0<\delta<1$ . Since $\mathbb{E}(\varepsilon_{i})=0$ , we have $-\mathbb{E}\{\psi_{\tau}(\varepsilon_{i})\}=\mathbb{E}\{\varepsilon_{i}1(|\varepsilon_{i}|>\tau)-\tau 1(\varepsilon_{i}>\tau)+\tau 1(\varepsilon_{i}<-\tau)\}$ and therefore

[TABLE]

Taking $\widetilde{\varepsilon}_{i}=y_{i}-\langle\bm{x}_{i},\widetilde{\bm{\beta}}_{1}\rangle$ , we see that

[TABLE]

Note that

[TABLE]

This, together with the convexity of $h$ , implies that $h(\widetilde{\bm{\beta}}_{1})\leq\lambda h(\bm{\beta}^{*})+(1-\lambda)h(\bm{\beta}^{*}_{\tau})\leq h(\bm{\beta}^{*})\leq v_{\delta}\tau^{1-\delta}$ , where $v_{\delta}=n^{-1}\sum_{i=1}^{n}v_{i,\delta}$ . For the lower bound, note that $h(\bm{\beta})\geq n^{-1}\sum_{i=1}^{n}\mathbb{E}\{(\tau|y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle|-\tau^{2}/2)1(|y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle|>\tau)\}$ for all $\bm{\beta}\in\mathbb{R}^{d}$ . Putting these upper and lower bounds on $h(\widetilde{\bm{\beta}}_{1})$ together yields

[TABLE]

as a consequence of which $n^{-1}\sum_{i=1}^{n}\mathbb{P}(|\widetilde{\varepsilon}_{i}|>\tau)\leq 2v_{\delta}\tau^{-1-\delta}$ . Combining this with (45), we deduce that as long as $\tau>(2v_{\delta}\widetilde{M}^{2})^{1/(1+\delta)}$ ,

[TABLE]

This provides a lower bound for the left-hand side of (43). On the other hand, using (44) and Hölder’s inequality to bound the right-hand side of (43), the claim (7) for $0<\delta<1$ follows immediately.

Case 2. Next we assume $\delta\geq 1$ and note that $v_{i,1}=\mathbb{E}(\varepsilon_{i}^{2})$ . In this case, we have $\mathbb{E}\{\ell_{\tau}(\varepsilon_{i})\}\leq\frac{1}{2}v_{i,1}$ and $|\mathbb{E}\{\psi_{\tau}(\varepsilon_{i})\}|\leq v_{i,\delta}\tau^{-\delta}$ . Then, following the same arguments as above, it can be shown that as long as $\tau>v_{1}^{1/2}m_{n}$ ,

[TABLE]

This proves (7) for $\delta\geq 1$ and hence completes the proof. ∎
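The two elementary bounds used in Case 2, namely $\mathbb{E}\{\ell_{\tau}(\varepsilon_{i})\}\leq v_{i,1}/2$ and $|\mathbb{E}\{\psi_{\tau}(\varepsilon_{i})\}|\leq v_{i,\delta}\tau^{-\delta}$ , follow from pointwise inequalities: $\ell_{\tau}(x)\leq x^{2}/2$ and, since $\mathbb{E}(\varepsilon_{i})=0$ , $|x-\psi_{\tau}(x)|\leq|x|^{1+\delta}\tau^{-\delta}$ . The sketch below checks both on simulated draws; the $t$ -distribution with 2.2 degrees of freedom and the grid of $(\tau,\delta)$ values are illustrative assumptions.

```python
import numpy as np

def psi(x, tau):
    """Derivative of the Huber loss: clipping at +-tau."""
    return np.clip(x, -tau, tau)

def huber_loss(x, tau):
    a = np.abs(x)
    return np.where(a <= tau, 0.5 * x**2, tau * a - 0.5 * tau**2)

def check_bounds(eps, tau, delta):
    """Check the pointwise truncation-bias and loss bounds on a sample."""
    gap = np.abs(eps - psi(eps, tau))  # equals (|eps| - tau)_+
    bias_ok = np.all(gap <= np.abs(eps) ** (1 + delta) * tau ** (-delta) + 1e-9)
    loss_ok = np.all(huber_loss(eps, tau) <= 0.5 * eps**2 + 1e-9)
    return bool(bias_ok and loss_ok)

rng = np.random.default_rng(5)
eps = rng.standard_t(df=2.2, size=10_000)
results = {
    (tau, delta): check_bounds(eps, tau, delta)
    for tau in (1.0, 5.0, 20.0)
    for delta in (0.5, 1.0, 2.0)
}
```

Averaging the pointwise inequalities over the sample, or taking expectations, gives the moment bounds quoted in the proof.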

C.3 Proof of Theorem 1

Without loss of generality, we assume $t\geq 1$ throughout the proof; otherwise, $3e^{-t}\geq 1$ and the stated result holds trivially. For simplicity, we write $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}_{\tau}$ . Note that for any prespecified $r>0$ , we can construct an intermediate estimator, denoted by $\widehat{\bm{\beta}}_{\tau,\eta}=\bm{\beta}^{*}+\eta(\widehat{\bm{\beta}}-\bm{\beta}^{*})$ , such that $\|\widehat{\bm{\beta}}_{\tau,\eta}-\bm{\beta}^{*}\|_{2}\leq r$ . To see that, we take $\eta=1$ if $\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\|_{2}\leq r$ ; otherwise, we can always choose some $\eta\in(0,1)$ so that $\|\widehat{\bm{\beta}}_{\tau,\eta}-\bm{\beta}^{*}\|_{2}=r$ . Applying Lemma 2 gives

[TABLE]

where $\nabla\mathcal{L}_{\tau}(\widehat{\bm{\beta}})=\mathbf{0}$ according to the Karush-Kuhn-Tucker condition. By the mean value theorem for vector-valued functions, we have

[TABLE]

If there exists some $a_{0}>0$ such that

[TABLE]

then we have $a_{0}\|\widehat{\bm{\beta}}_{\tau,\eta}-\bm{\beta}^{*}\|_{2}^{2}\leq\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{2}\|\widehat{\bm{\beta}}_{\tau,\eta}-\bm{\beta}^{*}\|_{2}$ . Canceling the common factor on both sides yields

[TABLE]

Define the random vector $\bm{\xi}^{*}=\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})$ , which can be written as

[TABLE]

By definition (18), $\psi_{1}(x)=\tau^{-1}\psi_{\tau}(\tau x)$ . We write $\Psi_{j}=\!n^{-1}\sum_{i=1}^{n}(x_{ij}/L)\psi_{1}(\varepsilon_{i}/\tau)$ for $j=1,\ldots,d$ , such that $\|\bm{\xi}^{*}\|_{2}\leq d^{1/2}\|\bm{\xi}^{*}\|_{\infty}=Ld^{1/2}\tau\max_{1\leq j\leq d}|\Psi_{j}|$ . With $0<\delta\leq 1$ , it is easy to see that the function $\psi_{1}(\cdot)$ satisfies

[TABLE]

for all $u\in\mathbb{R}$ . It follows that

[TABLE]

This, together with the inequality $(1+u)^{v}\leq 1+uv$ for $u\geq-1$ and $0<v\leq 1$ , implies

[TABLE]

Consequently, we have

[TABLE]

where we used the inequality $1+u\leq e^{u}$ in the last step. For any $z\geq 0$ , using Markov’s inequality gives

[TABLE]

As long as $\tau\geq(2/z)^{1/(1+\delta)}$ , we have $\mathbb{P}(\Psi_{j}\geq v_{\delta}z)\leq e^{-v_{\delta}nz/2}$ . On the other hand, it can be similarly shown that $\mathbb{P}(-\Psi_{j}\geq v_{\delta}z)\leq e^{-v_{\delta}nz/2}$ . For any $t>0$ , taking $z=2t/(v_{\delta}n)$ in these two inequalities yields that as long as $\tau\geq(v_{\delta}n/t)^{1/(1+\delta)}$ ,

[TABLE]

Taking $r=\tau/(4\sqrt{2}M)$ , it follows from Lemma 3 and the definition of $\tau$ that with probability at least $1-e^{-t}$ , (48) holds with $a_{0}=c_{l}/2$ provided

[TABLE]

Combining (49) and (51) implies that, with probability at least $1-(2d+1)e^{-t}$ ,

[TABLE]

Provided $n\geq 16\sqrt{2}c_{l}^{-1}LMd^{1/2}t$ , the intermediate estimator $\widehat{\bm{\beta}}_{\tau,\eta}$ will lie in the interior of the ball with radius $r$ . By our construction in the beginning of the proof, this enforces $\eta=1$ and thus $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}_{\tau,\eta}$ . ∎
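The localization device used in this proof can be made concrete as follows. This is a small sketch with a function name of our choosing: $\eta$ equals 1 when the estimator already lies in the ball of radius $r$ around $\bm{\beta}^{*}$ , and otherwise $\eta=r/\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\|_{2}$ places the intermediate point on the boundary.

```python
import numpy as np

def localize(beta_hat, beta_star, r):
    """Return the intermediate point beta* + eta * (beta_hat - beta*) and eta."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta_star, dtype=float)
    dist = np.linalg.norm(diff)
    eta = 1.0 if dist <= r else r / dist
    return np.asarray(beta_star, dtype=float) + eta * diff, eta

beta_star = np.zeros(3)
# Estimator inside the ball: eta = 1 and the point is unchanged.
near, eta_near = localize([0.1, 0.0, 0.0], beta_star, r=1.0)
# Estimator outside the ball: the point is pulled back to the boundary.
far, eta_far = localize([3.0, 4.0, 0.0], beta_star, r=1.0)
```

If the intermediate point can be shown to lie strictly inside the ball, then $\eta<1$ is impossible, which is exactly how the proof concludes that the intermediate estimator coincides with the original one.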

C.4 Proof of Theorem 2

We start by defining a simple class of distributions for the response variable $y$ as $\mathcal{P}_{c,\gamma}=\big\{\mathbb{P}_{c}^{+},\mathbb{P}_{c}^{-}\big\}$ , where

[TABLE]

Here, we suppress the dependence of $\mathbb{P}^{+}_{c}$ and $\mathbb{P}^{-}_{c}$ on $\gamma$ for convenience. It follows that, for any $0<\delta\leq 1$ , the $(1+\delta)$ -th absolute central moment $v_{\delta}$ of $y$ with law either $\mathbb{P}_{c}^{+}$ or $\mathbb{P}_{c}^{-}$ is

[TABLE]

For $i=1,\ldots,n$ , let $(y_{1i},y_{2i})$ be independent pairs of real-valued random variables satisfying

[TABLE]

Let $\bm{y}_{k}=(y_{k1},\ldots,y_{kn})^{\mathrm{\scriptstyle T}}$ for $k=1,2$ , and $\xi\in(0,1/2].$ Taking $\gamma=\log\{1/(2\xi)\}/(2n)$ with $\xi\geq e^{-n}/2$ , we obtain $1-\gamma\geq 1/2$ and

[TABLE]

By assumption, we know that there is an $n$ -dimensional vector $\mathbf{u}\in\{-1,+1\}^{n}$ with each coordinate taking $-1$ or $1$ such that $\frac{1}{n}\|\mathbf{X}^{\mathrm{\scriptstyle T}}\mathbf{u}\|_{\min}\geq\alpha$ . Note that this assumption naturally holds for the mean model, where $\mathbf{X}=(1,\ldots,1)^{\mathrm{\scriptstyle T}}$ and $\alpha$ can be taken as $1$ . Now we take $\bm{c}$ , $\bm{\beta}_{1}^{*}$ and $\bm{\beta}_{2}^{*}$ such that $\bm{c}=c\mathbf{u}$ for a $c>0$ , $\mathbf{X}\bm{\beta}_{1}^{*}=\bm{c}\gamma$ and $\bm{\beta}_{2}^{*}=-\bm{\beta}_{1}^{*}$ , which indicates that

[TABLE]

Let $\widehat{\bm{\beta}}_{k}(\bm{y}_{k})$ be any estimator, possibly depending on $\xi$ ; then the above calculation yields

[TABLE]

where we suppress the dependence of $\widehat{\bm{\beta}}_{k}$ on $\bm{y}_{k}$ for simplicity. The fact that $c\gamma\geq v_{\delta}^{1/(1+\delta)}(\gamma/2)^{\delta/(1+\delta)}$ further implies

[TABLE]

Now since $\mathcal{P}_{c,\gamma}\subseteq\mathcal{P}_{\delta}^{v_{\delta}}$ , taking $\log\{1/(2\xi)\}={2t}$ implies the result for the case where $\delta\in(0,1]$ . When $\delta>1$ , the second moment exists, and therefore using the fact that $v_{1}\!<\!\infty$ completes the proof. ∎
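The calibration of $\gamma$ used in this proof can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative and not from the paper, and it assumes that $\|\cdot\|_{\min}$ denotes the smallest absolute entry of a vector. It evaluates $\gamma=\log\{1/(2\xi)\}/(2n)$ on the admissible range of $\xi$, confirms that the choice $\log\{1/(2\xi)\}=2t$ gives $\gamma=t/n$, and verifies that the sign-vector condition holds with $\alpha=1$ in the mean model.

```python
import math

def gamma_from_xi(n: int, xi: float) -> float:
    """gamma = log{1/(2 xi)} / (2n), as chosen in the proof of Theorem 2."""
    if not (math.exp(-n) / 2 <= xi <= 0.5):
        raise ValueError("xi must lie in [exp(-n)/2, 1/2]")
    return math.log(1.0 / (2.0 * xi)) / (2.0 * n)

def sign_condition_mean_model(n: int) -> float:
    """Return (1/n) * ||X^T u||_min for X = (1,...,1)^T and u = (1,...,1)^T.

    Assumption: ||.||_min is the smallest absolute entry; X^T u is a scalar here.
    """
    x = [1.0] * n
    u = [1.0] * n
    return abs(sum(a * b for a, b in zip(x, u))) / n

n = 50
# gamma over the admissible range of xi, including both endpoints.
gammas = [gamma_from_xi(n, xi) for xi in (0.5, 0.1, 1e-6, math.exp(-n) / 2)]

t = 3.0
xi_t = math.exp(-2.0 * t) / 2.0       # solves log{1/(2 xi)} = 2t
gamma_t = gamma_from_xi(n, xi_t)      # should equal t / n
alpha = sign_condition_mean_model(n)  # should equal 1
```

The constraint $\xi\geq e^{-n}/2$ is what keeps $\gamma\leq 1/2$, so that $1-\gamma\geq 1/2$ as stated in the proof.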

C.5 Proof of Theorem 3

We start with the proof of Lemma 1.

Proof of Lemma 1.

Let $\mathbf{H}_{\tau}=\nabla^{2}\mathcal{L}_{\tau}(\bm{\beta}),$ where we suppress the dependence on $\bm{\beta}$ . Then for any $(\bm{u},\bm{\beta})\in\mathcal{C}(k,\gamma,r)$ , we have

[TABLE]

As $\|\bm{x}_{i}\|_{\infty}\leq L$ for any $1\leq i\leq n$ , we have

[TABLE]

Moreover, for any $t\geq 0$ , applying Hoeffding’s inequality yields that, with probability at least $1-e^{-t}$ ,

[TABLE]

Putting the above calculations together, we obtain

[TABLE]

Consequently, as long as $\tau\geq 8Lr$ , the following inequality

[TABLE]

holds uniformly over $(\bm{u},\bm{\beta})\in\mathcal{C}(k,\gamma,r)$ with probability at least $1-e^{-t}$ , where the last inequality in (55) holds whenever $\tau\gtrsim(1+\gamma)^{2/(1+\delta)}\kappa_{l}^{-1/(1+\delta)}(L^{2}kv_{\delta})^{1/(1+\delta)}$ and $n\gtrsim(1+\gamma)^{4}\kappa_{l}^{-2}L^{4}k^{2}t$ . On the other side, it can be easily shown that $\langle\bm{u},\mathbf{H}_{\tau}\bm{u}\rangle\leq\kappa_{u}$ . This completes the proof of the lemma. ∎
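The upper bound $\langle\bm{u},\mathbf{H}_{\tau}\bm{u}\rangle\leq\kappa_{u}$ can be illustrated numerically. The sketch below assumes the standard almost-everywhere Hessian of the empirical Huber loss, $n^{-1}\sum_{i}\bm{x}_{i}\bm{x}_{i}^{\mathrm{T}}1(|y_{i}-\langle\bm{x}_{i},\bm{\beta}\rangle|\leq\tau)$, since the corresponding display is not reproduced here; the data and the helper name `huber_hessian` are hypothetical. Because each indicator is at most one, the quadratic form of $\mathbf{H}_{\tau}$ is dominated by that of the sample second-moment matrix.

```python
import numpy as np

def huber_hessian(X, y, beta, tau):
    """Hessian of the empirical Huber loss (defined almost everywhere):
    n^{-1} * sum_i x_i x_i^T * 1(|y_i - <x_i, beta>| <= tau)."""
    resid = y - X @ beta
    inlier = (np.abs(resid) <= tau).astype(float)
    return (X * inlier[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(0)
n, d = 200, 6
X = rng.uniform(-1.0, 1.0, size=(n, d))          # bounded design with L = 1
beta = np.zeros(d)
y = X @ beta + rng.standard_t(df=1.5, size=n)    # heavy-tailed noise

S_n = X.T @ X / n                                # sample second-moment matrix
H_small = huber_hessian(X, y, beta, tau=1.0)
# With tau above every |residual|, all indicators equal one and H = S_n.
H_large = huber_hessian(X, y, beta, tau=np.abs(y).max() + 1.0)

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
quad_H, quad_S = u @ H_small @ u, u @ S_n @ u    # quad_H <= quad_S
```

The lower bound is the delicate direction: it needs $\tau$ large enough that most residuals fall inside $[-\tau,\tau]$, which is what the conditions on $\tau$ and $n$ in the lemma ensure.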

The following lemma, adapted from Fan et al. (2018) with slight modification, shows that the solution $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}_{\tau,\lambda}$ falls in an $\ell_{1}$ -cone.

Lemma 8 ( $\ell_{1}$ -cone Property).

For any $\mathcal{E}$ such that ${\mathcal{S}}\subseteq\mathcal{E}$ , if $\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}\leq\lambda/2$ , then $\|(\widehat{\bm{\beta}}-\bm{\beta}^{*})_{\mathcal{E}^{c}}\|_{1}\leq 3\|{(\widehat{\bm{\beta}}-\bm{\beta}^{*})}_{\mathcal{E}}\|_{1}.$

Now we are ready to prove the theorem.

Proof of Theorem 3.

It suffices to prove the statement for $\delta\in(0,1]$ . We start by constructing an intermediate estimator $\widehat{\bm{\beta}}_{\eta}=\bm{\beta}^{*}+\eta(\widehat{\bm{\beta}}-\bm{\beta}^{*})$ such that $\|\widehat{\bm{\beta}}_{\eta}-\bm{\beta}^{*}\|_{1}\leq r$ for some $r>0$ to be specified. We take $\eta=1$ if $\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\|_{1}\leq r$ , and choose $\eta\in(0,1)$ so that $\|\widehat{\bm{\beta}}_{\eta}-\bm{\beta}^{*}\|_{1}=r$ otherwise. By Lemma 8, $\widehat{\bm{\beta}}_{\eta}$ also falls in an $\ell_{1}$ -cone:

[TABLE]

Under Condition 3, it follows from Lemma 1 that with probability at least $1-e^{-t}$ ,

[TABLE]

as long as $\tau\gtrsim\max\{(L^{2}kv_{\delta})^{1/(1+\delta)},Lr\}$ and $n\gtrsim L^{4}k^{2}t$ . Applying Lemma 2 and following the same calculations as in Lemma B.7 of Fan et al. (2018), we obtain

[TABLE]

which, combined with $\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})_{\mathcal{S}}\|_{\infty}\leq\lambda/2$ , implies that

[TABLE]

Inequalities in (56) imply that $\|\widehat{\bm{\beta}}_{\eta}-\bm{\beta}^{*}\|_{1}\leq 4\|(\widehat{\bm{\beta}}_{\eta}-\bm{\beta}^{*})_{\mathcal{S}}\|_{1}\leq 4s^{1/2}\|\widehat{\bm{\beta}}_{\eta}-\bm{\beta}^{*}\|_{2}\leq 12\kappa_{l}^{-1}s\lambda<r$ . By the construction of $\widehat{\bm{\beta}}_{\eta}$ , we conclude that $\widehat{\bm{\beta}}_{\eta}=\widehat{\bm{\beta}}$ , and thus the stated result holds. It remains to bound the probability that the event $\{\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})_{\mathcal{S}}\|_{\infty}\leq\lambda/2\}$ occurs. Recall the gradient of $\mathcal{L}_{\tau}$ evaluated at $\bm{\beta}^{*}$ , i.e. $\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})=-n^{-1}\sum_{i=1}^{n}\psi_{\tau}(\varepsilon_{i})\bm{x}_{i}$ . Following the same argument used in the proof of Theorem 1, we take $\tau=\tau_{0}(n/t)^{1/(1+\delta)}$ for some $\tau_{0}\geq\nu_{\delta}$ and reach $\mathbb{P}\{\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})_{\mathcal{S}}\|_{\infty}\geq 2L\tau n^{-1}t\}\leq 2se^{-t}$ . This, together with (57), proves (9).
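The gradient formula $\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})=-n^{-1}\sum_{i}\psi_{\tau}(\varepsilon_{i})\bm{x}_{i}$ can be sanity-checked against finite differences. The sketch below assumes the standard Huber loss $\ell_{\tau}(x)=x^{2}/2$ for $|x|\leq\tau$ and $\tau|x|-\tau^{2}/2$ otherwise, with score $\psi_{\tau}(x)=\mathrm{sign}(x)\min(|x|,\tau)$; the simulated data, the value $\tau_{0}=1$, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def psi(x, tau):
    """Huber score: psi_tau(x) = sign(x) * min(|x|, tau)."""
    return np.clip(x, -tau, tau)

def huber_loss(X, y, beta, tau):
    """Empirical Huber loss n^{-1} sum_i l_tau(y_i - <x_i, beta>)."""
    r = y - X @ beta
    quad = 0.5 * r**2
    lin = tau * np.abs(r) - 0.5 * tau**2
    return np.mean(np.where(np.abs(r) <= tau, quad, lin))

def huber_grad(X, y, beta, tau):
    """Closed form used in the proof: -n^{-1} sum_i psi_tau(r_i) x_i."""
    return -(X.T @ psi(y - X @ beta, tau)) / X.shape[0]

def adaptive_tau(tau0, n, t, delta):
    """tau = tau0 * (n / t)^{1 / (1 + delta)}."""
    return tau0 * (n / t) ** (1.0 / (1.0 + delta))

rng = np.random.default_rng(1)
n, d, delta = 300, 4, 0.5
X = rng.uniform(-1.0, 1.0, size=(n, d))
beta_star = np.array([1.0, -2.0, 0.0, 0.5])
y = X @ beta_star + rng.standard_t(df=2.0, size=n)
tau = adaptive_tau(tau0=1.0, n=n, t=np.log(n), delta=delta)

# Central finite differences as an independent check of the closed form.
h = 1e-6
num_grad = np.array([
    (huber_loss(X, y, beta_star + h * e, tau)
     - huber_loss(X, y, beta_star - h * e, tau)) / (2 * h)
    for e in np.eye(d)
])
grad = huber_grad(X, y, beta_star, tau)
```

Each summand $\psi_{\tau}(\varepsilon_{i})x_{ij}$ is bounded by $L\tau$ in absolute value, which is what makes a Bernstein-type bound applicable even though $\varepsilon_{i}$ itself is heavy-tailed.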

Finally, taking $t=(1+c)\log d$ for some $c>0$ yields that with probability at least $1-(2s+1)d^{-1-c}$ , $\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})_{{\mathcal{S}}}\|_{\infty}\leq 2L\tau_{0}\{(1+c)(\log d)/n\}^{\delta/(1+\delta)}$ . As implied by Condition 3 with $k=2s$ , we have $2s+1\leq d$ and thus (10) follows immediately. ∎
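For completeness, the exponent $\delta/(1+\delta)$ in the last bound comes from substituting $\tau=\tau_{0}(n/t)^{1/(1+\delta)}$ into the threshold $2L\tau n^{-1}t$ obtained above. A worked version of that step:

```latex
\begin{align*}
2L\tau\,\frac{t}{n}
  = 2L\tau_{0}\Big(\frac{n}{t}\Big)^{1/(1+\delta)}\frac{t}{n}
  = 2L\tau_{0}\Big(\frac{t}{n}\Big)^{1-1/(1+\delta)}
  = 2L\tau_{0}\Big(\frac{t}{n}\Big)^{\delta/(1+\delta)} ,
\end{align*}
% and setting t = (1+c) \log d gives
% 2 L \tau_0 \{ (1+c) (\log d) / n \}^{\delta/(1+\delta)} .
```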

C.6 Proof of Theorem 4

The proof follows an argument similar to that of Theorem 2. It suffices to prove the result for $\delta\in(0,1]$ . As in the proof of Theorem 2, we start by defining a simple class of distributions for the response variable $y$ as $\mathcal{P}_{c,\gamma}=\{\mathbb{P}_{c}^{+},\mathbb{P}_{c}^{-}\}$ , where

[TABLE]

Here, we suppress the dependence of $\mathbb{P}^{+}_{c}$ and $\mathbb{P}^{-}_{c}$ on $\gamma$ for convenience. It follows that, for any $0<\delta\leq 1$ , the $(1+\delta)$ -th absolute central moment $v_{\delta}$ of $y$ with law either $\mathbb{P}_{c}^{+}$ or $\mathbb{P}_{c}^{-}$ is

[TABLE]

We define the following $s$ -sparse sign-ball $\mathcal{U}_{n}$ as

[TABLE]

By assumption, there exist $\mathbf{u}\in\mathcal{U}_{n}$ and $\mathcal{A}$ with $|\mathcal{A}|=s$ such that $\|\mathbf{X}_{\mathcal{A}}^{\mathrm{\scriptstyle T}}\mathbf{u}\|_{\min}/n\geq\alpha$ . Take $\bm{\beta}^{*}_{1},\,\bm{\beta}^{*}_{2}$ supported on $\mathcal{A}$ and $\bm{c}\in\mathbb{R}^{n}$ such that $\bm{c}=c\mathbf{u}$ for a $c>0$ , $\mathbf{X}\bm{\beta}_{1}^{*}=\bm{c}\gamma$ and $\bm{\beta}_{2}^{*}=-\bm{\beta}_{1}^{*}$ . Let $\mathbb{P}^{+}$ be the distribution of $\bm{y}_{1}=\mathbf{X}\bm{\beta}^{*}_{1}+\bm{\varepsilon}$ and $\mathbb{P}_{-}$ that of $\bm{y}_{2}=\mathbf{X}\bm{\beta}^{*}_{2}+\bm{\varepsilon}$ . Clearly, we have

[TABLE]

Let $\mathcal{A}$ be the support of $\bm{\beta}_{1}^{*}$ . Then, we have

[TABLE]

Let $\widehat{\bm{\beta}}_{k}(\bm{y}_{k})$ be any $s$ -sparse estimator. With the above setup, we have

[TABLE]

where we suppress the dependence of $\widehat{\bm{\beta}}_{k}$ on $\bm{y}_{k}$ for simplicity. For the last quantity in the displayed inequality above, taking $\gamma=\log\{1/(2t)\}/(2n)$ with $t\geq e^{-n}/2$ , we obtain $1-\gamma\geq 1/2$ and

[TABLE]

Using the fact that $c\gamma\geq v_{\delta}^{1/(1+\delta)}({\gamma}/{2})^{\delta/(1+\delta)},$ this further implies

[TABLE]

Now since $\mathcal{P}_{c,\gamma}\subseteq\mathcal{P}_{\delta}^{v_{\delta}}$ , taking $t=d^{-A}/2$ implies the result for the case where $\delta\in(0,1]$ . When $\delta>1$ , the second moment exists. Thus using $v_{1}<\infty$ completes the proof. ∎
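The final substitution can be spelled out. With $\gamma=\log\{1/(2t)\}/(2n)$ and $t=d^{-A}/2$, the following short computation (a worked step, not taken verbatim from the paper) shows the resulting value of $\gamma$ and what the constraint $t\geq e^{-n}/2$ requires:

```latex
\begin{align*}
\gamma = \frac{\log\{1/(2t)\}}{2n}
       = \frac{\log(d^{A})}{2n}
       = \frac{A\log d}{2n},
\qquad
t\geq \frac{e^{-n}}{2}
  \;\Longleftrightarrow\; d^{-A}\geq e^{-n}
  \;\Longleftrightarrow\; A\log d\leq n .
\end{align*}
% In particular gamma <= 1/2 whenever A log d <= n, so 1 - gamma >= 1/2.
```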

C.7 Proof of Theorem 5

The proof is almost identical to that of Theorem 3. We only need to derive a probability bound for the event $\{\|\bm{\xi}^{*}_{{\mathcal{S}}}\|_{\infty}\leq\lambda/2\}$ under the assumed scaling and moment conditions, where $\bm{\xi}^{*}:=\nabla\mathcal{L}^{\varpi}_{\tau}(\bm{\beta}^{*})$ .

Recall that $\bm{x}^{\varpi}_{i}=(x^{\varpi}_{i1},\ldots,x^{\varpi}_{id})^{\mathrm{\scriptstyle T}}$ with $x_{ij}^{\varpi}=\psi_{\varpi}(x_{ij})$ for $i=1,\ldots,n$ and $j=1,\ldots,d$ . Define $\bm{z}_{i}=(z_{i1},\ldots,z_{id})^{\mathrm{\scriptstyle T}}=\bm{x}_{i}-\bm{x}_{i}^{\varpi}$ , where $z_{ij}=\{x_{ij}-\varpi\mathop{\mathrm{sign}}(x_{ij})\}1(|x_{ij}|>\varpi)$ . Moreover, write $\bm{z}_{i{\mathcal{S}}}=(z_{ij}1(j\in{\mathcal{S}}))\in\mathbb{R}^{d}$ and $\epsilon_{i}=\varepsilon_{i}+\langle\bm{z}_{i},\bm{\beta}^{*}\rangle$ . In this notation, we have $\bm{\xi}^{*}=-n^{-1}\sum_{i=1}^{n}\psi_{\tau}(\epsilon_{i}){\bm{x}}^{\varpi}_{i}$ . From the identity $\mathbb{E}\{\psi_{\tau}(\epsilon_{i}){x}^{\varpi}_{ij}\}=\mathbb{E}\{\langle\bm{z}_{i},\bm{\beta}^{*}\rangle{x}^{\varpi}_{ij}\}-\mathbb{E}\{\epsilon_{i}-\tau\mathop{\mathrm{sign}}(\epsilon_{i})\}{x}^{\varpi}_{ij}1(|\epsilon_{i}|>\tau)$ , we see that

[TABLE]

It then holds that

[TABLE]

For each $j$ fixed, note that

[TABLE]

Applying Bernstein’s inequality gives

[TABLE]

with probability at least $1-2e^{-t}$ . Taking the union bound over $j\in{\mathcal{S}}$ , we obtain that, with probability at least $1-2se^{-t}$ ,

[TABLE]

This, together with (60), implies that $\mathbb{P}\{\mathcal{E}(\tau,\varpi,\lambda)\}\geq 1-2se^{-t}$ provided

[TABLE]

This is the stated result. ∎
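The truncation bookkeeping used at the start of this proof can be verified directly. The sketch below, on simulated data with hypothetical truncation levels $\varpi=2$ and $\tau=3$, checks the decomposition $\bm{x}_{i}=\bm{x}_{i}^{\varpi}+\bm{z}_{i}$, the identity $\epsilon_{i}=\varepsilon_{i}+\langle\bm{z}_{i},\bm{\beta}^{*}\rangle=y_{i}-\langle\bm{x}_{i}^{\varpi},\bm{\beta}^{*}\rangle$, and the splitting $\psi_{\tau}(\epsilon)=\epsilon-\{\epsilon-\tau\,\mathrm{sign}(\epsilon)\}1(|\epsilon|>\tau)$. The helper names are illustrative.

```python
import numpy as np

def psi(x, level):
    """Truncation operator psi_level(x) = sign(x) * min(|x|, level)."""
    return np.clip(x, -level, level)

def excess(x, level):
    """Remainder {x - level * sign(x)} * 1(|x| > level)."""
    return (x - level * np.sign(x)) * (np.abs(x) > level)

rng = np.random.default_rng(2)
n, d, varpi, tau = 100, 5, 2.0, 3.0
X = rng.standard_t(df=3.0, size=(n, d))        # heavy-tailed covariates
beta_star = np.array([1.5, 0.0, -1.0, 0.0, 0.0])
eps = rng.standard_t(df=2.0, size=n)           # heavy-tailed noise
y = X @ beta_star + eps

X_trunc = psi(X, varpi)                        # x_i^varpi, entrywise truncation
Z = excess(X, varpi)                           # z_i = x_i - x_i^varpi
resid_trunc = y - X_trunc @ beta_star          # epsilon_i + <z_i, beta*>
# Gradient of the truncated-design Huber loss at beta*.
xi_star = -(X_trunc.T @ psi(resid_trunc, tau)) / n
```

After truncation every summand of `xi_star` is bounded by $\varpi\tau$ in absolute value, which is the boundedness that Bernstein's inequality exploits.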

C.8 Proof of Theorem 7

To begin with, define the parameter set $\Theta_{0}(r)=\{\bm{\beta}\in\mathbb{R}^{d}:\|\bm{\beta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}\leq r\}$ for some $r>0$ to be specified, and let $\widehat{\bm{\beta}}_{\tau,\eta}\in\Theta_{0}(r)$ be the intermediate estimator introduced in the proof of Theorem 1.

Proof of (19). In view of (47) and (49), the heart of the argument is to derive deviation inequalities for $\|\bm{\Sigma}^{-1/2}\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{2}$ under the moment condition that $v_{\delta}<\infty$ for some $0<\delta\leq 1$ , and to establish the restricted strong convexity of the Huber loss $\mathcal{L}_{\tau}$ , i.e. there exists some $\kappa>0$ such that

[TABLE]

holds uniformly over $\bm{\beta}$ in a neighborhood of $\bm{\beta}^{*}$ .

First, from (41) in Lemma 6 we see that

[TABLE]

with probability at least $1-e^{-t}$ . Next, since $\widehat{\bm{\beta}}_{\tau,\eta}\in\Theta_{0}(r)$ , Lemma 4 with $r=\tau/(4A_{1}^{2})$ gives, under the scaling (26),

[TABLE]

with probability at least $1-e^{-t}$ . Together, the last two displays and (47) imply that with probability at least $1-2e^{-t}$ , $\|\widehat{\bm{\beta}}_{\tau,\eta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}\leq 4r_{0}<r$ provided $n\geq C_{1}(d+t)$ , where $C_{1}>0$ is a constant depending only on $A_{0}$ . Following the same arguments as we used in the proof of Theorem 1, this proves (19).

Proof of (20). From the preceding proof, we see that

[TABLE]

as long as $n\geq C_{1}(d+t)$ , where $r_{1}=4r_{0}$ . Moreover, define random processes $\bm{\zeta}(\bm{\beta})=\mathcal{L}_{\tau}(\bm{\beta})-\mathbb{E}\mathcal{L}_{\tau}(\bm{\beta})$ and

[TABLE]

To bound $\|\bm{B}(\widehat{\bm{\beta}}_{\tau})\|_{2}=\|\bm{\Sigma}^{1/2}(\widehat{\bm{\beta}}_{\tau}-\bm{\beta}^{*})+\bm{\Sigma}^{-1/2}\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{2}$ , the key is to bound the supremum of the empirical process $\{\bm{B}(\bm{\beta}):\bm{\beta}\in\Theta_{0}(r)\}$ . To that end, we deal with $\bm{B}(\bm{\beta})-\mathbb{E}\{\bm{B}(\bm{\beta})\}$ and $\mathbb{E}\{\bm{B}(\bm{\beta})\}$ separately, starting with the latter. By the mean value theorem,

[TABLE]

where $\widetilde{\bm{\beta}}$ is a convex combination of $\bm{\beta}$ and $\bm{\beta}^{*}$ . Therefore,

[TABLE]

For $\bm{\beta}\in\Theta_{0}(r)$ and $\bm{u}\in\mathbb{S}^{d-1}$ , write $\bm{\delta}=\bm{\Sigma}^{1/2}(\bm{\beta}-\bm{\beta}^{*})$ such that $\|\bm{\delta}\|_{2}\leq r$ . Let $A_{1}>0$ be the constant in Lemma 4 that scales as $A_{0}$ . It follows that

[TABLE]

which further implies

[TABLE]

Next, we consider $\bm{B}(\bm{\beta})-\mathbb{E}\{\bm{B}(\bm{\beta})\}=\bm{\Sigma}^{-1/2}\{\nabla\bm{\zeta}(\bm{\beta})-\nabla\bm{\zeta}(\bm{\beta}^{*})\}$ . With $\bm{\delta}=\bm{\Sigma}^{1/2}(\bm{\beta}-\bm{\beta}^{*})$ , define a new process $\overline{\bm{B}}(\bm{\delta})=\bm{B}(\bm{\beta})-\mathbb{E}\{\bm{B}(\bm{\beta})\}$ , satisfying $\overline{\bm{B}}(\textbf{0})=\textbf{0}$ and $\mathbb{E}\{\overline{\bm{B}}(\bm{\delta})\}=\textbf{0}$ . Note that, for every $\bm{u},\bm{v}\in\mathbb{S}^{d-1}$ and $\lambda\in\mathbb{R}$ ,

[TABLE]

Under Condition 5, there exist constants $C_{2},C_{3}>0$ depending only on $A_{0}$ such that, for any $|\lambda|\leq\sqrt{n/C_{2}}$ ,

[TABLE]

With the above preparations and applying Theorem A.3 in Spokoiny (2013), we reach

[TABLE]

as long as $n\geq C_{2}(8d+2t)$ . Together with (63), this yields

[TABLE]

with probability at least $1-e^{-t}$ . Combine this bound with (61) to obtain the stated result (20). ∎

C.9 Proof of Proposition 2

Since $\mathbb{E}(\varepsilon)=0$ , we have $\mathbb{E}\{\psi_{\tau}(\varepsilon)\}=-\mathbb{E}\{(\varepsilon-\tau)1(\varepsilon>\tau)\}+\mathbb{E}\{(-\varepsilon-\tau)1(\varepsilon<-\tau)\}$ . Thus, for any $2\leq q\leq 2+\kappa$ , $|\mathbb{E}\psi_{\tau}(\varepsilon)|\leq\mathbb{E}\{(|\varepsilon|-\tau)1(|\varepsilon|>\tau)\}\leq\tau^{1-q}\,\mathbb{E}(|\varepsilon|^{q})$ . In particular, taking $q$ to be 2 and $2+\kappa$ proves the first conclusion. Next, note that $\mathbb{E}\{\psi_{\tau}^{2}(\varepsilon)\}=\mathbb{E}(\varepsilon^{2})-\{\mathbb{E}\varepsilon^{2}1(|\varepsilon|>\tau)-\tau^{2}\mathbb{P}(|\varepsilon|>\tau)\}$ . Letting $\eta=|\varepsilon|$ , we deduce that

[TABLE]

By Markov’s inequality, $\int_{\tau}^{\infty}y\mathbb{P}(\eta>y)\,dy\leq\mathbb{E}(\eta^{2+\kappa})\int_{\tau}^{\infty}y^{-1-\kappa}\,dy=\kappa^{-1}\tau^{-\kappa}\,\mathbb{E}(\eta^{2+\kappa})$ . Putting the above calculations together proves the second inequality. ∎
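The two elementary facts used in this proof can be checked exactly on a small discrete distribution. The sketch below uses a hypothetical five-point, mean-zero law for $\varepsilon$ and $\tau=2.5$ (both chosen only for illustration), and evaluates the bias chain $|\mathbb{E}\psi_{\tau}(\varepsilon)|\leq\mathbb{E}\{(|\varepsilon|-\tau)1(|\varepsilon|>\tau)\}\leq\tau^{1-q}\mathbb{E}|\varepsilon|^{q}$ together with the second-moment identity for $\psi_{\tau}$.

```python
import numpy as np

def psi(x, tau):
    """Huber score: psi_tau(x) = sign(x) * min(|x|, tau)."""
    return np.clip(x, -tau, tau)

# Hypothetical mean-zero distribution: P(eps = values[k]) = probs[k].
values = np.array([-3.0, -1.0, 0.0, 2.0, 8.0])
probs = np.array([0.20, 0.30, 0.20, 0.25, 0.05])
tau = 2.5

def expect(f):
    """Exact expectation of f(eps) under the discrete law above."""
    return float(np.sum(probs * f(values)))

mean_eps = expect(lambda e: e)                                    # zero
bias = abs(expect(lambda e: psi(e, tau)))                         # |E psi_tau(eps)|
excess_mean = expect(lambda e: (np.abs(e) - tau) * (np.abs(e) > tau))
moment_bounds = {
    q: tau ** (1 - q) * expect(lambda e, q=q: np.abs(e) ** q)
    for q in (2.0, 2.5, 3.0)
}

# Identity: E psi^2 = E eps^2 - {E eps^2 1(|eps|>tau) - tau^2 P(|eps|>tau)}.
lhs = expect(lambda e: psi(e, tau) ** 2)
rhs = expect(lambda e: e ** 2) - (
    expect(lambda e: e ** 2 * (np.abs(e) > tau))
    - tau ** 2 * expect(lambda e: 1.0 * (np.abs(e) > tau))
)
```

For this example the bias equals $0.175$ while the $q=2$ bound is $\tau^{-1}\mathbb{E}(\varepsilon^{2})=2.52$, so the bound is loose here but captures the $\tau^{1-q}$ decay as $\tau$ grows.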

C.10 Proof of Theorem 8

For simplicity, we write $\widehat{\bm{\beta}}=\widehat{\bm{\beta}}_{\tau,\lambda}$ and assume without loss of generality that $0<\delta\leq 1$ . As in the proof of Theorem 7, we construct an intermediate estimator $\widetilde{\bm{\beta}}_{\eta}=\bm{\beta}^{*}+\eta(\widehat{\bm{\beta}}-\bm{\beta}^{*})$ satisfying $\|\widetilde{\bm{\beta}}_{\eta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}\leq r$ for some $r>0$ to be specified. We take $\eta=1$ if $\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}\leq r$ ; otherwise, if $\|\widehat{\bm{\beta}}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}>r$ , there exists $\eta\in(0,1)$ such that $\|\widetilde{\bm{\beta}}_{\eta}-\bm{\beta}^{*}\|_{\bm{\Sigma},2}=r$ . Lemma 2 demonstrates that

[TABLE]

Next, let ${\mathcal{S}}\subseteq\{1,\ldots,d\}$ be the support of $\bm{\beta}^{*}$ and define the $\ell_{1}$ -cone $\mathcal{C}\subseteq\mathbb{R}^{d}$ :

[TABLE]

We claim that

[TABLE]

from which it follows

[TABLE]

where $\widehat{\bm{\delta}}:=\widehat{\bm{\beta}}-\bm{\beta}^{*}$ . To prove (65), first, from the optimality of $\widehat{\bm{\beta}}$ we see that

[TABLE]

By direct calculation, we have

[TABLE]

Under the scaling $\lambda\geq 2\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}$ , it follows from the convexity of $\mathcal{L}_{\tau}$ and the Cauchy-Schwarz inequality that

[TABLE]

Together, (67) and (68) imply $0\leq\frac{\lambda}{2}(3\|\widehat{\bm{\delta}}_{{\mathcal{S}}}\|_{1}-\|\widehat{\bm{\delta}}_{{\mathcal{S}}^{{\rm c}}}\|_{1})$ and thus $\widehat{\bm{\beta}}\in\mathcal{C}$ .

By the first-order optimality condition for the convex optimization problem (21),

[TABLE]

where $\widehat{\bm{z}}\in\partial\|\widehat{\bm{\beta}}\|_{1}$ satisfies $\langle\widehat{\bm{z}},\bm{\beta}^{*}-\widehat{\bm{\beta}}\rangle\leq\|\bm{\beta}^{*}\|_{1}-\|\widehat{\bm{\beta}}\|_{1}$ . Under the scaling $\lambda\geq 2\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}$ , it holds that

[TABLE]

Together with (64), this implies

[TABLE]

Moreover, we introduce $\widetilde{\bm{\delta}}_{\eta}=\widetilde{\bm{\beta}}_{\eta}-\bm{\beta}^{*}$ and note that $\widetilde{\bm{\delta}}_{\eta}=\eta\widehat{\bm{\delta}}$ . By (65), we also have $\widetilde{\bm{\beta}}_{\eta}\in\mathcal{C}$ under the assumed scaling.
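The localization device used here can be sketched in a few lines. The code below assumes that $\|\bm{v}\|_{\bm{\Sigma},2}=(\bm{v}^{\mathrm{T}}\bm{\Sigma}\bm{v})^{1/2}$ and that the cone $\mathcal{C}$ consists of vectors with $\|(\bm{\beta}-\bm{\beta}^{*})_{\mathcal{S}^{\rm c}}\|_{1}\leq 3\|(\bm{\beta}-\bm{\beta}^{*})_{\mathcal{S}}\|_{1}$; the numerical values of `beta_hat` and `Sigma` are hypothetical. It shows that shrinking toward $\bm{\beta}^{*}$ places the intermediate estimator on the boundary of the $r$-ball, and that cone membership is preserved because the cone is invariant under positive scaling.

```python
import numpy as np

def sigma_norm(v, Sigma):
    """||v||_{Sigma,2} = sqrt(v^T Sigma v); assumed meaning of the weighted norm."""
    return float(np.sqrt(v @ Sigma @ v))

def intermediate(beta_hat, beta_star, Sigma, r):
    """Return (eta, beta_eta) with beta_eta = beta* + eta * (beta_hat - beta*)."""
    dist = sigma_norm(beta_hat - beta_star, Sigma)
    eta = 1.0 if dist <= r else r / dist
    return eta, beta_star + eta * (beta_hat - beta_star)

def in_cone(delta, support, ratio=3.0):
    """Check ||delta_{S^c}||_1 <= ratio * ||delta_S||_1."""
    mask = np.zeros(delta.size, dtype=bool)
    mask[list(support)] = True
    return bool(np.abs(delta[~mask]).sum() <= ratio * np.abs(delta[mask]).sum())

Sigma = np.diag([1.0, 2.0, 0.5, 1.0])
beta_star = np.array([1.0, -1.0, 0.0, 0.0])
beta_hat = np.array([3.0, -2.5, 0.4, -0.2])    # hypothetical estimate
support = [0, 1]                               # support of beta_star
r = 1.0

eta, beta_eta = intermediate(beta_hat, beta_star, Sigma, r)
delta_hat = beta_hat - beta_star
delta_eta = beta_eta - beta_star               # equals eta * delta_hat
```

The proof then shows that the intermediate estimator in fact lies strictly inside the ball, which is only possible if $\eta=1$.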

Let $\Omega_{r}$ be the event on which (39) holds. Then $\mathbb{P}(\Omega_{r}^{{\rm c}})\leq d^{-1}$ under the scaling (38) and it holds on $\Omega_{r}\cap\{\lambda\geq 2\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}\}$ that

[TABLE]

Substituting this lower bound into (69) yields

[TABLE]

Canceling $\|\widetilde{}\bm{\delta}_{\eta}\|_{2}$ on both sides delivers

[TABLE]

under the scaling $\lambda\geq 2\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}$ and (38) .

It remains to calibrate the parameters $\tau,\lambda$ and $r$ . First, applying Lemma 7 with $\tau=\tau_{0}(n/\log d)^{1/(1+\delta)}$ , we see that

[TABLE]

with probability at least $1-2d^{-1}$ , where $c_{1}=(2\sqrt{2}+1)A_{0}+1$ . We therefore choose $\lambda=c_{2}\max_{1\leq j\leq d}\sigma_{jj}^{1/2}\tau_{0}\{(\log d)/n\}^{\delta/(1+\delta)}$ for some constant $c_{2}\geq 2c_{1}$ , such that $\lambda\geq 2\|\nabla\mathcal{L}_{\tau}(\bm{\beta}^{*})\|_{\infty}$ with probability at least $1-2d^{-1}$ . Next, according to (38), the restricted strong convexity (39) holds with $r\asymp\kappa_{l}^{-1/2}A_{0}\max_{1\leq j\leq d}\sigma_{jj}^{1/2}\tau\sqrt{(\log d)/n}$ . Putting the above calculations together, we conclude that

[TABLE]

with probability at least $1-3d^{-1}$ , assuming the scaling $n\gtrsim\kappa_{l}^{-1}A_{0}^{2}A_{1}^{4}\max_{1\leq j\leq d}\sigma_{jj}\,s\log d$ . By the construction of $\widetilde{\bm{\beta}}_{\eta}$ , with the same probability we must have $\eta=1$ and therefore $\widehat{\bm{\beta}}=\widetilde{\bm{\beta}}_{\eta}$ . The stated result (23) then follows from (70). ∎
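The parameter calibration in this proof translates into a small helper. The sketch below is illustrative only: the function name, the value $A_{0}=1$, and the choice $c_{2}=2c_{1}$ are assumptions, and in practice $\tau_{0}$ is unknown and has to be tuned. It computes $\tau=\tau_{0}(n/\log d)^{1/(1+\delta)}$ and $\lambda=c_{2}\max_{j}\sigma_{jj}^{1/2}\tau_{0}\{(\log d)/n\}^{\delta/(1+\delta)}$.

```python
import math

def tuning_parameters(n, d, delta, tau0, sigma_max, c2):
    """Return (tau, lam) with
    tau = tau0 * (n / log d)^{1/(1+delta)} and
    lam = c2 * sqrt(sigma_max) * tau0 * ((log d) / n)^{delta/(1+delta)},
    where sigma_max stands for max_j sigma_jj."""
    ratio = math.log(d) / n
    tau = tau0 * (1.0 / ratio) ** (1.0 / (1.0 + delta))
    lam = c2 * math.sqrt(sigma_max) * tau0 * ratio ** (delta / (1.0 + delta))
    return tau, lam

A0 = 1.0                                      # hypothetical sub-Gaussian constant
c1 = (2.0 * math.sqrt(2.0) + 1.0) * A0 + 1.0  # constant c_1 from the proof
c2 = 2.0 * c1                                 # smallest admissible c_2

n, d = 500, 1000
tau, lam = tuning_parameters(n, d, delta=1.0, tau0=1.0, sigma_max=1.0, c2=c2)
tau_h, lam_h = tuning_parameters(n, d, delta=0.5, tau0=1.0, sigma_max=1.0, c2=c2)
```

Note that $\lambda=c_{2}\max_{j}\sigma_{jj}^{1/2}\,\tau\,(\log d)/n$, so heavier tails (smaller $\delta$) lead to both a larger $\tau$ and a larger $\lambda$ whenever $(\log d)/n<1$.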

C.11 Proof of Corollary 1

Recall that $\bm{x}_{1},\ldots,\bm{x}_{n}$ are i.i.d. random vectors from a sub-Gaussian vector $\bm{x}=(x_{1},\ldots,x_{d})^{\mathrm{\scriptstyle T}}$ with $\mathbb{E}(\bm{x}\bm{x}^{\mathrm{\scriptstyle T}})=\bm{\Sigma}$ . Let $\bm{\Psi}=\mathbf{X}\bm{\Sigma}^{-1/2}$ be an $n\times d$ matrix whose rows are independent isotropic sub-Gaussian random vectors. Since $\kappa_{l}=\lambda_{\min}(\bm{\Sigma})>0$ , Definition 1 in Rudelson and Zhou (2013) holds with $s_{0}=s$ , $k_{0}=3$ , $A=\bm{\Sigma}^{1/2}$ and $K(s_{0},k_{0},A)=\kappa_{l}^{-1/2}$ . Taking $\delta=1$ in Theorem 16 of Rudelson and Zhou (2013) we obtain that, with probability at least $1-2d^{-1}$ ,

[TABLE]

for all $\bm{\beta}\in\mathcal{C}$ as long as $n\gtrsim\kappa_{l}^{-1}A_{0}^{4}\max_{1\leq j\leq d}\sigma_{jj}\,s\log d$ . This, together with (65) and (71), proves (24). ∎
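The whitening step $\bm{\Psi}=\mathbf{X}\bm{\Sigma}^{-1/2}$ can be checked at the population level. The sketch below, with a hypothetical $3\times 3$ covariance matrix, verifies that $\bm{\Sigma}^{-1/2}\bm{\Sigma}\bm{\Sigma}^{-1/2}=\mathbf{I}$ (so the rows of $\bm{\Psi}$ are isotropic) and that the operator norm of $\bm{\Sigma}^{-1/2}$ equals $\kappa_{l}^{-1/2}$, the quantity playing the role of $K(s_{0},k_{0},A)$ in the text.

```python
import numpy as np

def inv_sqrt_psd(Sigma):
    """Symmetric inverse square root of a positive definite matrix via eigh."""
    w, V = np.linalg.eigh(Sigma)
    return (V / np.sqrt(w)) @ V.T      # V diag(w^{-1/2}) V^T

# Hypothetical positive definite second-moment matrix E(x x^T).
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

Sigma_inv_half = inv_sqrt_psd(Sigma)
kappa_l = np.linalg.eigvalsh(Sigma).min()          # lambda_min(Sigma)

# E(psi psi^T) for psi = Sigma^{-1/2} x, computed from E(x x^T) = Sigma.
second_moment_psi = Sigma_inv_half @ Sigma @ Sigma_inv_half
op_norm_inv_half = np.linalg.norm(Sigma_inv_half, 2)
```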

References

Belloni, A. and Chernozhukov, V. (2011). $\ell_{1}$ -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39 82–130.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37 1705–1732.

Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford.

Bousquet, O. (2003). Concentration inequalities for sub-additive functions using the entropy method. In Stochastic Inequalities and Applications. Progress in Probability 56 213–247. Birkhäuser, Basel.

Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018). I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 46 814–841.

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin.

Lepski, O. V. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Theory of Probability and Its Applications, 36 682–697.

Loh, P.-L. and Wainwright, M. J. (2015). Regularized $M$ -estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16 559–616.

Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of $M$ -estimators with decomposable regularizers. Statistical Science, 27 538–557.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59 3434–3447.

Spokoiny, V. (2013). Bernstein–von Mises theorem for growing parameter dimension. Preprint. Available at arXiv:1302.3430.

van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36 614–645.

Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$ -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55 2183–2202.



Adaptive Huber Regression††thanks: Qiang Sun is Assistant Professor, Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada (E-mail: [email protected]).
