High Dimensional Robust $M$-Estimation: Arbitrary Corruption and Heavy Tails

Liu Liu; Tianyang Li; Constantine Caramanis

arXiv:1901.08237·cs.LG·May 31, 2019

TL;DR

This paper introduces a flexible framework for high-dimensional $M$-estimation under heavy tails and arbitrary corruptions, providing new robust algorithms with optimal statistical guarantees.

Contribution

The paper defines the Robust Descent Condition (RDC) and shows that it enables robust, minimax-optimal $M$-estimation algorithms in heavy-tailed and corrupted data scenarios.

Findings

01

Median-of-means gradient estimator satisfies RDC for heavy tails.

02

Trimmed gradient estimator satisfies RDC for arbitrary corruptions.

03

Robust Hard Thresholding achieves minimax optimal rates in tested scenarios.

Abstract

We consider the problem of sparsity-constrained $M$-estimation when both explanatory and response variables have heavy tails (bounded 4-th moments), or a fraction of arbitrary corruptions. We focus on the $k$-sparse, high-dimensional regime where the number of variables $d$ and the sample size $n$ are related through $n \sim k \log d$. We define a natural condition we call the Robust Descent Condition (RDC), and show that if a gradient estimator satisfies the RDC, then Robust Hard Thresholding (IHT using this gradient estimator) is guaranteed to obtain good statistical rates. The contribution of this paper is in showing that this RDC is a flexible enough concept to recover known results, and obtain new robustness results. Specifically, new results include: (a) For $k$-sparse high-dimensional linear- and logistic-regression with heavy tail (bounded 4-th moment) explanatory and response…

Equations

$$\left\lvert\left\langle \widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}}),\ \widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rangle\right\rvert\leq\Big(\alpha\left\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\right\rVert_{2}+\psi\Big)\left\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{2}.$$

$${\bm{\beta}}_{j}=\arg\min_{{\bm{\beta}}\in\mathbb{R}^{d-1}}\frac{1}{m}\sum_{i=1}^{m}\left(x_{ij}-\bm{x}_{i(j)}^{\top}{\bm{\beta}}\right)^{2},\quad\text{s.t. }\lVert{\bm{\beta}}\rVert_{0}\leq k,\quad\text{for each } j\in[d],$$

$$\left\lvert\left\langle \widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}}),\ \widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rangle\right\rvert\leq\lVert\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}})\rVert_{\infty}\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{1}.$$

$$\begin{aligned}
\left\lVert\mathbb{E}(\bm{g}-\bm{G})(\bm{g}-\bm{G})^{\top}\right\rVert_{op}
&\leq\sup_{\bm{v}_{1}\in S^{d-1}}\bm{v}_{1}^{\top}\,\mathbb{E}\left((\bm{x}\bm{x}^{\top}-\bm{\Sigma})\bm{\Delta}\bm{\Delta}^{\top}(\bm{x}\bm{x}^{\top}-\bm{\Sigma})\right)\bm{v}_{1}+\sigma^{2}\lVert\bm{\Sigma}\rVert_{op}\\
&\leq\sup_{\bm{v}_{1}\in S^{d-1}}\left\langle\bm{\Delta}\bm{\Delta}^{\top},\ \mathbb{E}\,(\bm{x}\bm{x}^{\top}-\bm{\Sigma})\bm{v}_{1}\bm{v}_{1}^{\top}(\bm{x}\bm{x}^{\top}-\bm{\Sigma})\right\rangle+\sigma^{2}\lVert\bm{\Sigma}\rVert_{op}\\
&\overset{(i)}{\leq}\lVert\bm{\Delta}\rVert_{2}^{2}\sup_{\bm{v}_{1},\bm{v}_{2}\in S^{d-1}}\mathbb{E}\left(\bm{v}_{2}^{\top}(\bm{x}\bm{x}^{\top}-\bm{\Sigma})\bm{v}_{1}\right)^{2}+\sigma^{2}\lVert\bm{\Sigma}\rVert_{op}\\
&\leq 2\lVert\bm{\Delta}\rVert_{2}^{2}\sup_{\bm{v}_{1},\bm{v}_{2}\in S^{d-1}}\left(\mathbb{E}\left(\bm{v}_{2}^{\top}(\bm{x}\bm{x}^{\top})\bm{v}_{1}\right)^{2}+\lVert\bm{\Sigma}\rVert_{op}^{2}\right)+\sigma^{2}\lVert\bm{\Sigma}\rVert_{op}\\
&\leq 2\lVert\bm{\Delta}\rVert_{2}^{2}\sup_{\bm{v}_{1},\bm{v}_{2}\in S^{d-1}}\left(\sqrt{\mathbb{E}(\bm{v}_{2}^{\top}\bm{x})^{4}\,\mathbb{E}(\bm{x}^{\top}\bm{v}_{1})^{4}}+\lVert\bm{\Sigma}\rVert_{op}^{2}\right)+\sigma^{2}\lVert\bm{\Sigma}\rVert_{op}\\
&\overset{(ii)}{\leq}2(C_{4}+1)\lVert\bm{\Sigma}\rVert_{op}^{2}\lVert\bm{\Delta}\rVert_{2}^{2}+\sigma^{2}\lVert\bm{\Sigma}\rVert_{op},
\end{aligned}$$

$$\sup_{\bm{v}\in S^{d-1}}\bm{v}^{\top}(\widehat{\bm{G}}-\bm{G})\overset{(i)}{=}O\left(\sqrt{\lVert\mathrm{Cov}(\bm{g})\rVert_{op}\log d/n}\right)=O\left(\sqrt{\left(\rho^{2}\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}^{2}+\rho\sigma^{2}\right)\log d/n}\right)$$

$$\mathbb{E}\left[\exp\left(t(X-\mu)\right)\right]\leq\exp\left(\frac{\nu^{2}t^{2}}{2}\right),\quad\text{for all }\lvert t\rvert<\frac{1}{\nu}.$$

$\exp\left(-\frac{nt^{2}}{2\nu^{2}}\right)$ if $0\leq t\leq\nu$, and $\exp\left(-\frac{nt}{2\nu}\right)$ for $t>\nu$.

$$\Pr\left(\left\lvert\frac{1}{n}\sum_{i=1}^{n}X_{i}-\mu\right\rvert\geq t\right)\leq 2\exp\left(-n\min\left(\frac{t^{2}}{2\nu^{2}},\frac{t}{2\nu}\right)\right).$$

$$\left\lvert\mathrm{trmean}_{\alpha}\{x_{i}:i\in\mathcal{S}^{j}\}-\mu^{j}\right\rvert=O\left(\nu\left(\epsilon\log(nd)+\sqrt{\frac{\log d}{n}}\right)\right)$$

$$\bm{g}=\bm{x}(\bm{x}^{\top}{\bm{\beta}}-y),\quad\text{and}\quad\bm{G}=\mathbb{E}(\bm{g})=\bm{\Sigma}({\bm{\beta}}-{\bm{\beta}}^{*}),$$

$$\bm{v}^{\top}\bm{g}=\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{\Delta}-\bm{v}^{\top}\bm{x}\,\xi,\quad\text{and}\quad\bm{v}^{\top}\bm{G}=\bm{v}^{\top}\bm{\Sigma}\bm{\Delta}.$$

$$\mathbb{E}\left[\exp\left(t(\bm{v}^{\top}\bm{g}-\bm{v}^{\top}\bm{G})\right)\right]=\mathbb{E}\left[\exp\left(t\left(\bm{v}^{\top}(\bm{x}\bm{x}^{\top}-\bm{\Sigma})\bm{\Delta}-\bm{v}^{\top}\bm{x}\,\xi\right)\right)\right].$$

$$\begin{aligned}
\mathbb{E}_{\bm{x},\xi}\left[\exp\left(t\left(\bm{v}^{\top}(\bm{x}\bm{x}^{\top}-\bm{\Sigma})\bm{\Delta}-\bm{v}^{\top}\bm{x}\,\xi\right)\right)\right]
&\overset{(i)}{=}\sum_{k=0}^{\infty}\frac{1}{k!}(2t)^{k}\,\mathbb{E}\left[\gamma^{k}\left(\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{\Delta}-\bm{v}^{\top}\bm{x}\,\xi\right)^{k}\right]\\
&\overset{(ii)}{=}1+\sum_{l=1}^{\infty}\frac{1}{(2l)!}(2t)^{2l}\,\mathbb{E}\left[(\bm{v}^{\top}\bm{x})^{2l}(\bm{x}^{\top}\bm{\Delta}-\xi)^{2l}\right],
\end{aligned}$$

$$\mathbb{E}\left[(\bm{v}^{\top}\bm{x})^{2l}(\bm{x}^{\top}\bm{\Delta}-\xi)^{2l}\right]\leq\sqrt{\mathbb{E}\left[(\bm{v}^{\top}\bm{x})^{4l}\right]\,\mathbb{E}\left[(\bm{x}^{\top}\bm{\Delta}-\xi)^{4l}\right]}.$$

$\mathbb{E}\left[(\bm{v}^{\top}\bm{x})^{4l}\right]$ and $\mathbb{E}\left[(\bm{x}^{\top}\bm{\Delta}-\xi)^{4l}\right]$

$$\begin{aligned}
\mathbb{E}\left[\exp\left(t(\bm{v}^{\top}\bm{g}-\bm{v}^{\top}\bm{G})\right)\right]
&\overset{(i)}{\leq}1+\sum_{l=1}^{\infty}(4t)^{2l}(8e)^{4l}\left(\lVert\bm{\Delta}\rVert_{2}^{2}+\sigma^{2}\right)^{l}\\
&=1+\sum_{l=1}^{\infty}(4t)^{2l}\left((8e)^{2}\right)^{2l}\left(\sqrt{\lVert\bm{\Delta}\rVert_{2}^{2}+\sigma^{2}}\right)^{2l},
\end{aligned}$$

$$\mathbb{E}\left[\exp\left(t(\bm{v}^{\top}\bm{g}-\bm{v}^{\top}\bm{G})\right)\right]\leq\frac{1}{1-f^{2}(t)}\leq\exp\left(f^{2}(t)\right).$$

$$\left\lvert\mathrm{trmean}_{\alpha}\{x_{i}:i\in\mathcal{S}^{j}\}-\mu^{j}\right\rvert=O\left(\sqrt{\lVert\bm{\Delta}\rVert_{2}^{2}+\sigma^{2}}\left(\epsilon\log(nd)+\sqrt{\frac{\log d}{n}}\right)\right),$$

$$\lVert\widehat{\bm{G}}-\bm{G}\rVert_{\infty}=O\left(\sqrt{\lVert\bm{\Delta}\rVert_{2}^{2}+\sigma^{2}}\left(\epsilon\log(nd)+\sqrt{\frac{\log d}{n}}\right)\right),$$

$$\bm{g}=\frac{-y\bm{x}}{1+\exp(y\bm{x}^{\top}{\bm{\beta}})},$$

$$\left\lvert\mathrm{trmean}_{\alpha}\{x_{i}:i\in\mathcal{S}^{j}\}-\mu\right\rvert=O\left(\epsilon\log(nd)+\sqrt{\frac{\log d}{n}}\right)$$

$$\lVert\widehat{\bm{G}}-\bm{G}\rVert_{\infty}=O\left(\epsilon\log(nd)+\sqrt{\frac{\log d}{n}}\right)$$

E_{i \in_{u} G^{j}} x_{i} = \frac{\sum _{i \in G^{j}} x _{i}}{( 1 - ϵ ) n} \leq A_{1} \frac{\sum _{i \in G^{j}} x _{i}}{( 1 - ϵ ) n} + A_{2} \frac{\sum _{i \in G^{j} ∖ G^{j}} x _{i}}{( 1 - ϵ ) n} .

Pr (\frac{\sum _{i \in G^{j}} x _{i}}{( 1 - ϵ ) n} \geq c_{0} ν \frac{lo g d}{n})

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Statistical Methods and Inference · Stochastic Gradient Optimization Techniques

Full text

High Dimensional Robust $M$-Estimation: Arbitrary Corruption and Heavy Tails

Liu Liu

Tianyang Li

Constantine Caramanis

The University of Texas at Austin

Abstract

We consider the problem of sparsity-constrained $M$-estimation when both explanatory and response variables have heavy tails (bounded 4-th moments), or a fraction of arbitrary corruptions. We focus on the $k$-sparse, high-dimensional regime where the number of variables $d$ and the sample size $n$ are related through $n\sim k\log d$. We define a natural condition we call the Robust Descent Condition (RDC), and show that if a gradient estimator satisfies the RDC, then Robust Hard Thresholding (IHT using this gradient estimator) is guaranteed to obtain good statistical rates. The contribution of this paper is in showing that this RDC is a flexible enough concept to recover known results, and obtain new robustness results. Specifically, new results include: (a) For $k$-sparse high-dimensional linear- and logistic-regression with heavy tail (bounded 4-th moment) explanatory and response variables, a linear-time-computable median-of-means gradient estimator satisfies the RDC, and hence Robust Hard Thresholding is minimax optimal; (b) When instead of heavy tails we have an $O(1/(\sqrt{k}\log(nd)))$-fraction of arbitrary corruptions in explanatory and response variables, a near linear-time computable trimmed gradient estimator satisfies the RDC, and hence Robust Hard Thresholding is minimax optimal. We demonstrate the effectiveness of our approach in sparse linear regression, sparse logistic regression, and sparse precision matrix estimation on synthetic and real-world US equities data.

1 Introduction

$M$ -estimation is a standard technique for statistical estimation [vdV00]. The past decade has seen successful extensions of $M$ -estimation to the high dimensional setting with sparsity (or other low-dimensional structure), e.g., using Lasso [Tib96, BvdG11, HTW15, Wai19]. Yet sparse modeling in high dimensions is NP-hard in the worst case [BDMS13, ZWJ14]. Thus theoretical sparse recovery guarantees for most computationally tractable approaches (e.g., $\ell_{1}$ minimization [Don06, CRT04, Wai09], Iterative Hard Thresholding [BD09]) rely on strong assumptions on the probabilistic models of the data, such as sub-Gaussianity. Under such assumptions, these approaches achieve the minimax rate for sparse regression [RWY11].

Meanwhile, statistical estimation with heavy tailed outliers or even arbitrary corruptions has long been a focus in robust statistics [Box53, Tuk75, Hub11, HRRS11]. (Footnote 1: Following [Min18, FWZ16], by heavy-tail we mean satisfying only weak moment bounds, specifically, bounded 4-th order moments, compared to sub-exponential or sub-Gaussian.) But heavy tails and arbitrary corruptions in the data violate the assumptions required for convergence of the usual algorithms. A central question, then, is what assumptions are sufficient to enable efficient and robust algorithms for high dimensional $M$-estimation under heavy tails or arbitrary corruption.

Huber’s seminal work [Hub64] and more modern follow-up work [Loh17] have considered replacing the classical least squares risk minimization objective with a robust counterpart (e.g., Huber loss). Other approaches (e.g., [Li13]) considered regularization-based robustness approaches. However, when there are outliers in the explanatory variables (covariates), these approaches do not seem to succeed [CCM13]. Meanwhile, approaches combining recent advances in robust mean estimation and gradient descent have proved remarkably powerful in the low-dimensional setting [PSBR18, KKM18, DKK+18], but for high dimensions, have so far only managed to address the setting where the covariance of the explanatory variables is the identity, or sparse [BDLS17, LSLC18]. Meanwhile, flexible and statistically optimal approaches ([Gao17]) have relied on intractable estimators such as Tukey-depth.

For the heavy-tail setting, another line of research considers estimators such as Median-of-Means (MOM) [NY83, JVV86, AMS99, Min15] and Catoni’s mean estimator [Cat12, Min18], which use only weak moment assumptions. [Min15, BJL15, HS16] generalized these ideas to $M$-estimation, yet it is not clear if these approaches apply to the high-dimensional setting with heavy tailed covariates.

Main Contributions. In this paper, we develop a sufficient condition that when satisfied, guarantees that an efficient algorithm (a variant of IHT) achieves the minimax optimal statistical rate. We show that our condition is flexible enough to apply to a number of important high-dimensional estimation problems under either heavy tails, or arbitrary corruption of the data. Specifically:

1. We consider two models. For our arbitrary corruption model, we assume that an adversary replaces an arbitrary $\epsilon$-fraction of the authentic samples with arbitrary values (Definition 2.1). For the heavy-tailed model, we assume our data (response and covariates) satisfy only weak moment assumptions (Definition 2.2) without sub-Gaussian or sub-exponential concentration bounds.

2. We propose a notion that we call the Robust Descent Condition (RDC). Given any gradient estimator that satisfies the RDC, we define RHT – Robust Hard Thresholding (Algorithm 1) for sparsity constrained $M$-estimation, and prove that Algorithm 1 converges linearly to a minimax statistically optimal solution. Thus the RDC and Robust Hard Thresholding form the basis for a Deterministic Meta-Theorem (Theorem 3.1) that guarantees estimation error rates as soon as the RDC property of any gradient estimator can be certified.

3. We then obtain non-asymptotic bounds via certifying the RDC for different robust gradient estimators under various statistical models. (A) For corruptions in both response and explanatory variables, we show the trimmed gradient estimator satisfies the RDC. Thus our algorithm RHT has minimax-optimal statistical error, and tolerates an $O({1}/{(\sqrt{k}\log(nd))})$-fraction of outliers. This fraction is nearly independent of the dimension $d$, which is important in the high dimensional regime. (B) In the heavy tailed regime, we use the Median-of-Means (MOM) gradient estimator. Our RHT algorithm obtains the sharpest available error bound, in fact nearly matching the results in the sub-Gaussian case. With either of these gradient estimators, our algorithm is computationally efficient, nearly matching vanilla gradient descent. This is in particular much faster than algorithms relying on sparse PCA relaxations as subroutines ([BDLS17, LSLC18]).

4. We use Robust Hard Thresholding for neighborhood selection [MB06] for estimating Gaussian graphical models, and provide model selection guarantees under adversarial corruption of the data; our results share similar robustness guarantees with sparse regression.

5. We demonstrate the effectiveness of Robust Hard Thresholding on both arbitrarily corrupted/heavy tailed synthetic data and (unmodified) real data.

A concrete illustration of 3(B) above: Consider a sparse linear regression problem without noise (sparse linear equations), with scaling $n={O}(k\log d)$ . When the covariates are sub-Gaussian, Lasso succeeds in exact recovery with high probability (as expected). When the covariates have only 4-th moments, we do not expect Lasso to succeed, and indeed experiments indicate this. Moreover, to the best of our knowledge, no previous efficient algorithm with ${O}(k\log(d))$ samples can guarantee exact recovery in this observation model ([FWZ16] has a statistical rate depending on the norm of the parameter $\bm{\beta}^{\ast}$ , and thus exact recovery for $\sigma=0$ is not guaranteed). Our contributions show that Robust Hard Thresholding using MOM achieves this (see also simulations in Figure 2(b)).

Related work

Sparse regression with arbitrary corruptions or heavy tails. Several works in robustness of high dimensional problems consider heavy tailed distributions or arbitrary corruptions only in the response variables [Li13, BJK15, BJK17, Loh17, KP18, HS16, Min15, CLZL]. Yet these algorithms cannot be trivially extended to the setting with heavy tails or corruptions in explanatory variables. Another line [ACG13, VMX17, YLA18, SS18] focuses on alternating minimization approaches which extend Least Trimmed Squares [Rou84]. However, these methods only have local convergence guarantees, and cannot handle arbitrary corruptions.

[CCM13] was one of the first papers to provide guarantees for sparse regression with arbitrary outliers in both response and explanatory variables by trimming the design matrix. Similar trimming techniques are also used in [FWZ16] for heavy tails in response and explanatory variables. Those results are specific to sparse regression, however, and cannot be easily extended to general $M$-estimation problems. Moreover, even for linear regression, the statistical rates are not minimax optimal. [LM16] uses Median-of-Means tournaments to deal with heavy tails in the explanatory variables and obtains near optimal rates. However, Median-of-Means tournaments are not known to be computationally tractable. [LL17] deals with heavy tails and outliers in the explanatory variables, but requires a higher moment bound (whose order is $O(\log(d))$) in the isotropic design case. [Gao17] optimizes Tukey depth [Tuk75, CGR18] for robust sparse regression under the Huber $\epsilon$-contamination model; the resulting algorithm is minimax optimal and can handle a constant fraction of outliers. However, computing Tukey depth is intractable [JP78]. Recent results [BDLS17, LSLC18] leverage robust sparse mean estimation in robust sparse regression. Their algorithms are computationally tractable, and can tolerate $\epsilon=\text{const.}$, but they require very restrictive assumptions on the covariance matrix ($\bm{\Sigma}=\bm{I}_{d}$ or sparse), which precludes their use in applications such as graphical model estimation.

Robust $M$-estimation via robust gradient descent. Works in [CSX17, HI17] and later [YCRB18a] first leveraged the idea of using robust mean estimation in each step of gradient descent, using a subroutine such as geometric median. A similar approach using more sophisticated robust mean estimation methods was later proposed in [PSBR18, DKK+18, YCRB18b, SX18, Hol18] for robust gradient descent. These methods all focused on low dimensional robust $M$-estimation. Work in [LSLC18] extended the approach to the high-dimensional setting (though it is limited to $\bm{\Sigma}=\bm{I}_{d}$ or sparse covariances). Even though the corrupted fraction $\epsilon$ can be independent of the ambient dimension $d$ by using sophisticated robust mean estimation algorithms [DKK+16, LRV16, SCV17], or the sum-of-squares framework [KKM18], these algorithms (except [LSLC18]) are not applicable to the high dimensional setting ($n\ll d$), as they require at least $\Omega(d)$ samples.

Robust estimation of graphical models. A line of research using a robustified covariance matrix in Gaussian graphical models [LHY+12b, WG17, LT18] leverages GLasso [FHT08] or CLIME [CLL11] to estimate the sparse precision matrix. These robust methods are restricted to Gaussian graphical model estimation, and their techniques cannot be generalized to other $M$-estimation problems.

Notation. We denote the Hard Thresholding operator of sparsity $k^{\prime}$ by $P_{k^{\prime}}$ , and denote the Euclidean projection onto the $\ell_{2}$ ball $B$ by $\Pi_{B}$ . We use $\operatorname{\mathbb{E}}_{i\in_{u}\mathcal{S}}$ to denote the expectation operator obtained by the uniform distribution over all samples $\{i\in\mathcal{S}\}$ .

2 Problem formulation

We now define the corruption and heavy tails model and sparsity constrained $M$ -estimation.

Definition 2.1 ( $\epsilon$ -corrupted samples).

Let $\{\bm{z}_{i},i\in{\mathcal{G}}\}$ be i.i.d. observations with distribution $P$ . We say that a collection of samples $\{\bm{z}_{i},i\in\mathcal{S}\}$ is $\epsilon$ -corrupted if an adversary chooses an arbitrary $\epsilon$ -fraction of the samples in ${\mathcal{G}}$ and modifies them with arbitrary values.

This corruption model allows corruptions in both explanatory and response variables in regression problems where we observe $\bm{z}_{i}=(y_{i},\bm{x}_{i})$. Definition 2.1 also allows the adversary to select an $\epsilon$-fraction of samples to delete and corrupt.
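To make the corruption model concrete, here is a minimal Python sketch that produces an $\epsilon$-corrupted copy of a regression sample. The function name `corrupt_samples`, the `magnitude` parameter, and the random choice of rows are illustrative assumptions: the adversary in Definition 2.1 may pick the rows and the replacement values after inspecting the data, which this stand-in does not attempt to model.

```python
import numpy as np

def corrupt_samples(y, X, eps, rng, magnitude=100.0):
    """Return an eps-corrupted copy of (y, X) and the corrupted row indices.

    Stand-in adversary (an assumption, not the worst case): choose
    floor(eps * n) rows uniformly at random and overwrite both the
    response and the covariates with large arbitrary values.
    """
    n, d = X.shape
    n_bad = int(np.floor(eps * n))
    bad = rng.choice(n, size=n_bad, replace=False)
    y_c, X_c = y.copy(), X.copy()
    X_c[bad] = magnitude * rng.standard_normal((n_bad, d))
    y_c[bad] = magnitude * rng.standard_normal(n_bad)
    return y_c, X_c, bad

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
y = X[:, 0] + 0.1 * rng.standard_normal(n)
y_c, X_c, bad = corrupt_samples(y, X, eps=0.1, rng=rng)
```

Both `y` and `X` are modified on the selected rows, matching the statement that corruptions may hit the response and the explanatory variables at once.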

Definition 2.2 (heavy-tailed samples).

For a distribution $P$ of $\bm{x}\in\operatorname{\mathbb{R}}^{d}$ with mean $\operatorname{\mathbb{E}}(\bm{x})$ and covariance $\bm{\Sigma}$, we say that $P$ has bounded $2k$-th moment if there is a universal constant $C_{2k}$ such that, for every unit vector ${\bm{v}}\in\operatorname{\mathbb{R}}^{d}$, we have $\operatorname{\mathbb{E}}_{P}\left\lvert\left\langle{{\bm{v}}},{\bm{x}-\operatorname{\mathbb{E}}(\bm{x})}\right\rangle\right\rvert^{2k}\leq C_{2k}\operatorname{\mathbb{E}}_{P}(\left\lvert\left\langle{{\bm{v}}},{\bm{x}-\operatorname{\mathbb{E}}(\bm{x})}\right\rangle\right\rvert^{2})^{k}$.

Definition 2.2 allows heavy tails in both explanatory and response variables for $\bm{z}_{i}=(y_{i},\bm{x}_{i})$. For example, in 4.3, we study linear regression with bounded 4-th moments for $\bm{x}$ and bounded variance for $y$ and noise.

Let $\ell:\operatorname{\mathbb{R}}^{d}\times\mathcal{Z}\rightarrow\operatorname{\mathbb{R}}$ be a convex and differentiable loss function. Our target is the unknown sparse population minimizer ${\bm{\beta}}^{*}=\arg\min_{{\bm{\beta}}\in\operatorname{\mathbb{R}}^{d},\left\lVert{\bm{\beta}}\right\rVert_{0}\leq k}\operatorname{\mathbb{E}}_{\bm{z}_{i}\sim P}\ell_{i}({\bm{\beta}};\bm{z}_{i})$, and we write $f$ for the population risk, $f({\bm{\beta}})=\operatorname{\mathbb{E}}_{\bm{z}_{i}\sim P}\ell_{i}({\bm{\beta}};\bm{z}_{i})$. Note that the definition of ${\bm{\beta}}^{*}$ allows model misspecification. Definition 2.3 below provides general assumptions for the population risk.

Definition 2.3 (Strong convexity/smoothness).

For the population risk $f$ , we assume $\mu_{\alpha}\lVert{\bm{\beta}}_{1}-{\bm{\beta}}_{2}\rVert_{2}^{2}/2\leq f({\bm{\beta}}_{1})-f({\bm{\beta}}_{2})-\left\langle{\nabla f({\bm{\beta}}_{2})},{{\bm{\beta}}_{1}-{\bm{\beta}}_{2}}\right\rangle\leq\mu_{L}\lVert{\bm{\beta}}_{1}-{\bm{\beta}}_{2}\rVert_{2}^{2}/2$ , where $\mu_{\alpha}$ is the strong-convexity parameter and $\mu_{L}$ is the smoothness parameter. The condition number is $\rho={\mu_{L}}/{\mu_{\alpha}}\geq 1$ .
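As a worked instance of Definition 2.3, consider the squared loss $\ell({\bm{\beta}};\bm{z})=\frac{1}{2}(y-\bm{x}^{\top}{\bm{\beta}})^{2}$. The sketch below assumes zero-mean covariates, so that $\operatorname{\mathbb{E}}(\bm{x}\bm{x}^{\top})=\bm{\Sigma}$; this assumption is ours and is made only for the illustration. The population risk is then quadratic in ${\bm{\beta}}$ with Hessian $\bm{\Sigma}$, and the middle term of Definition 2.3 can be computed exactly:

$$f({\bm{\beta}}_{1})-f({\bm{\beta}}_{2})-\left\langle\nabla f({\bm{\beta}}_{2}),{\bm{\beta}}_{1}-{\bm{\beta}}_{2}\right\rangle=\frac{1}{2}({\bm{\beta}}_{1}-{\bm{\beta}}_{2})^{\top}\bm{\Sigma}({\bm{\beta}}_{1}-{\bm{\beta}}_{2}),$$

$$\frac{\lambda_{\min}(\bm{\Sigma})}{2}\lVert{\bm{\beta}}_{1}-{\bm{\beta}}_{2}\rVert_{2}^{2}\leq\frac{1}{2}({\bm{\beta}}_{1}-{\bm{\beta}}_{2})^{\top}\bm{\Sigma}({\bm{\beta}}_{1}-{\bm{\beta}}_{2})\leq\frac{\lambda_{\max}(\bm{\Sigma})}{2}\lVert{\bm{\beta}}_{1}-{\bm{\beta}}_{2}\rVert_{2}^{2}.$$

In this case one may take $\mu_{\alpha}=\lambda_{\min}(\bm{\Sigma})$ and $\mu_{L}=\lambda_{\max}(\bm{\Sigma})$, so $\rho$ is the condition number of $\bm{\Sigma}$.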

A well known result [NRWY12] considers ERM with convex relaxation from $\left\lVert{\bm{\beta}}\right\rVert_{0}$ to $\left\lVert{\bm{\beta}}\right\rVert_{1}$ , by certifying the RSC condition for sub-Gaussian ensembles – this obtains uniform convergence of the empirical risk. From an optimization viewpoint, existing results reveal that gradient descent algorithms equipped with soft-thresholding [ANW12] or hard-thresholding [BD09, JTK14, SL17, YLZ18, LB18] have linear convergence rate, and achieve known minimax lower bounds in statistical estimation [RWY11, ZWJ14].

Given samples $\mathcal{S}$ , running ERM on the entire input dataset: $\min_{{\bm{\beta}}\in B,\lVert{\bm{\beta}}\rVert_{0}\leq k}\operatorname{\mathbb{E}}_{i\in_{u}\mathcal{S}}\ell_{i}({\bm{\beta}};\bm{z}_{i})$ , cannot guarantee uniform convergence of the empirical risk, and can be arbitrarily bad for $\epsilon$ -corrupted samples. The next two sections outline the main results of this paper, addressing this problem.

3 Robust sparse estimation via Robust Hard Thresholding

We introduce our meta-algorithm, Robust Hard Thresholding, that essentially uses a robust gradient estimator to run IHT. We require several definitions to specify the algorithm, and describe its results. We use $\widehat{\bm{G}}({\bm{\beta}})$ as a placeholder for the estimate at ${\bm{\beta}}$ , obtained from whichever robust gradient estimator we are using. Let ${\bm{G}}({\bm{\beta}})=\operatorname{\mathbb{E}}_{\bm{z}_{i}\sim P}\nabla\ell_{i}({\bm{\beta}};\bm{z}_{i})$ denote the population gradient. We use $\widehat{\bm{G}}$ and ${\bm{G}}$ when the context is clear.

Many previous works ([CSX17, HI17, PSBR18, DKK+18, YCRB18a, YCRB18b, SX18]) have provided algorithms for obtaining robust gradient estimators, then used as subroutines in robust gradient algorithms. However, those results require controlling $\|\widehat{\bm{G}}-{\bm{G}}\|_{2}$, and do not readily extend to high dimensions, as sufficiently controlling $\|\widehat{\bm{G}}-{\bm{G}}\|_{2}$ seems to require $n=\Omega(d)$. A recent work [LSLC18] on robust sparse linear regression uses a robust sparse mean estimator [BDLS17] to guarantee $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{2}=O(\delta_{1}\left\lVert{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{2}+\delta_{2})$ with sample complexity $\Omega(k^{2}\log(d))$. However, their algorithm requires the restrictive assumption $\bm{\Sigma}=I_{d}$ or sparse, and thus cannot be extended to more general $M$-estimation problems.

To address this issue, we propose Robust Hard Thresholding (Algorithm 1), which uses hard thresholding after each robust gradient update. (Footnote 2: Our theory requires splitting samples across different iterations to maintain independence between iterations. We believe this is an artifact of the analysis, and do not use this in our experiments. [BWY17, PSBR18] use a similar approach for theoretical analysis.) In line 7, we use a gradient estimator to obtain the robust gradient estimate $\widehat{\bm{G}}^{t}$. In line 8, we update the parameter by hard thresholding $\overline{{\bm{\beta}}}^{t+1}=P_{k^{\prime}}({\bm{\beta}}^{t}-\eta\widehat{\bm{G}}^{t})$, where the hyper-parameter $k^{\prime}$, proportional to $k$, is specified in 2.3. A key observation in line 8 is that, in each step of IHT, the iterate ${\bm{\beta}}^{t}$ is sparse, and thus the perturbation from outliers or heavy tails only depends on IHT’s sparsity $k^{\prime}$ instead of the ambient dimension $d$. Based on a careful analysis of the hard thresholding operator in each iteration, we show that rather than controlling $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{2}$, it is enough to control a weaker quantity: this is what we call the Robust Descent Condition (Definition 3.1), which we define next; it plays a key role in obtaining sharp rates of convergence for various types of statistical models.
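The update described above can be sketched in a few lines of Python. This is a simplified illustration, not a transcription of Algorithm 1: it omits the sample splitting used in the analysis, takes the gradient estimator as an arbitrary callable, and starts from zero. The names `hard_threshold`, `robust_hard_thresholding` and `mean_gradient` are ours. The demo runs on clean, noiseless data with identity covariance, where the plain sample-mean gradient serves as the estimator and the step size is $\eta=1/\mu_{L}=1$.

```python
import numpy as np

def hard_threshold(v, k):
    """P_k: keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argpartition(np.abs(v), -k)[-k:]
        out[keep] = v[keep]
    return out

def robust_hard_thresholding(grad_estimator, d, k_prime, eta, T):
    """Iterate beta <- P_{k'}(beta - eta * G_hat(beta)), starting from zero."""
    beta = np.zeros(d)
    for _ in range(T):
        beta = hard_threshold(beta - eta * grad_estimator(beta), k_prime)
    return beta

# Demo: noiseless sparse linear regression with standard Gaussian covariates.
rng = np.random.default_rng(0)
n, d, k = 400, 100, 5
beta_star = np.zeros(d)
beta_star[:k] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_star

def mean_gradient(beta):
    # Sample mean of the per-sample gradients x_i (x_i^T beta - y_i).
    return X.T @ (X @ beta - y) / n

beta_hat = robust_hard_thresholding(mean_gradient, d, k_prime=2 * k, eta=1.0, T=100)
print(np.linalg.norm(beta_hat - beta_star))
```

Swapping `mean_gradient` for a robust estimator changes nothing else in the loop, which is the sense in which the method is a meta-algorithm.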

Robust Descent Condition

The Robust Descent Condition (eq. 1) provides an upper bound on the inner product between the gradient estimation error $\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}})$ and the displacement $\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}$ from the population optimum. This is a natural notion to control the potential progress obtained by using a robust gradient update instead of the population gradient.

Definition 3.1 ( $(\alpha,\psi)$ -Robust Descent Condition (RDC)).

For the population gradient ${{\bm{G}}}$ at ${\bm{\beta}}$ , a robust gradient estimator $\widehat{\bm{G}}({\bm{\beta}})$ satisfies the robust descent condition if for any sparse ${\bm{\beta}},\widetilde{{\bm{\beta}}}\in\operatorname{\mathbb{R}}^{d}$ ,

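$$\left\lvert\left\langle{\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})},{\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}}\right\rangle\right\rvert\leq\Big(\alpha\left\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\right\rVert_{2}+\psi\Big)\left\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{2}.\tag{1}$$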

We begin with a Meta-Theorem for Algorithm 1 that holds under the Robust Descent Condition (Definition 3.1) and the assumptions on the population risk (Definition 2.3). In Theorem 3.1, we prove Algorithm 1’s global convergence and its statistical guarantees. The proofs are collected in Appendix B.

Theorem 3.1 (Meta-Theorem).

Suppose we observe samples from a statistical model with population risk $f$ satisfying Definition 2.3. If a robust gradient estimator satisfies the $(\alpha,\psi)$-Robust Descent Condition (Definition 3.1) where $\alpha\leq\frac{1}{32}\mu_{\alpha}$, then Algorithm 1 with $\eta=1/\mu_{L}$ outputs $\widehat{{{\bm{\beta}}}}$ such that $\lVert\widehat{{{\bm{\beta}}}}-{\bm{\beta}}^{*}\rVert_{2}=O(\psi/\mu_{\alpha})$, by setting $T={O}\left(\rho\log\left(\mu_{\alpha}{\left\lVert{\bm{\beta}}^{*}\right\rVert_{2}}/\psi\right)\right)$.

We note that Theorem 3.1 is deterministic in nature. In the sequel, we omit the log term in the sample complexity due to sample splitting. We obtain high probability results via certifying that the RDC holds for certain robust gradient estimators under various statistical models. To obtain the minimax estimation error rate in Theorem 3.1, the key step is providing a robust gradient estimator with sufficiently small $\psi$ in the definition of the RDC.

Section 4 uses the RDC and Theorem 3.1 to obtain new results for sparse regression under heavy tails or arbitrary corruption. Before we move to this, we observe that we can use the RDC and Theorem 3.1 to recover existing results in the literature. Some immediate examples are as follows:

**Uncorrupted gradient satisfies the RDC.** Suppose the samples follow from sparse linear regression with sub-Gaussian covariates and noise $\mathcal{N}(0,\sigma^{2})$. The empirical average of gradient samples satisfies eq. 1 with $\psi=O(\sigma\sqrt{k\log(d)/n})$, by assuming an $\ell_{1}$ constraint on ${{\bm{\beta}}}$ and $\widetilde{{\bm{\beta}}}$ [LW11]. Plugging this $\psi$ into Theorem 3.1 recovers the well-known minimax rates for sparse linear regression [RWY11].

**RSGE implies RDC.** When $\bm{\Sigma}=\bm{I}_{d}$ or is sparse, [BDLS17] and [LSLC18], respectively, provide robust sparse gradient estimators (RSGE) which upper bound $\lVert\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})\rVert_{2}\leq\alpha\left\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\right\rVert_{2}+\psi$, for a constant fraction $\epsilon$ of corrupted samples. Noting that $\lvert\langle{\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})},{\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}}\rangle\rvert\leq\lVert\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})\rVert_{2}\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{2}$, we observe that RSGE implies RDC. Hence any RSGE can be used in Algorithm 1. The RSGE for $\bm{\Sigma}=I$ in [BDLS17] guarantees an RDC with $\psi=O(\sigma\epsilon)$ when $n=\Omega(k^{2}\log d/\epsilon^{2})$, and the RSGE for unknown sparse $\bm{\Sigma}$ from [LSLC18] guarantees $\psi=O(\sigma\sqrt{\epsilon})$ when $n=\Omega(k^{2}\log d/\epsilon)$. Again, plugging these values for $\psi$ into our theorem recovers the results in those papers. (Footnote 3: It remains an open question to obtain an RSGE for a constant fraction of outliers for robust sparse regression with arbitrary covariance $\bm{\Sigma}$.)

4 Main Results: Using the RDC and Algorithm 1

In the remainder of our paper, we use Theorem 3.1 and the RDC to analyze two well-known and computationally efficient robust mean estimation subroutines that have been used in the low-dimensional setting: the trimmed mean estimator and the MOM estimator. We show that these two can obtain a sufficiently small $\psi$ in the definition of the RDC. This leads to the minimax estimation error in the case of arbitrary corruptions or heavy tails.

4.1 Gradient estimation

The trimmed mean and MOM estimators have been successfully applied to robustify gradient descent [YCRB18a, PSBR18] in the low dimensional setting. They have not been used in the high dimensional regime, however, because until now we have not had the machinery to analyze their algorithmic convergence, statistical rates and minimax optimality in the high dimensional setting.

To show they satisfy the RDC with a sufficiently small $\psi$, we observe that by using Hölder’s inequality on the LHS of eq. 1, we have $\lvert\langle{\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})},{\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}}\rangle\rvert\leq\lVert{\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})}\rVert_{\infty}\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{1}.$ Using Algorithm 1, the Hard Thresholding step enforces sparsity of $\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}$. Therefore, controlling $\psi$ amounts to bounding the infinity norm of the gradient estimation error $\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}})$.
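The role of sparsity in this step can be spelled out. As a sketch, assume $\widetilde{{\bm{\beta}}}$ has at most $k^{\prime}$ nonzero entries (as an output of $P_{k^{\prime}}$) and ${\bm{\beta}}^{*}$ has at most $k$, so that $\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}$ has at most $k+k^{\prime}$ nonzero entries. Since $\lVert\bm{u}\rVert_{1}\leq\sqrt{s}\lVert\bm{u}\rVert_{2}$ for any $s$-sparse $\bm{u}$ (Cauchy-Schwarz on the support), the Hölder bound becomes

$$\left\lvert\left\langle\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}}),\ \widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rangle\right\rvert\leq\lVert\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}})\rVert_{\infty}\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{1}\leq\sqrt{k+k^{\prime}}\,\lVert\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}})\rVert_{\infty}\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{2}.$$

Comparing with eq. 1, it suffices to establish $\sqrt{k+k^{\prime}}\,\lVert\widehat{\bm{G}}({\bm{\beta}})-{\bm{G}}({\bm{\beta}})\rVert_{\infty}\leq\alpha\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}+\psi$, which involves only the sparsity levels and not the ambient dimension $d$ beyond what enters the infinity-norm bound.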

In Section 4.2, we show that by using coordinate-wise robust mean estimation, we can certify the RDC with sufficiently small $\psi$ to guarantee minimax rates. Specifically, we show this for the trimmed gradient estimator for arbitrary corruption, and the MOM gradient estimator for heavy tailed distributions.

Definition 4.1.

Given gradient samples $\{\nabla\ell_{i}({\bm{\beta}};\bm{z}_{i})\in\operatorname{\mathbb{R}}^{d},i\in\mathcal{S}\}$, for each dimension $j\in[d]$,

$(\spadesuit)$ : Trimmed gradient estimator removes the largest and smallest $\alpha$ fraction of elements in $\{[\nabla\ell_{i}({\bm{\beta}};\bm{z}_{i})]_{j}\in\operatorname{\mathbb{R}},i\in\mathcal{S}\}$ , and calculates the mean of the remaining terms. We choose $\alpha=c_{0}\epsilon$ for constant $c_{0}\geq 1$ , and require $\alpha\leq 1/2-c_{1}$ for a small constant $c_{1}>0$ .

$(\clubsuit)$ : MOM gradient estimator partitions $\mathcal{S}$ into $4.5\lceil\log(d)\rceil$ blocks, computes the sample mean of $\{[\nabla\ell_{i}({\bm{\beta}};\bm{z}_{i})]_{j}\in\operatorname{\mathbb{R}}\}$ within each block, and then takes the median of these means. (Footnote 4: Without loss of generality, we assume the number of blocks divides $n$; the choice $4.5\lceil\log(d)\rceil$ follows [HS16].)
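The two estimators of Definition 4.1 can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' implementation: the function names `trimmed_mean`, `median_of_means`, and `robust_gradient` are hypothetical, the number of blocks is passed in as a parameter rather than fixed to $4.5\lceil\log(d)\rceil$, and leftover samples are dropped when the block count does not divide $n$.

```python
import math
import statistics


def trimmed_mean(values, alpha):
    """Mean of `values` after dropping the floor(alpha * n) smallest and largest entries."""
    n = len(values)
    m = int(math.floor(alpha * n))
    if 2 * m >= n:
        raise ValueError("alpha trims away every sample")
    kept = sorted(values)[m:n - m]
    return sum(kept) / len(kept)


def median_of_means(values, num_blocks):
    """Median of the sample means of `num_blocks` consecutive, equally sized blocks."""
    n = len(values)
    if not 1 <= num_blocks <= n:
        raise ValueError("num_blocks must lie between 1 and the sample size")
    size = n // num_blocks  # assumption: leftover samples are simply dropped
    means = [sum(values[b * size:(b + 1) * size]) / size for b in range(num_blocks)]
    return statistics.median(means)


def robust_gradient(grad_samples, estimator):
    """Apply a one-dimensional robust mean estimator to each coordinate j in [d]."""
    d = len(grad_samples[0])
    return [estimator([g[j] for g in grad_samples]) for j in range(d)]
```

For instance, with nine gradient samples equal to `[1.0, -1.0]` and one outlier `[1000.0, -1000.0]`, calling `robust_gradient(samples, lambda v: trimmed_mean(v, 0.1))` discards the outlier in both coordinates, whereas the plain average would be dominated by it.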

4.2 Statistical guarantees

In this section, we consider some typical models for general $M$ -estimation.

Model 4.1 (Sparse linear regression).

Samples $\bm{z}_{i}=(y_{i},\bm{x}_{i})$ are drawn from a linear model $P$: $y_{i}=\bm{x}_{i}^{\top}{\bm{\beta}}^{*}+\xi_{i}$, with ${\bm{\beta}}^{*}\in\operatorname{\mathbb{R}}^{d}$ being $k$-sparse. We assume that the $\bm{x}$’s are i.i.d. with normalized covariance matrix $\bm{\Sigma}$, with $\bm{\Sigma}_{jj}\leq 1$ for all $j$, and that the stochastic noise $\xi$ has mean $0$ and variance $\sigma^{2}$.

Model 4.2 (Sparse logistic regression).

Samples $\bm{z}_{i}=(y_{i},\bm{x}_{i})$ are drawn from a binary classification model $P$ , where the binary label $y_{i}\in\{-1,+1\}$ follows the conditional probability distribution $\Pr(y_{i}|\bm{x}_{i})={1}/({1+\exp(-y_{i}\bm{x}_{i}^{\top}{\bm{\beta}}^{*})})$ , with ${\bm{\beta}}^{*}\in B\subset\operatorname{\mathbb{R}}^{d}$ being $k$ -sparse. We assume that $\bm{x}$ ’s are i.i.d. with normalized covariance matrix $\bm{\Sigma}$ , where $\bm{\Sigma}_{jj}\leq 1$ for all $j$ .

To obtain the following corollaries, we first certify the RDC for a given robust gradient estimator over random ensembles with corruption or heavy tails, and then apply Theorem 3.1. We collect the results for gradient estimation in Appendix A, and the proofs of the corollaries in Appendix B.

Arbitrary corruption case.

Based on 3.1, we first provide concrete results for arbitrary corruption case 2.1, where the covariates and response variables in the authentic distribution $P$ are assumed to be sub-Gaussian.

Corollary 4.1.

Suppose we observe $n$ $\epsilon$ -corrupted (2.1) sub-Gaussian samples from sparse linear regression model (4.1). Under the condition $n=\Omega\left({\rho^{4}k\log d}\right)$ , and $\epsilon=O\Bigl{(}\frac{1}{\rho^{2}\sqrt{k}\log(nd)}\Bigr{)}$ , with probability at least $1-d^{-2}$ , Algorithm 1 with trimmed gradient estimator satisfies the RDC with $\psi=O(\rho\sigma\sqrt{k}({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}))$ , and thus 3.1 provides $\lVert\widehat{{{\bm{\beta}}}}-{\bm{\beta}}^{*}\rVert_{2}=O({\rho^{2}\sigma}({\epsilon\sqrt{k}\log(nd)}+{\sqrt{{k\log d}/{n}}})).$

Time complexity. Corollary 4.1 has a global linear convergence rate. In each iteration, computing the trimmed mean takes only $O(nd\log n)$ operations, a logarithmic overhead compared to normal gradient descent [Bub15].

Statistical accuracy and robustness. Compared with [CCM13, BDLS17], our statistical error rate is minimax optimal [RWY11, ZWJ14], and has no dependencies on $\left\lVert{\bm{\beta}}^{*}\right\rVert_{2}$ . Furthermore, the upper bound on $\epsilon$ is nearly independent of $d$ , which guarantees Algorithm 1’s robustness in high dimensions.

Corollary 4.2.

Suppose we observe $n$ $\epsilon$ -corrupted (2.1) sub-Gaussian samples from sparse logistic regression model (4.2). With probability at least $1-d^{-2}$ , Algorithm 1 with trimmed gradient estimator satisfies the RDC with $\psi=O(\rho\sqrt{k}({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}))$ , and thus 3.1 provides $\lVert\widehat{{{\bm{\beta}}}}-{\bm{\beta}}^{*}\rVert_{2}=O({\rho}^{2}({\epsilon\sqrt{k}\log(nd)}+{\sqrt{{k\log d}/{n}}}))$ .

Statistical accuracy and robustness. Under the sparse Gaussian linear discriminant analysis model (a typical example of 4.2), Algorithm 1 achieves the statistical minimax rate [LPR15, LYCR17].

Heavy-tailed distribution case.

We next turn to the heavy tailed distribution case 2.2.

Corollary 4.3.

Suppose we observe $n$ samples from sparse linear regression model (4.1) with bounded 4-th moment covariates. Under the condition $n=\Omega\left({\rho^{6}k\log d}\right)$, with probability at least $1-d^{-2}$, Algorithm 1 with MOM gradient estimator satisfies the RDC with $\psi=O({\rho^{3/2}\sigma}{\sqrt{{k\log d}/{n}}})$, and thus 3.1 provides $\lVert\widehat{{{\bm{\beta}}}}-{\bm{\beta}}^{*}\rVert_{2}=O({\rho^{5/2}\sigma}{\sqrt{{k\log d}/{n}}})$.

Time complexity. Similar to Corollary 4.1, Corollary 4.3 has global linear convergence. In each iteration, we use only $O(nd)$ operations, the same as normal gradient descent [Bub15].

Statistical accuracy. [LM16] uses Median-of-Means tournaments to deal with sparse linear regression with bounded moment assumptions for the covariates, and obtains near-optimal rates. We obtain similar rates; however, our algorithm is efficient, whereas Median-of-Means tournaments are not known to be computationally tractable. [FWZ16, Zhu17] deal with the same problem by truncating and shrinking the data to certify the RSC condition. Their results require boundedness of higher moments of the noise $\xi$, and the final error depends on $\left\lVert{\bm{\beta}}^{*}\right\rVert_{2}$. Our estimation error bounds exactly recover optimal sub-Gaussian bounds for sparse regression [NRWY12, Wai19], and moreover, we obtain exact recovery when the variance $\sigma^{2}$ of $\xi$ tends to $0$.

Corollary 4.4.

Suppose we observe $n$ samples from sparse logistic regression model (4.2). With probability at least $1-d^{-2}$ , Algorithm 1 with MOM gradient estimator satisfies the RDC with $\psi=O({\rho^{3/2}}{\sqrt{{k\log d}/{n}}})$ , and thus 3.1 provides $\lVert\widehat{{{\bm{\beta}}}}-{\bm{\beta}}^{*}\rVert_{2}=O({\rho^{5/2}}{\sqrt{{k\log d}/{n}}})$ .

4.3 Sparsity recovery and Gaussian graphical model estimation

We next demonstrate the sparsity recovery performance of Algorithm 1 for graphical model learning [MB06, Wai09, RWL10, RWRY11, BvdG11, HTW15]. Our sparsity recovery guarantees hold for both heavy tails and arbitrary corruption, though we only present results in the case of arbitrary corruption in this section.

We use $\mathrm{supp}({\bm{v}},k)$ to denote top $k$ indexes of ${\bm{v}}$ with the largest magnitude. Let ${\bm{v}}_{\mathrm{min}}$ denote the smallest absolute value of nonzero element of ${\bm{v}}$ . To control the false negative rate, 4.5 shows that under the ${\bm{\beta}}_{\mathrm{min}}$ -condition, $\mathrm{supp}(\widehat{{\bm{\beta}}},k)$ is exactly $\mathrm{supp}({{\bm{\beta}}^{*}})$ . The proofs are given in Appendix C. Sparsity recovery guarantee for sparse logistic regression is similar, and is omitted due to space constraints. Existing results on sparsity recovery for $\ell_{1}$ regularized estimators [Wai09, LSRC15] do not require the RSC condition, but instead require an irrepresentability condition, which is stronger. If $\epsilon\rightarrow 0$ , 4.5 has the same ${\bm{\beta}}_{\mathrm{min}}$ -condition as IHT for sparsity recovery [YLZ18].

Corollary 4.5.

Under the same condition as in 4.1, and a ${\bm{\beta}}_{\mathrm{min}}$ -condition on ${\bm{\beta}}^{*}$ , ${\bm{\beta}}_{\mathrm{min}}^{*}=\Omega({\rho^{2}\sigma}({\epsilon\sqrt{k}\log(nd)}+\sqrt{{k\log d}/{n}}))$ , Algorithm 1 with trimmed gradient estimator guarantees that $\mathrm{supp}(\widehat{{\bm{\beta}}},k)=\mathrm{supp}({{\bm{\beta}}^{*}})$ , with probability at least $1-d^{-2}$ .

We consider sparse precision matrix estimation for Gaussian graphical models. The sparsity pattern of its precision matrix $\bm{\Theta}=\bm{\Sigma}^{-1}$ matches the conditional independence relationships [KFB09, WJ08].

Model 4.3 (Sparse precision matrix estimation).

Under the contamination model 2.1, authentic samples $\{\bm{x}_{i}\}_{i=1}^{m}$ are drawn from a multivariate Gaussian distribution $\mathcal{N}(0,\bm{\Sigma})$ . We assume that each row of the precision matrix $\bm{\Theta}=\bm{\Sigma}^{-1}$ is $(k+1)$ -sparse – each node has at most $k$ edges.

For the uncorrupted samples drawn from the Gaussian graphical model, the neighborhood selection (NS) algorithm [MB06] solves a convex relaxation of the following sparsity constrained optimization to regress each variable against its neighbors

[TABLE]

where $x_{ij}$ denotes the $j$-th coordinate of $x_{i}\in\operatorname{\mathbb{R}}^{d}$, and $(j)$ denotes the index set $\{1,\cdots,j-1,j+1,\cdots,d\}$. Let $\bm{\theta}_{(j)}\in\operatorname{\mathbb{R}}^{d-1}$ denote the $j$-th column of $\bm{\Theta}$ with the diagonal entry removed, and let $\bm{\Theta}_{j,j}\in\operatorname{\mathbb{R}}$ denote the $j$-th diagonal element of $\bm{\Theta}$. Then, the sparsity pattern of $\bm{\theta}_{(j)}$ can be estimated through $\widehat{{\bm{\beta}}}_{j}$. Details on the connection between $\bm{\theta}_{(j)}$ and $\widehat{{\bm{\beta}}}_{j}$ are given in Appendix C.

However, given $\epsilon$-corrupted samples from the Gaussian graphical model, this procedure will fail [LHY+12b, WG17]. To address this issue, we propose Robust NS (Algorithm 2 in Appendix C), which robustifies Neighborhood Selection [MB06] by using Robust Hard Thresholding (with least square loss) to robustify eq. 2. Similar to 4.5, a $\bm{\theta}_{\mathrm{min}}$-condition guarantees consistent edge selection.

Corollary 4.6.

Under the same condition as in 4.1, and a $\bm{\theta}_{\mathrm{min}}$ -condition for $\bm{\theta}_{(j)}$ , $\bm{\theta}_{(j),\mathrm{min}}=\Omega({{\bm{\Theta}_{j,j}^{1/2}}\rho^{2}}({\epsilon\sqrt{k}\log(nd)}+\sqrt{{k\log d}/{n}}))$ , Robust NS (Algorithm 2) achieves consistent edge selection, with probability at least $1-d^{-1}$ .

Similar to 4.1, the fraction $\epsilon$ is nearly independent of the dimension $d$, which provides guarantees for Robust NS in high dimensions. Other Gaussian graphical model selection algorithms include GLasso [FHT08] and CLIME [CLL11]. The experimental details comparing robustified versions of these algorithms are presented in Section D.4.

5 Experiments

We provide the complete details for our experiment setup in Appendix D.

**Sparse regression with arbitrary corruption.** We generate samples from a sparse regression model (4.1) with a Toeplitz covariance $\bm{\Sigma}$. Here, the stochastic noise $\xi\sim\mathcal{N}(0,\sigma^{2})$, and we vary the noise level $\sigma^{2}$ in different simulations. We add outliers with $\epsilon=0.1$, and track the parameter error $\left\lVert{{\bm{\beta}}^{t}}-{\bm{\beta}}^{*}\right\rVert_{2}$ in each iteration. The left plot of Figure 2 shows Algorithm 1’s linear convergence, with the error curves flattening out at the final error level. Furthermore, Algorithm 1 can achieve machine precision when $\sigma^{2}=0$, which means exact recovery of ${\bm{\beta}}^{*}$.

**Sparse regression with heavy tails.** We consider a log-normal distribution (a typical example of heavy tails) in 4.1. More specifically, $\bm{x}_{i}=\sqrt{\bm{\Sigma}}\widetilde{\bm{x}}_{i}$ and $\xi_{i}=\sigma\widetilde{\xi}_{i}$. Here, $\bm{\Sigma}$ is the same Toeplitz covariance, and each entry of $\widetilde{\bm{x}}_{i}$ and $\widetilde{\xi}_{i}$ is distributed as $(Z-\operatorname{\mathbb{E}}Z)/\sqrt{\operatorname{\mathrm{Var}}(Z)}$, where $Z\sim\log\mathcal{N}(0,4)$. We fix $k,d,\sigma$, and vary the sample size $n$. For log-normal samples, we run Algorithm 1 with MOM and vanilla Lasso. We then re-generate standard Gaussian samples using the same dimensions with $\bm{\Sigma}$ and run vanilla Lasso. Each curve in the right plot of Figure 2 is the average of 50 trials. Algorithm 1 with MOM significantly improves on vanilla Lasso on log-normal data, and has the same performance as Lasso on sub-Gaussian data.
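The standardized log-normal entries described above can be generated with the Python standard library. The sketch below reads $\log\mathcal{N}(0,4)$ as "the logarithm of $Z$ is Gaussian with mean $0$ and variance $4$", which is an assumption about the notation; the function name `standardized_lognormal` is hypothetical, and multiplication by $\sqrt{\bm{\Sigma}}$ is left out.

```python
import math
import random


def standardized_lognormal(rng, log_var=4.0):
    """Draw (Z - E Z) / sqrt(Var Z), where log Z ~ N(0, log_var)."""
    s = math.sqrt(log_var)
    mean = math.exp(log_var / 2.0)                       # E Z for a log-normal variable
    var = (math.exp(log_var) - 1.0) * math.exp(log_var)  # Var Z for a log-normal variable
    z = rng.lognormvariate(0.0, s)
    return (z - mean) / math.sqrt(var)


# Example: one heavy-tailed covariate vector with d = 5 independent entries.
rng = random.Random(0)
x_tilde = [standardized_lognormal(rng) for _ in range(5)]
```

Each entry has mean zero and unit variance by construction, yet is bounded below by $-1/\sqrt{e^{4}-1}\approx-0.137$ and has a very heavy right tail, which is what makes the sample mean of the gradients unreliable in this experiment.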

**Real data experiments.** We next apply Algorithm 2 to a US equities dataset [LHY+12a, ZLR+12], which is heavy-tailed and has many outliers [dP18]. The dataset contains 1,257 daily closing prices of 452 stocks (variables). It is well known that stocks from the same sector tend to be clustered together [Kin66]. Therefore, we use Robust NS (Algorithm 2) to construct an undirected graph among stocks. Graphs estimated by different algorithms are shown in Figure 2. We can see that stocks from the same sector are clustered together, and these clustering centers can be easily identified. We also compare Algorithm 2 to the baseline NS approach (as in the ideal setting). We observe that stocks from Information Technology (colored purple) are much better clustered by Algorithm 2.

Notations in Appendix.

In our proofs, the exponent $-10$ in tail bounds is arbitrary, and can be changed to any other larger constant without affecting the results. $\{c_{j}\}_{j=0}^{3}$ denote universal constants, and they may change from line to line.

Appendix A Proofs for the gradient estimators

In Robust Hard Thresholding (Algorithm 1), we use the trimmed gradient estimator or the MOM gradient estimator. In Theorem 3.1, the key quantity controlling the statistical rates of convergence is the Robust Descent Condition (Definition 3.1).

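By Hölder’s inequality, we have

$$\left\lvert\left\langle{\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})},{\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}}\right\rangle\right\rvert\leq\lVert{\widehat{\bm{G}}({\bm{\beta}})-{{\bm{G}}}({\bm{\beta}})}\rVert_{\infty}\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{1}.$$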

In this section, we provide one direct route to certifying the Robust Descent Condition, by bounding the infinity norm of the error of the robust gradient estimate (A.1 and A.2).

Later, in Appendix B, we will leverage A.1 and A.2 in verifying the Robust Descent Condition for trimmed/MOM gradient estimator under sparse linear/logistic regression. Together with 3.1, this will complete 4.1 – 4.4.

Proposition A.1.

Suppose we observe $n$ $\epsilon$ -corrupted sub-Gaussian samples (2.1). With probability at least $1-{d^{-3}}$ , the coordinate-wise trimmed gradient estimator can guarantee

• $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left(\sqrt{\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}^{2}+\sigma^{2}}\left({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}\right)\right)$ for sparse linear regression (4.1);

• $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}\right)$ for sparse logistic regression (4.2).

Proposition A.2.

Suppose we observe $n$ samples from the heavy tailed model with bounded 4-th moment covariates. With probability at least $1-{d^{-3}}$ , the coordinate-wise Median of Means gradient estimator can guarantee

• $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left(\sqrt{\rho^{2}\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}^{2}+\rho\sigma^{2}}\sqrt{{\log d}/{n}}\right)$ for sparse linear regression;

• $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left(\sqrt{\rho{\log d}/{n}}\right)$ for sparse logistic regression.

A.1 Proofs for the MOM gradient estimator

We first prove Proposition A.2. Proposition A.1, for the trimmed gradient estimator with $\epsilon$-corrupted sub-Gaussian samples, has the same dependency on $\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}$. The proof of A.1 leverages standard concentration bounds for sub-Gaussian samples, and then uses trimming to control the effect of outliers.

Proof of A.2.

For $\ell_{2}$ loss function, we have ${\bm{g}}({\bm{\beta}})=\bm{x}({\bm{x}}^{\top}{{\bm{\beta}}}-y)$ , where we omit the subscript $i$ in the proof. We denote $\Delta\coloneqq{\bm{\beta}}-{\bm{\beta}}^{*}$ , and bound the operator norm of the covariance of gradient samples

[TABLE]

where (i) follows from Hölder’s inequality, and (ii) follows from the 4-th moment bound assumption.

Hence, by using coordinate-wise Median of Means gradient estimator, we have

[TABLE]

with probability at least $1-{d^{-4}}$, where (i) follows from Proposition 5 in [HS16]. Applying a union bound over all $d$ coordinates, we have $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left(\sqrt{\rho^{2}\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}^{2}+\rho\sigma^{2}}\sqrt{{\log d}/{n}}\right)$ with probability at least $1-{d^{-3}}$.

For logistic loss, the gradient can be computed as: ${\bm{g}}=\frac{-y\bm{x}}{1+\exp\left(y\bm{x}^{\top}{\bm{\beta}}\right)},$ where we omit the subscript $i$ in the proof.

Since $y\in\{-1,+1\}$ , and ${1+\exp\left(y\bm{x}^{\top}{\bm{\beta}}\right)}\geq 1$ , we directly have $\lVert{\operatorname{\mathbb{E}}({\bm{g}}-{\bm{G}})({\bm{g}}-{\bm{G}})^{\top}}\rVert_{\rm op}\leq\left\lVert\bm{\Sigma}\right\rVert_{\rm op}.$ Similar to the case of $\ell_{2}$ loss, we have $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left(\sqrt{\rho{\log d}/{n}}\right)$ , with probability at least $1-{d^{-3}}$ .

∎

A.2 Proofs for the trimmed gradient estimator

We then turn to the trimmed gradient estimator for the case of arbitrary corruption. Before we proceed to the trimmed estimator, let us first visit the definition and tail bounds of sub-exponential random variable, as it will be used in sparse linear regression, where the gradient’s distribution is indeed sub-exponential under the sub-Gaussian assumptions in 4.1.

We first present standard concentration inequalities ([Wai19]).

Definition A.1 (Sub-exponential random variables).

A random variable $X$ with mean $\mu$ is sub-exponential if there is a non-negative parameter $\nu$ such that

[TABLE]

Lemma A.1 (Bernstein’s inequality).

Suppose that $X_{i},i=1,\cdots,n$ , are i.i.d. sub-exponential random variables with parameters $\nu$ . Then

[TABLE]

We also have a two-sided tail bound

[TABLE]

We define $\alpha$ -trimmed mean estimator for one dimensional samples, and denote it as ${\sf trmean}_{\alpha}(\cdot)$ .

Definition A.2 ( $\alpha$ -trimmed mean estimator).

Given a set of $\epsilon$ -corrupted samples $\{\bm{z}_{i}\in\operatorname{\mathbb{R}},i\in\mathcal{S}\}$ , the coordinate-wise trimmed mean estimator ${\sf trmean}_{\alpha}(\cdot)$ removes the largest and smallest $\alpha$ fraction of elements in $\{\bm{z}_{i}\in\operatorname{\mathbb{R}},i\in\mathcal{S}\}$ , and calculate the mean of the remaining terms. We choose $\alpha=c_{0}\epsilon$ , for a constant $c_{0}\geq 1$ . We also require that $\alpha\leq 1/2-c_{1}$ , for some small constant $c_{1}>0$ .

A.2 shows the guarantees for this robust gradient estimator in each coordinate. We note that A.2 is stronger than guarantees for trimmed mean estimator (Lemma 3) in [YCRB18a]. In our contamination model 2.1, the adversary may delete $\epsilon$ -fraction of authentic samples, and then add arbitrary outliers. And A.2 provides guarantees for trimmed mean estimator on sub-exponential random variables. The trimmed mean estimator is robust enough, that it allows the adversary to arbitrarily remove $\epsilon$ -fraction of data points. We use ${\mathcal{G}}^{j}$ to denote the $\operatorname{\mathbb{R}}^{1}$ samples at the $j$ -th coordinate of ${\mathcal{G}}$ . We can also define $\mathcal{S}^{j}$ in the same way.

Lemma A.2.

Suppose we observe $n=\Omega(\log d)$ $\epsilon$ -corrupted samples from 2.1. For each dimension $j\in\{1,2,\cdots,d\}$ , we assume the samples in ${\mathcal{G}}^{j}$ are i.i.d. $\nu$ -sub-exponential with mean $\bm{\mu}^{j}$ . After the contamination, we have the $j$ -th $\operatorname{\mathbb{R}}^{1}$ samples as $\mathcal{S}^{j}$ . Then, we can guarantee the trimmed mean estimator on $j$ -th dimension that

[TABLE]

with probability at least $1-{d^{-4}}$ .

We leave the proof of A.2 at the end of this section. Then, we present analysis of trimmed gradient estimator for sparse linear regression and sparse logistic regression by using A.2. For sparse linear regression model with sub-Gaussian covariates, the distribution of authentic gradients are sub-exponential instead of sub-Gaussian. More specifically, we first prove that when the current parameter iterate is ${\bm{\beta}}$ , the sub-exponential parameter of all authentic gradient is $O((\left\lVert\Delta\right\rVert_{2}^{2}+\sigma^{2})^{1/2})$ , where $\Delta\coloneqq{\bm{\beta}}-{\bm{\beta}}^{*}$ .

To gain some intuition for this, consider the sparse linear equation problem, where $\sigma^{2}=0$. When ${\bm{\beta}}={\bm{\beta}}^{*}$ (that is, $\left\lVert\Delta\right\rVert_{2}^{2}=0$), we exactly recover ${\bm{\beta}}^{*}$, and all stochastic gradients of authentic samples are zero vectors, as all observations are noiseless. In this case the sub-exponential parameter is clearly $0$.

Proof of A.1.

For any ${\bm{\beta}}$ , the gradient for one sample can be written as

$${\bm{g}}=\bm{x}\left(\bm{x}^{\top}{\bm{\beta}}-y\right),$$

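where we omit the subscript $i$ in the proof. For any fixed standard basis vector ${\bm{v}}\in\mathbb{S}^{d-1}$, and with $\Delta={\bm{\beta}}-{\bm{\beta}}^{*}$, substituting $y=\bm{x}^{\top}{\bm{\beta}}^{*}+\xi$ gives

$${\bm{v}}^{\top}{\bm{g}}={\bm{v}}^{\top}\bm{x}\left(\bm{x}^{\top}\Delta-\xi\right).$$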

To characterize the tail bounds of ${\bm{v}}^{\top}{\bm{g}}$ , we study the moment generating function:

[TABLE]

We denote $\gamma\in\{-1,+1\}$ as a Rademacher random variable, which is independent of $\bm{x}$ and $\xi$ . Then we can use a standard symmetrization technique [Wai19],

[TABLE]

where $(i)$ follows from the exponential function’s power series expansion, and $(ii)$ follows from the independence of $\gamma$, together with the fact that all odd powers of $\gamma$ have zero mean.

By the Cauchy-Schwarz inequality, we have

[TABLE]

It is clear that $\xi$ is a sub-Gaussian random variable with parameter $\sigma$ . Since $\bm{x}\sim\mathcal{N}\left(0,\bm{\Sigma}\right)$ , we have ${\bm{v}}^{\top}\bm{x}\sim\mathcal{N}\left(0,{\bm{v}}^{\top}\bm{\Sigma}{\bm{v}}\right)$ . For any fixed standard basis vector ${\bm{v}}\in\mathbb{S}^{d-1}$ , we can conclude that ${\bm{v}}^{\top}\bm{x}$ is sub-Gaussian with parameter at most $1$ based on 4.1. By basic properties of sub-Gaussian random variables [Wai19], we have

[TABLE]

where $(i)$ follows from the fact that $\bm{x}^{\top}\Delta-\xi$ is the weighted summation of two independent sub-Gaussian random variables. Hence, we have

[TABLE]

where $(i)$ follows from $\left(4l\right)!\leq 2^{4l}\left(\left(2l\right)!\right)^{2}$ (proof by mathematical induction). When we have $f\left(t\right)={32te^{2}\sqrt{\left\lVert\Delta\right\rVert_{2}^{2}+\sigma^{2}}}<1$ , eq. 5 converges to $\frac{1}{1-f^{2}\left(t\right)}$ . Hence,

[TABLE]

That is, ${\bm{v}}^{\top}{\bm{g}}$ is a sub-exponential random variable. By choosing ${\bm{v}}$ as each standard basis vector of $\operatorname{\mathbb{R}}^{d}$, we see that each coordinate of the gradient has sub-exponential parameter $32\sqrt{2}e^{2}\sqrt{\left\lVert\Delta\right\rVert_{2}^{2}+\sigma^{2}}$.

Then, applying A.2 on this collection of corrupted sub-exponential random variables, we have

[TABLE]

with probability at least $1-{d^{-4}}$ .

Applying union bounds on eq. 6 for all $d$ indexes, we have

[TABLE]

with probability at least $1-{d^{-3}}$ .

In this subsection, we use A.2 to bound $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}$ for sparse logistic regression. The technique for sparse logistic regression is similar to linear regression. Since we can directly show the sub-Gaussian distribution of gradient in this case, applying A.2 leads to the bound for $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}$ .

Under the statistical model of sparse logistic regression, the gradient can be computed as:

$${\bm{g}}=\frac{-y\bm{x}}{1+\exp\left(y\bm{x}^{\top}{\bm{\beta}}\right)},$$

where we omit the subscript $i$ in the proof.

Since $y\in\{-1,+1\}$ , and ${1+\exp\left(y\bm{x}^{\top}{\bm{\beta}}\right)}\geq 1$ , then for any fixed standard basis vector ${\bm{v}}\in\mathbb{S}^{d-1}$ , ${\bm{v}}^{\top}{\bm{g}}$ is sub-Gaussian with parameter at most $1$ based on 4.2. Notice that $\nu$ -sub-Gaussian random variables are still $\nu$ -sub-exponential. Applying A.2 again, we have

[TABLE]

with probability at least $1-{d^{-4}}$ .

Applying union bounds on eq. 7 for all $d$ indexes, we have

[TABLE]

with probability at least $1-{d^{-3}}$ . ∎

A.3 Trimmed mean estimator for strong contamination model

Now, it only remains to prove A.2. The proof technique is as follows: even though an adversary may delete samples from ${\mathcal{G}}^{j}$, we can still show concentration inequalities for the remaining authentic $\operatorname{\mathbb{R}}^{1}$ samples (denoted $\widetilde{\mathcal{G}}^{j}$ in the proof). Then, we show that by using the trimmed mean estimator, either the abnormal outliers are removed, or their effect is controlled.

Proof of A.2.

Without loss of generality, we assume $\bm{\mu}=0$ throughout the proof.

For each dimension $j\in\{1,2,\cdots,d\}$, we can split the $j$-th one-dimensional samples as $\mathcal{S}^{j}=\widetilde{\mathcal{G}}^{j}\bigcup\mathcal{B}^{j}$. To study the performance of ${\sf trmean}_{\alpha}\{{x}_{i}:i\in\mathcal{S}^{j}\}$, we first show a concentration inequality for the sub-exponential variables in $\widetilde{\mathcal{G}}^{j}$, without worrying about removing points from ${\mathcal{G}}^{j}$. This part of our proof is similar to Lemma 4.5 in [DKK+16].

Concentration inequality for $\widetilde{\mathcal{G}}^{j}$

We consider the set $\{{x}_{i}:i\in{\mathcal{G}}^{j}\}$ in $\operatorname{\mathbb{R}}^{1}$ . Since $\widetilde{\mathcal{G}}^{j}$ is a subset of ${\mathcal{G}}^{j}$ , by triangle inequality we have,

[TABLE]

The first term $A_{1}$ is simply the average of i.i.d. sub-exponential random variables. By A.1, we have

[TABLE]

For the second term $A_{2}$, we wish to show that with probability $1-\tau$, there does not exist a subset $\widetilde{\mathcal{G}}^{j}$ for which $A_{2}$ is more than $\delta_{0}$. This event is equivalent to

[TABLE]

Let $\delta_{1}=\frac{1-\epsilon}{\epsilon}\delta_{0}$ . For one subset ${\mathcal{G}}^{j}\setminus\widetilde{\mathcal{G}}^{j}$ , by A.1, we have

[TABLE]

Then, we take union bounds over all possible ${\mathcal{G}}^{j}\setminus\widetilde{\mathcal{G}}^{j}$ , which have $\binom{n}{\epsilon n}$ events. Hence, the tail probability of $A_{2}$ can be bounded as

[TABLE]

where (i) follows from the fact that $\log\binom{n}{\epsilon n}=O(nH(\epsilon))$ for $n$ large enough, and $H(\cdot)$ is the binary entropy function. Choosing $\delta_{1}=c_{1}\nu\log(nd)$ , and hence $\delta_{0}=c_{1}\nu\epsilon\log(nd)$ , we have $\tau\leq c_{0}\exp(-c_{2}n\epsilon\log(nd))\leq c_{3}{d^{-10}}$ .
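The counting step in (i) can be checked numerically. The sketch below compares $\log\binom{n}{\epsilon n}$ with $nH(\epsilon)$, taking $H$ in nats; the helper names are hypothetical, and rounding $\epsilon n$ to the nearest integer is an assumption made only so that the binomial coefficient is defined.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in nats, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def log_subset_count_and_bound(n, eps):
    """Return (log C(n, m), n * H(m / n)) for m = round(eps * n) deleted samples."""
    m = round(eps * n)
    return math.log(math.comb(n, m)), n * binary_entropy(m / n)


# Example: n = 1000 samples with a 10% fraction deleted by the adversary.
log_count, entropy_bound = log_subset_count_and_bound(1000, 0.1)
```

The first returned value never exceeds the second, which is why a per-subset tail bound of order $\exp(-cn\epsilon\log(nd))$ survives the union bound over all $\binom{n}{\epsilon n}$ candidate deleted sets.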

Combining the analysis on $A_{1}$ and $A_{2}$ (eq. 9 and eq. 10), we have

[TABLE]

This completes the concentration bounds on $\left\lvert{\operatorname{\mathbb{E}}_{i\in_{u}\widetilde{\mathcal{G}}^{j}}x_{i}}\right\rvert$ for all possible samples in $\widetilde{\mathcal{G}}^{j}$ without worrying about sample removing.

Trimmed mean estimator for $\mathcal{S}^{j}$

Then, we can consider the contribution of each part in $\mathcal{S}^{j}=\widetilde{\mathcal{G}}^{j}\bigcup\mathcal{B}^{j}$. We denote the remaining set after trimming as $\mathcal{R}^{j}$, and the trimmed set as $\mathcal{T}^{j}$. Since we assume $\bm{\mu}=0$, we only need to bound $\left\lvert{{\sf trmean}_{\alpha}\{{x}_{i}:i\in\mathcal{S}^{j}\}}\right\rvert$, which is the empirical average of all samples in the remaining set $\{{x}_{i}:i\in\mathcal{R}^{j}\}$.

As $\mathcal{R}^{j}$ is the union of the two disjoint sets $\mathcal{B}^{j}\bigcap\mathcal{R}^{j}$ and $\widetilde{\mathcal{G}}^{j}\bigcap\mathcal{R}^{j}$, we have the following inequalities,

[TABLE]

For any $i\in\widetilde{\mathcal{G}}^{j}$ , by A.1, we have

[TABLE]

Applying a union bound for all samples, we can control the maximum magnitude for any $i\in\widetilde{\mathcal{G}}^{j}$ ,

[TABLE]

We can bound $B_{1}$ by applying eq. 11. For the trimmed good samples $\{i\in\widetilde{\mathcal{G}}^{j}\bigcap\mathcal{T}^{j}\}$ , we have $B_{2}\leq 2\alpha n\max_{i\in\widetilde{\mathcal{G}}^{j}}\left\lvert x_{i}\right\rvert$ . Since we choose $\alpha\geq\epsilon$ , we have $B_{3}\leq\epsilon n\max_{i\in\widetilde{\mathcal{G}}^{j}}\left\lvert x_{i}\right\rvert$ .

Putting together the pieces, and choosing $\alpha=c\epsilon$ for some universal constant $c\geq 1$ , we have

[TABLE]

with probability at least $1-d^{-4}$ . This completes the proof for A.2. ∎

Appendix B Statistical estimation via Robust Hard Thresholding

Here, we prove the Meta-Theorem 3.1 on the statistical estimation performance of Algorithm 1 under statistical models.

We first introduce a supporting Lemma on the property of hard thresholding operator.

Lemma B.1 (Lemma 1 in [LB18]).

We set $k^{\prime}$ in hard thresholding operator as $k^{\prime}=kc_{\rho}^{2}$ , where $c_{\rho}\geq 1$ , then we have

[TABLE]

Note that $c_{\rho}$ will be specified as $2\rho$ later in the proof, as we choose $k^{\prime}=4\rho^{2}k$ as in 2.3.
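For concreteness, the hard thresholding operator $P_{k^{\prime}}$ that Lemma B.1 refers to keeps the $k^{\prime}$ entries of largest magnitude and sets the rest to zero. A minimal standard-library sketch follows; the function name `hard_threshold` is hypothetical, and ties are broken in favor of the earlier index, which is an arbitrary choice.

```python
def hard_threshold(z, k):
    """Return a copy of z keeping only its k largest-magnitude entries."""
    if k <= 0:
        return [0.0] * len(z)
    # Indices sorted by decreasing magnitude; the sort is stable, so ties keep index order.
    order = sorted(range(len(z)), key=lambda j: abs(z[j]), reverse=True)
    keep = set(order[:k])
    return [z[j] if j in keep else 0.0 for j in range(len(z))]


# Example: keep the two largest-magnitude coordinates.
thresholded = hard_threshold([0.5, -3.0, 2.0, 0.1], 2)
```

In the proof of 3.1 the operator is applied with $k^{\prime}=4\rho^{2}k\geq k$, so the iterate is allowed to be less sparse than ${\bm{\beta}}^{*}$.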

Proof of 3.1.

We first study the objective function gap $f\left(\overline{\bm{\beta}}^{t}\right)-f\left({\bm{\beta}}^{*}\right)$. Since the population risk $f$ satisfies $\mu_{\alpha}$-strong convexity and $\mu_{L}$-smoothness (2.3), we have

[TABLE]

where (i) follows from the fact that ${\bm{\beta}}^{*},{\bm{\beta}}^{t-1}\in B$ , and $\mu_{\alpha}$ -strong convexity holds.

Combining these two inequalities, we obtain

[TABLE]

Expanding the last term, we also have

[TABLE]

For the term $T_{1}$, recall that $\overline{\bm{\beta}}^{t}$ is obtained from hard thresholding, $\overline{\bm{\beta}}^{t}=P_{k^{\prime}}\left({\bm{\beta}}^{t-1}-\eta\widehat{\bm{G}}\left({\bm{\beta}}^{t-1}\right)\right)$, so we apply B.1 with $\bm{z}={\bm{\beta}}^{t-1}-\eta\widehat{\bm{G}}\left({\bm{\beta}}^{t-1}\right)$:

[TABLE]

The term $T_{2}$ can be bounded by using eq. 1 in 3.1. We have

[TABLE]

with probability at least $1-d^{-3}$ .

We denote $\overline{\Delta}^{t}=\overline{\bm{\beta}}^{t}-{\bm{\beta}}^{*}$ and $\Delta^{t}={\bm{\beta}}^{t}-{\bm{\beta}}^{*}$. Since $\eta\mu_{\alpha}\geq\frac{1}{\mu_{L}}\cdot\mu_{\alpha}=\frac{1}{\rho}$, putting together the pieces, we have

[TABLE]

with probability at least $1-d^{-3}$. Since ${\bm{\beta}}^{*}$ is the population minimizer, $f\left(\overline{\bm{\beta}}^{t}\right)-f\left({\bm{\beta}}^{*}\right)\geq 0$. Hence, we have

[TABLE]

with probability at least $1-d^{-3}$ .

Notice that eq. 15 is a quadratic inequality in $\left\lVert\overline{\Delta}^{t}\right\rVert_{2}$, and we can use the root of eq. 15 to upper bound $\left\lVert\overline{\Delta}^{t}\right\rVert_{2}$:

[TABLE]

where (i) follows from the basic inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for non-negative $a,b$ .

We choose $c_{\rho}=2\rho$ , and this leads to $\sqrt{\left(1-\frac{1}{\rho}\right)/\left(1-\frac{1}{c_{\rho}}\right)}\leq 1-\frac{1}{4\rho}$ . Under the condition $\alpha\leq\frac{1}{32}\mu_{\alpha}$ , we have $\frac{2\alpha}{\mu_{L}\left(1-\frac{1}{c_{\rho}}\right)}\leq\frac{1}{8\rho}$ . Then,
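The two numerical claims above follow from elementary algebra. The short derivation below is a sketch that uses only $\rho=\mu_{L}/\mu_{\alpha}\geq 1$, $c_{\rho}=2\rho$, and the inequality $\sqrt{1-x}\leq 1-x/2$ for $x\in[0,1]$.

```latex
\begin{align*}
\frac{1-\frac{1}{\rho}}{1-\frac{1}{2\rho}}
  &= \frac{2\rho-2}{2\rho-1}
   = 1-\frac{1}{2\rho-1},
&\text{so}\quad
\sqrt{\frac{1-\frac{1}{\rho}}{1-\frac{1}{c_{\rho}}}}
  &\leq 1-\frac{1}{2(2\rho-1)}
   \leq 1-\frac{1}{4\rho}, \\
1-\frac{1}{c_{\rho}}
  &= 1-\frac{1}{2\rho}\geq\frac{1}{2},
&\text{so}\quad
\frac{2\alpha}{\mu_{L}\left(1-\frac{1}{c_{\rho}}\right)}
  &\leq \frac{4\alpha}{\mu_{L}}
   \leq \frac{4}{\mu_{L}}\cdot\frac{\mu_{\alpha}}{32}
   = \frac{1}{8\rho}.
\end{align*}
```

Adding the two bounds gives a per-iteration contraction factor of at most $1-\frac{1}{4\rho}+\frac{1}{8\rho}=1-\frac{1}{8\rho}$ on the error, up to the additive term in $\psi$.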

[TABLE]

Since ${\bm{\beta}}^{t+1}=\Pi_{B}\left(\overline{\bm{\beta}}^{t+1}\right)$ is the projection onto a convex set, by the property of Euclidean projection [Bub15], we have

[TABLE]

Together with eq. 17, eq. 16 establishes global linear convergence of $\Delta^{t}$ .

We apply a union bound on $T$ iterates. Since $1-Td^{-3}\geq 1-{d}^{-2}$ for sufficiently large $d$ , we have

[TABLE]

with probability at least $1-{d}^{-2}$ . Hence, we can achieve the final error

[TABLE]

by setting $T={O}\left(\rho\log\left(\frac{\mu_{\alpha}\left\lVert{\bm{\beta}}^{*}\right\rVert_{2}}{\psi}\right)\right)$ .

∎
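The iteration analyzed in this proof, $\overline{\bm{\beta}}^{t}=P_{k^{\prime}}({\bm{\beta}}^{t-1}-\eta\widehat{\bm{G}}({\bm{\beta}}^{t-1}))$, can be sketched end to end for the least-squares loss with the trimmed gradient estimator. This is a toy illustration under stated assumptions rather than the authors' code: all function names are hypothetical, the projection $\Pi_{B}$ is omitted (that is, $B=\operatorname{\mathbb{R}}^{d}$), the step size $\eta=1$ presumes a covariance close to the identity, and the synthetic corruption (overwriting 5% of the responses) is only one simple choice of outliers.

```python
import math
import random


def trimmed_mean(values, alpha):
    """Mean after dropping the int(alpha * n) smallest and largest values."""
    m = int(alpha * len(values))
    kept = sorted(values)[m:len(values) - m]
    return sum(kept) / len(kept)


def hard_threshold(z, k):
    """Keep the k largest-magnitude entries of z and zero out the rest."""
    keep = set(sorted(range(len(z)), key=lambda j: abs(z[j]), reverse=True)[:k])
    return [z[j] if j in keep else 0.0 for j in range(len(z))]


def robust_hard_thresholding(xs, ys, k_prime, alpha, eta=1.0, num_iters=30):
    """Least-squares loss; the projection onto B is omitted (B = R^d assumed)."""
    d = len(xs[0])
    beta = [0.0] * d
    for _ in range(num_iters):
        # Per-sample residuals x_i^T beta - y_i.
        res = [sum(a * b for a, b in zip(x, beta)) - y for x, y in zip(xs, ys)]
        # Coordinate-wise trimmed mean of the per-sample gradients x_i * residual_i.
        grad = [trimmed_mean([x[j] * r for x, r in zip(xs, res)], alpha)
                for j in range(d)]
        beta = hard_threshold([b - eta * g for b, g in zip(beta, grad)], k_prime)
    return beta


def demo(seed=0, n=300, d=20, sigma=0.1, eps=0.05):
    """Synthetic check: identity covariance, an eps fraction of responses overwritten."""
    rng = random.Random(seed)
    beta_star = [1.0, -1.0] + [0.0] * (d - 2)  # k = 2 nonzero coefficients
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [sum(a * b for a, b in zip(x, beta_star)) + sigma * rng.gauss(0.0, 1.0)
          for x in xs]
    for i in range(int(eps * n)):
        ys[i] = 100.0  # crude outliers in the response
    # k' = 4 * rho^2 * k with rho = 1 assumed for identity covariance.
    beta_hat = robust_hard_thresholding(xs, ys, k_prime=8, alpha=2 * eps)
    final_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta_star)))
    init_err = math.sqrt(sum(b ** 2 for b in beta_star))
    return init_err, final_err
```

Calling `demo()` returns the initial error $\lVert{\bm{\beta}}^{*}\rVert_{2}$ (the iteration starts from zero) and the error after 30 iterations. The trimming level is set to $\alpha=2\epsilon$, in line with the requirement $\alpha=c_{0}\epsilon$ with $c_{0}\geq 1$ in Definition 4.1.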

B.1 Sparse linear regression

Proof of Corollaries 4.1 and 4.3.

With Propositions A.1 and A.2 in hand, we prove the arbitrary corruption case (Corollary 4.1); the proof for the heavy-tailed case (Corollary 4.3) is similar. We evaluate the RDC (Definition 3.1) in Algorithm 1 for the trimmed gradient estimator. With probability at least $1-d^{-3}$ , we have

[TABLE]

where (i) follows from Hölder's inequality, (ii) follows from the sparsity of ${{\bm{\beta}}^{t-1}-{\bm{\beta}}^{*}}$ in Algorithm 1, and (iii) follows from plugging in Proposition A.1, which yields $\alpha=\sqrt{k^{\prime}+k}\left({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}\right),\psi=\sigma\sqrt{k^{\prime}+k}\left({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}\right)$ .
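The three steps can be sketched as the following chain, written in the notation of the RDC with $\widetilde{{\bm{\beta}}}$ the hard-thresholded update. This is a hedged reconstruction: absolute constants are omitted, and the form of the $\ell_{\infty}$ gradient bound in step (iii) is inferred from the stated values of $\alpha$ and $\psi$ rather than quoted from the paper.

$$
\left\lvert\left\langle\widehat{\bm{G}}-\bm{G},\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rangle\right\rvert
\overset{(i)}{\leq}\left\lVert\widehat{\bm{G}}-\bm{G}\right\rVert_{\infty}\left\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{1}
\overset{(ii)}{\leq}\sqrt{k^{\prime}+k}\left\lVert\widehat{\bm{G}}-\bm{G}\right\rVert_{\infty}\left\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{2}
$$

$$
\overset{(iii)}{\lesssim}\sqrt{k^{\prime}+k}\left(\epsilon\log(nd)+\sqrt{\frac{\log d}{n}}\right)\Big(\left\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\right\rVert_{2}+\sigma\Big)\left\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{2}.
$$

Step (ii) uses that a vector with at most $k^{\prime}+k$ nonzero entries satisfies $\lVert\bm{v}\rVert_{1}\leq\sqrt{k^{\prime}+k}\lVert\bm{v}\rVert_{2}$ . Matching the last line against $\left(\alpha\lVert{\bm{\beta}}-{\bm{\beta}}^{*}\rVert_{2}+\psi\right)\lVert\widetilde{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{2}$ gives the values of $\alpha$ and $\psi$ above.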

We apply a union bound on $T$ iterates, and $1-Td^{-3}\geq 1-{d}^{-2}$ for sufficiently large $d$ . The condition $\alpha\leq\frac{1}{32}\mu_{\alpha}$ in Theorem 3.1 can be achieved if

[TABLE]

Since $\rho=\mu_{L}/\mu_{\alpha}$ , and $\mu_{L}\geq 1$ , these conditions can be expressed as

[TABLE]

The final error can be expressed as

[TABLE]

∎

B.2 Sparse logistic regression

Proof of Corollaries 4.2 and 4.4.

We prove Corollary 4.2; the proof of Corollary 4.4 is similar. With probability at least $1-{d}^{-3}$ , we have

[TABLE]

where (i) follows from the proof of Corollary 4.1 by using $\lVert\widehat{\bm{G}}-{\bm{G}}\rVert_{\infty}=O\left(\sqrt{{\log d}/{n}}\right)$ in Proposition A.1, and $\alpha=0,\psi=\sqrt{k^{\prime}+k}\left({\epsilon\log(nd)}+\sqrt{{\log d}/{n}}\right)$ .

Similar to the proof in sparse linear regression, this final error can be expressed as

[TABLE]

∎

Appendix C Sparsity recovery and sparse precision matrix estimation

C.1 Sparsity recovery guarantee

As in the main text, we use $\mathrm{supp}({\bm{v}},k)$ to denote the indices of the $k$ entries of ${\bm{v}}$ with the largest magnitude. Let ${\bm{v}}_{\mathrm{min}}$ denote the smallest absolute value among the nonzero elements of ${\bm{v}}$ .
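As a concrete illustration of this notation, the following minimal NumPy sketch computes the top- $k$ support and the smallest nonzero magnitude. The function names `supp` and `v_min` and the example vectors are chosen here for illustration and are not from the paper's code.

```python
import numpy as np

def supp(v, k):
    """Indices of the k largest-magnitude entries of v, returned as a set."""
    v = np.asarray(v, dtype=float)
    if k <= 0:
        return set()
    # argsort is ascending, so the last k positions hold the largest |v_i|.
    return set(np.argsort(np.abs(v))[-k:].tolist())

def v_min(v):
    """Smallest absolute value among the nonzero entries of v."""
    v = np.asarray(v, dtype=float)
    return float(np.abs(v[v != 0]).min())

# Hypothetical true parameter and estimate, for illustration only.
beta_star = np.array([0.0, 2.0, 0.0, -0.5, 1.0])
beta_hat = np.array([0.1, 1.9, 0.0, -0.45, 1.05])

same_support = supp(beta_hat, 3) == supp(beta_star, 3)
print(same_support, v_min(beta_star))
```

Support recovery in the sense used below means that `supp(beta_hat, k)` coincides with the support of the true parameter.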

Proof of Corollary 4.5.

The sparsity recovery guarantee is similar to [YLZ18]. Since $\widehat{{\bm{\beta}}}$ is $k^{\prime}$ -sparse ( $k^{\prime}\geq k$ ) by the definition of the hard thresholding operator, we use $\widehat{{\bm{\beta}}}_{k}$ to denote $P_{k}(\widehat{{\bm{\beta}}})$ . We argue by contradiction. If $\mathrm{supp}(\widehat{{\bm{\beta}}},k)\neq\mathrm{supp}({{\bm{\beta}}^{*}})$ , then some index in the support of ${\bm{\beta}}^{*}$ is missing from the support of $\widehat{{\bm{\beta}}}_{k}$ , so the $\ell_{2}$ error of $\widehat{{\bm{\beta}}}_{k}$ is at least ${\bm{\beta}}_{\mathrm{min}}^{*}$ . Hence, ${\bm{\beta}}_{\mathrm{min}}^{*}\leq\left\lVert\widehat{{\bm{\beta}}}_{k}-{\bm{\beta}}^{*}\right\rVert_{2}\overset{(i)}{\leq}2\left\lVert\widehat{{\bm{\beta}}}-{\bm{\beta}}^{*}\right\rVert_{2}\overset{(ii)}{=}O\left({\rho}^{2}\sigma\left(\epsilon\sqrt{k}\log(nd)+\sqrt{\frac{k\log d}{n}}\right)\right)$ , where (i) follows from the triangle inequality and the definition of hard thresholding, $\lVert\widehat{{\bm{\beta}}}_{k}-{\bm{\beta}}^{*}\rVert_{2}\leq\lVert\widehat{{\bm{\beta}}}_{k}-\widehat{{\bm{\beta}}}\rVert_{2}+\lVert\widehat{{\bm{\beta}}}-{{\bm{\beta}}}^{*}\rVert_{2}\leq 2\lVert\widehat{{\bm{\beta}}}-{\bm{\beta}}^{*}\rVert_{2}$ , and (ii) follows from the statistical guarantee in Corollary 4.1.

This contradicts the ${\bm{\beta}}_{\mathrm{min}}$ -condition in Corollary 4.5, and hence we have the result in Corollary 4.5. ∎

C.2 Model selection for Gaussian graphical models

We now consider sparsity recovery for sparse precision matrix estimation, which is the content of Corollary 4.6. We first introduce the following notation for a Gaussian graphical model.

We use $\bm{x}_{i}$ to denote the $i$ -th sample of the Gaussian graphical model, and $X_{j}$ to denote the $j$ -th random variable. Let $(j)$ be the index set $\{1,\cdots,j-1,j+1,\cdots,d\}.$ We use $\bm{\Sigma}_{(j)}=\bm{\Sigma}_{(j),(j)}\in\operatorname{\mathbb{R}}^{(d-1)\times(d-1)}$ to denote the sub-matrix of the covariance matrix $\bm{\Sigma}$ with both the $j$ -th row and the $j$ -th column removed, and use $\bm{\sigma}_{(j)}\in\operatorname{\mathbb{R}}^{d-1}$ to denote $\bm{\Sigma}$ ’s $j$ -th column with the diagonal entry removed. Also, we use $\bm{\theta}_{(j)}\in\operatorname{\mathbb{R}}^{d-1}$ to denote $\bm{\Theta}$ ’s $j$ -th column with the diagonal entry removed, and $\bm{\Theta}_{j,j}\in\operatorname{\mathbb{R}}$ to denote the $j$ -th diagonal element of $\bm{\Theta}$ .

By a basic probability computation, for each $j=1,\cdots,d$ , the variable $X_{j}$ conditioned on $\bm{X}_{(j)}$ follows a Gaussian distribution $\mathcal{N}(\bm{X}_{(j)}^{\top}\bm{\Sigma}_{(j)}^{-1}\bm{\sigma}_{(j)},1-\bm{\sigma}_{(j)}^{\top}\bm{\Sigma}_{(j)}^{-1}\bm{\sigma}_{(j)})$ . Then we have the linear regression formulation $X_{j}=\bm{X}_{(j)}^{\top}{\bm{\beta}}_{j}+\xi_{j}$ , where ${\bm{\beta}}_{j}=\bm{\Sigma}_{(j)}^{-1}\bm{\sigma}_{(j)}$ and $\xi_{j}\sim\mathcal{N}(0,1-\bm{\sigma}_{(j)}^{\top}\bm{\Sigma}_{(j)}^{-1}\bm{\sigma}_{(j)})$ . By the definition of the precision matrix $\bm{\Theta}$ , we have ${\bm{\beta}}_{j}=-\bm{\theta}_{(j)}/\bm{\Theta}_{j,j}$ , and $\bm{\Theta}_{j,j}=1/\operatorname{\mathrm{Var}}(\xi_{j})$ . Thus for the $j$ -th variable, $\bm{\theta}_{(j)}$ and ${\bm{\beta}}_{j}$ have the same sparsity pattern. Hence, the sparsity pattern of $\bm{\theta}_{(j)}$ can be estimated through $\widehat{{\bm{\beta}}}_{j}$ by solving the optimization problem in eq. 2 (Neighborhood Selection in [MB06]).
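The two identities relating the regression coefficients to the precision matrix can be checked numerically at the population level. The sketch below uses a small, hypothetical precision matrix chosen only for illustration; it writes the noise variance as $\bm{\Sigma}_{j,j}-\bm{\sigma}_{(j)}^{\top}\bm{\Sigma}_{(j)}^{-1}\bm{\sigma}_{(j)}$ , which reduces to the expression above when $\bm{\Sigma}_{j,j}=1$ .

```python
import numpy as np

d = 6
# Hypothetical sparse, diagonally dominant (hence positive definite) precision matrix.
Theta = 2.0 * np.eye(d)
Theta[0, 1] = Theta[1, 0] = 0.6
Theta[2, 4] = Theta[4, 2] = -0.5
Sigma = np.linalg.inv(Theta)

def neighborhood_regression(Sigma, j):
    """Population regression of X_j on all remaining variables."""
    rest = [i for i in range(Sigma.shape[0]) if i != j]
    Sigma_rest = Sigma[np.ix_(rest, rest)]       # covariance with row/column j removed
    sigma_j = Sigma[rest, j]                     # j-th column without the diagonal entry
    beta_j = np.linalg.solve(Sigma_rest, sigma_j)
    noise_var = Sigma[j, j] - sigma_j @ beta_j   # conditional variance of X_j
    return rest, beta_j, noise_var

j = 0
rest, beta_j, noise_var = neighborhood_regression(Sigma, j)
theta_j = Theta[rest, j]

# beta_j = -theta_(j) / Theta_jj  and  Theta_jj = 1 / Var(xi_j)
print(np.allclose(beta_j, -theta_j / Theta[j, j]))
print(np.isclose(Theta[j, j], 1.0 / noise_var))
```

Because `beta_j` is a nonzero multiple of `theta_j`, the two vectors share the same support, which is what Neighborhood Selection exploits.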

In Algorithm 2, we robustify Neighborhood Selection by applying Robust Hard Thresholding (with $\ell_{2}$ loss and the trimmed gradient estimator) to eq. 2. In line 6, we use Robust Hard Thresholding to regress each variable against its neighbors. In line 9, the sparsity pattern of $\bm{\Theta}$ can be estimated by aggregating the neighborhood support sets of $\{\widehat{{\bm{\beta}}}_{j}\}_{j=1}^{d}$ via intersection or union. Similar to Corollary 4.5, a $\bm{\theta}_{\mathrm{min}}$ -condition guarantees consistent edge selection.
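The aggregation step can be sketched as follows. This is a minimal illustration, assuming the per-node supports are already available as Python sets; the function name `aggregate_edges` and the dictionary layout are hypothetical and not taken from Algorithm 2.

```python
def aggregate_edges(neighborhoods, rule="union"):
    """Combine per-node neighborhood supports into an undirected edge set.

    neighborhoods maps node j to the set of nodes selected when regressing
    X_j on the other variables. Under "union" an edge (i, j) is kept if either
    regression selects it; under "intersection" both regressions must select it.
    """
    if rule not in ("union", "intersection"):
        raise ValueError("rule must be 'union' or 'intersection'")
    edges = set()
    for j, nbrs in neighborhoods.items():
        for i in nbrs:
            selected_by_other = j in neighborhoods.get(i, set())
            if rule == "union" or selected_by_other:
                edges.add((min(i, j), max(i, j)))
    return edges

# Toy supports: node 0 selects {1, 2}, node 1 selects {0}, node 2 selects nothing.
toy = {0: {1, 2}, 1: {0}, 2: set()}
print(sorted(aggregate_edges(toy, "union")))
print(sorted(aggregate_edges(toy, "intersection")))
```

The union rule yields a denser graph than the intersection rule whenever the per-node regressions disagree.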

Proof of Corollary 4.6.

Algorithm 2 iteratively uses Algorithm 1 as a Neighborhood Selection approach for each variable. Hence, we can apply Corollary 4.5 for each variable, and the sparsity patterns are the same according to $\bm{\theta}_{(j)}=-{\bm{\beta}}_{j}/\operatorname{\mathrm{Var}}(\xi_{j})$ . The stochastic noise term $\sigma$ in sparse linear regression can be expressed as $1/\sqrt{\bm{\Theta}_{j,j}}$ . Hence, under the same condition as Corollary 4.1, for each $j\in[d]$ , we require a $\bm{\theta}_{\mathrm{min}}$ -condition for $\bm{\theta}_{(j)}$ , $\bm{\theta}_{(j),\mathrm{min}}=\Omega\left({{\bm{\Theta}_{j,j}^{1/2}}\rho^{2}}\left({\epsilon\sqrt{k}\log(nd)}+\sqrt{\frac{k\log d}{n}}\right)\right)$ .

Using a union bound, we conclude that Algorithm 2 is consistent in edge selection, with probability at least $1-d^{-1}$ . ∎

Appendix D Full experiments details

We study the empirical performance of Robust Hard Thresholding (Algorithm 1 and Algorithm 2), and present the complete details of the experimental setup for the experiments in Section 5.

D.1 Synthetic data – sparse linear models

We first consider the performance of Algorithm 1 under (generalized) linear models with $\epsilon$ -corrupted samples.

**Sparse linear regression.** In the first experiment, we consider an exact sparse linear regression model (Model 4.1). In this model, the stochastic noise is $\xi\sim\mathcal{N}(0,\sigma^{2})$ , and we vary the noise level $\sigma^{2}$ across simulations. We first generate authentic explanatory variables with parameters $k=5,d=1000,n=300$ , from a Gaussian distribution $\mathcal{N}(\bm{0}_{d},\bm{\Sigma})$ , where the covariance matrix $\bm{\Sigma}$ is a Toeplitz matrix with exponential decay $\bm{\Sigma}_{ij}=\exp(-|i-j|)$ . This design matrix is known to satisfy the RSC condition [RWY10], which meets the requirement of 4.1. The nonzero entries of the $k$ -sparse true parameter ${\bm{\beta}}^{*}$ are set to either $+1$ or $-1$ . Fixing the contamination level at $\epsilon=0.1$ , we set the covariates of the outliers to $A$ , where $A$ is a random $\pm 1$ matrix of dimension $\frac{\epsilon}{1-\epsilon}n\times d$ , and the responses of the outliers to $-A{\bm{\beta}}^{*}$ .
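The data-generating process described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `make_corrupted_regression`, the random placement of the support, the rounding of the outlier count, and the seed handling are all choices made here for illustration.

```python
import numpy as np

def make_corrupted_regression(n=300, d=1000, k=5, sigma=0.1, eps=0.1, seed=0):
    """Authentic Toeplitz-Gaussian samples plus +/-1 outliers with flipped responses."""
    rng = np.random.default_rng(seed)

    # Toeplitz covariance with exponential decay: Sigma_ij = exp(-|i - j|).
    idx = np.arange(d)
    Sigma = np.exp(-np.abs(idx[:, None] - idx[None, :]).astype(float))
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, d)) @ L.T        # rows distributed as N(0, Sigma)

    # k-sparse parameter with +/-1 entries (support location is an assumption).
    beta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    beta_star[support] = rng.choice([-1.0, 1.0], size=k)
    y = X @ beta_star + sigma * rng.standard_normal(n)

    # Outliers: eps / (1 - eps) * n extra rows, so they form an eps fraction overall.
    n_out = int(round(eps / (1.0 - eps) * n))
    A = rng.choice([-1.0, 1.0], size=(n_out, d))
    X_all = np.vstack([X, A])
    y_all = np.concatenate([y, -A @ beta_star])
    is_outlier = np.concatenate([np.zeros(n, dtype=bool), np.ones(n_out, dtype=bool)])
    return X_all, y_all, beta_star, is_outlier

X_all, y_all, beta_star, is_outlier = make_corrupted_regression(n=90, d=50, sigma=0.0)
print(X_all.shape, int(is_outlier.sum()))
```

The boolean mask `is_outlier` is returned only so that authentic samples can be singled out for evaluation; an estimator would not have access to it.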

To show the performance of Algorithm 1 under different noise levels determined by $\sigma^{2}$ , we track the parameter error $\left\lVert{{\bm{\beta}}^{t}}-{\bm{\beta}}^{*}\right\rVert_{2}$ in each iteration. In the left plot of Figure 3, Algorithm 1 shows linear convergence, and the error curves flatten out at the level of the final error, which is consistent with our theory. Furthermore, Algorithm 1 can achieve machine precision when $\sigma^{2}=0$ , which means exact recovery of ${\bm{\beta}}^{*}$ .
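The iteration being tracked here can be sketched as follows: a coordinate-wise trimmed mean of the per-sample gradients followed by hard thresholding. This is a simplified illustration and not the paper's Algorithm 1 verbatim. It assumes squared loss, an identity-covariance design, step size `eta=1.0`, trimming fraction `trim=0.1`, and it omits the projection step; all function names are hypothetical.

```python
import numpy as np

def trimmed_mean(G, trim):
    """Coordinate-wise trimmed mean of the rows of G (shape n x d)."""
    n = G.shape[0]
    m = int(np.floor(trim * n))          # number of values dropped from each tail
    G_sorted = np.sort(G, axis=0)        # sort every coordinate independently
    return G_sorted[m:n - m].mean(axis=0)

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def robust_hard_thresholding(X, y, k_prime, trim, eta=1.0, T=50):
    """Iterative hard thresholding driven by a trimmed gradient estimate."""
    beta = np.zeros(X.shape[1])
    for _ in range(T):
        residual = X @ beta - y
        per_sample_grads = X * residual[:, None]   # gradient of 0.5 * (x'beta - y)^2
        g_hat = trimmed_mean(per_sample_grads, trim)
        beta = hard_threshold(beta - eta * g_hat, k_prime)
    return beta

# Small noise-free demo with a 5% fraction of +/-1 outliers (illustrative sizes).
rng = np.random.default_rng(0)
n, d, k, eps = 400, 100, 3, 0.05
beta_star = np.zeros(d)
beta_star[:k] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_star
n_out = int(round(eps / (1 - eps) * n))
A = rng.choice([-1.0, 1.0], size=(n_out, d))
X_all = np.vstack([X, A])
y_all = np.concatenate([y, -A @ beta_star])

beta_hat = robust_hard_thresholding(X_all, y_all, k_prime=2 * k, trim=0.1)
err_init = np.linalg.norm(beta_star)               # error of the zero initialization
err_final = np.linalg.norm(beta_hat - beta_star)
print(err_init, err_final)
```

Recording `np.linalg.norm(beta - beta_star)` inside the loop would give the per-iteration error curve of the kind plotted in Figure 3.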

**Misspecified model.** For the second experiment, we use a sparse linear regression with model misspecification: the underlying authentic samples do not follow a linear model. We use the same Toeplitz covariates and true parameter ${\bm{\beta}}^{*}$ , but the corresponding $y_{i}$ ’s are calculated as $y_{i}=\sum_{j=1}^{d}\bm{x}_{ij}^{3}{\bm{\beta}}_{j}^{*}$ . Although this is a non-linear function, sparse linear regression on these authentic samples can still recover the support, as the cubic function is monotone and ${\bm{\beta}}^{*}$ is sparse. We generate outliers using the same distribution as in the first experiment, but with different fractions of corruptions $\epsilon$ .

For simplicity, we track the function evaluated on all authentic samples, $F({\bm{\beta}})=\sum_{i\in{\mathcal{G}}}(y_{i}-\bm{x}_{i}^{\top}{\bm{\beta}})^{2}$ . In the right plot of Figure 3, we show the performance of Algorithm 1 under different $\epsilon$ ; the oracle curve corresponds to running IHT only on the authentic samples. The right plot shows similar convergence under different values of the corrupted fraction $\epsilon$ , and demonstrates the robustness of Algorithm 1 without assuming an underlying linear model.

D.2 Robust $M$ -estimators via Robust Hard Thresholding

Classical robust $M$ -estimators [Loh17] (such as empirical risk minimization using Huber loss) are widely used in robust statistics in the case where the error distribution is heavy tailed or when there are arbitrary outliers only in the response variables. In the high dimensional setting, given $\epsilon$ -corrupted samples (Definition 2.1), we can use

[TABLE]

where $\ell_{i}({\bm{\beta}};\bm{z}_{i})$ can be chosen as Huber loss with parameter $\delta$ :

[TABLE]
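For concreteness, the standard Huber loss and its derivative can be sketched as below. The scaling used here (quadratic part $r^{2}/2$ , linear part $\delta\lvert r\rvert-\delta^{2}/2$ ) is the common textbook convention and is an assumption; the paper's display may use a different normalization. The function names are hypothetical.

```python
import numpy as np

def huber_loss(r, delta):
    """Huber loss of residual r: quadratic for |r| <= delta, linear beyond."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

def huber_grad(r, delta):
    """Derivative of the Huber loss in r: the residual clipped to [-delta, delta]."""
    return np.clip(np.asarray(r, dtype=float), -delta, delta)

def per_sample_gradients(X, y, beta, delta):
    """Gradient in beta of huber_loss(y_i - x_i' beta) for every sample (n x d)."""
    residual = y - X @ beta
    return -huber_grad(residual, delta)[:, None] * X

print(huber_loss([0.5, 3.0], delta=1.0), huber_grad([0.5, 3.0, -3.0], delta=1.0))
```

The clipped derivative bounds the influence of any single response, while a robust gradient estimator is still needed to handle corrupted covariates.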

[Loh17] studied robust $M$ -estimators in high dimensions, and proposed a composite optimization using $\left\lVert{\bm{\beta}}\right\rVert_{1}$ instead of $\left\lVert{\bm{\beta}}\right\rVert_{0}$ . They established a local convergence guarantee for this composite optimization procedure, using a local RSC condition in a neighborhood around ${\bm{\beta}}^{*}$ . Yet their results do not trivially extend to settings with arbitrarily corrupted covariates.

In our experiments, we use Huber loss in Robust Hard Thresholding to deal with heavy-tailed error distribution. In addition to heavy-tailed noise, $\epsilon$ -fraction of $\{y_{i},\bm{x}_{i}\}_{i=1}^{n}$ are still arbitrarily corrupted.

For the experiments, we use the same Toeplitz covariates and true parameter ${\bm{\beta}}^{*}$ as in the previous experiments on sparse linear models, with fixed dimension parameters $k=5,d=1000,n=300$ . The error distribution is a Cauchy distribution, which is a special case of model misspecification, as it does not meet the sub-Gaussian requirement in 4.1. For different contamination levels, we set the covariates of the outliers to $A$ , where $A$ is a random $\pm 1$ matrix of dimension $\frac{\epsilon}{1-\epsilon}n\times d$ , and the responses of the outliers to $-A{\bm{\beta}}^{*}$ .

Empirically, we observe linear convergence, as shown in Figure 4. This linear convergence result validates the local RSC condition proposed in [Loh17], and we can still achieve it even with an $\epsilon$ -fraction of corrupted covariates.

D.3 Sparse logistic regression

For the binary classification problem, we generate samples from a sparse LDA problem, where the distributions of the explanatory variables conditioned on the response variables follow multivariate Gaussian distributions with the same covariance matrix but different means.

We generate authentic samples $\bm{x}_{i}$ from a Gaussian distribution $\mathcal{N}(\bm{\mu}_{+},\bm{I}_{d})$ if $y_{i}=+1$ , and from another distribution $\mathcal{N}(\bm{\mu}_{-},\bm{I}_{d})$ if $y_{i}=-1$ . The parameters are fixed at $k=5,d=1000,n=300$ . We set $\bm{\mu}_{+}=\bm{1}_{d}+\bm{v}$ , where $\bm{v}$ is $k$ -sparse and its nonzero entries are set to either $+1/\sqrt{k}$ or $-1/\sqrt{k}$ , and we set $\bm{\mu}_{-}=\bm{1}_{d}-\bm{v}$ . The Bayes classifier is ${\bm{\beta}}^{*}=2\bm{v}$ . This is a special case of Model 4.2, and it is known that sparse logistic regression attains fast classification error rates [LPR15]. We then set the covariates of the outliers to $A$ , where $A$ is a matrix of dimension $\frac{\epsilon}{1-\epsilon}n\times d$ whose entries are random $\pm 3$ . The responses of the outliers follow the distribution $\Pr(y_{i}|\bm{x}_{i})={1}/({1+\exp(y_{i}\bm{x}_{i}^{\top}{\bm{\beta}}^{*})})$ , which is exactly the opposite of Model 4.2.
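The form of the Bayes classifier follows from the Gaussian log-likelihood ratio. The derivation below is a short sketch that assumes equal class priors, which the text does not state explicitly.

$$
\log\frac{p(\bm{x}\mid y=+1)}{p(\bm{x}\mid y=-1)}
=-\frac{1}{2}\left\lVert\bm{x}-\bm{\mu}_{+}\right\rVert_{2}^{2}+\frac{1}{2}\left\lVert\bm{x}-\bm{\mu}_{-}\right\rVert_{2}^{2}
=(\bm{\mu}_{+}-\bm{\mu}_{-})^{\top}\bm{x}-\frac{1}{2}\left(\left\lVert\bm{\mu}_{+}\right\rVert_{2}^{2}-\left\lVert\bm{\mu}_{-}\right\rVert_{2}^{2}\right)
=2\bm{v}^{\top}\bm{x}-2\,\bm{1}_{d}^{\top}\bm{v}.
$$

The decision rule is therefore linear in $\bm{x}$ with direction $2\bm{v}$ and, under this assumption, intercept $-2\,\bm{1}_{d}^{\top}\bm{v}$ ; the posterior $\Pr(y=+1\mid\bm{x})$ is the logistic function of this quantity.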

We run Algorithm 1 with logistic loss under different levels of outlier fraction $\epsilon$ . In the left plot of Figure 5, we observe linear convergence similar to that in sparse linear regression. This is consistent with Corollary 4.2 for sparse logistic regression, and it is clear that we cannot exactly recover ${\bm{\beta}}^{*}$ unless the number of samples $n$ is infinite.

We then compare Algorithm 1 with the Trimmed Lasso estimator for sparse logistic regression [YLA18]. Although they also use a trimming technique, their algorithm is quite different from Algorithm 1: we use a coordinate-wise trimmed mean estimator for gradients in hard thresholding, whereas they trim samples in each iteration according to each sample’s loss. Under the same sparse LDA model, we set $k=\sqrt{d},n=15k$ . In the simulation, we increase $d$ , and plot the classification error (averaged over 50 trials on an authentic test set) for $\epsilon=0.1$ and $\epsilon=0.2$ . The right plot of Figure 5 shows that Robust Hard Thresholding outperforms Trimmed Lasso.

D.4 Synthetic data – Gaussian graphical model

We generate Gaussian graphical model samples using the huge package [ZLR+12]. We choose the “cluster” sparsity pattern with the package's default clustering parameters: the number of clusters in the graph is $d/20$ , the probability that a pair of nodes within a cluster is connected is 0.3, and there are no edges between nodes in different clusters. The off-diagonal elements of the precision matrix are denoted by $v$ , which is the experiment parameter controlling the SNR.

We then add an additional $\frac{\epsilon}{1-\epsilon}$ fraction of samples sampled from another distribution. Following the experimental design in [YL15, WG17], each outlier is generated from a mixture of $d$ -dimensional Gaussian distributions $\frac{1}{2}\mathcal{N}(\bm{\mu}^{o},\bm{\Sigma}^{o})+\frac{1}{2}\mathcal{N}(-\bm{\mu}^{o},\bm{\Sigma}^{o})$ , where $\bm{\mu}^{o}=(1.5,1.5,\cdots,1.5)^{\top},\text{ and }\bm{\Sigma}^{o}=\bm{I}_{d}.$ We compare Algorithm 2 with other existing methods: Trimmed GLasso [YLA18], RCLIME [WG17], Skeptic [LHY+12b], and Spearman [LT18]. The latter two are based on robustifying the covariance matrix, and then using standard graphical model selection algorithms such as GLasso or CLIME. To directly compare these methods, we use CLIME for both of them.

To evaluate model selection performance, we use receiver operating characteristic (ROC) curves to compare our method to others over the full regularization paths. We generate regularization paths for the other robust algorithms by tuning the $\lambda$ in CLIME and GLasso. For Algorithm 2, we explicitly vary the sparsity level $k^{\prime}$ to generate the regularization path.
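One point of such an ROC curve can be computed from the true and estimated edge sets as sketched below. The function name `roc_point` and the boolean adjacency-matrix representation are assumptions made for illustration; sweeping $k^{\prime}$ (or $\lambda$ for the baselines) and collecting the resulting pairs traces out the curve.

```python
import numpy as np

def roc_point(support_true, support_est):
    """True and false positive rates of edge selection.

    Both inputs are symmetric boolean d x d adjacency matrices; only the strict
    upper triangle is used so that every undirected edge is counted once.
    """
    iu = np.triu_indices(support_true.shape[0], k=1)
    truth, est = support_true[iu], support_est[iu]
    tpr = (truth & est).sum() / max(truth.sum(), 1)
    fpr = (~truth & est).sum() / max((~truth).sum(), 1)
    return float(tpr), float(fpr)

def adjacency(d, edges):
    """Symmetric boolean adjacency matrix from a list of (i, j) pairs."""
    M = np.zeros((d, d), dtype=bool)
    for i, j in edges:
        M[i, j] = M[j, i] = True
    return M

# Toy graph on 4 nodes: two true edges; one is recovered and one is spurious.
true_graph = adjacency(4, [(0, 1), (2, 3)])
est_graph = adjacency(4, [(0, 1), (0, 2)])
print(roc_point(true_graph, est_graph))
```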

We set $\epsilon=0.1$ , and vary $(n,d)$ , and the SNR parameter $v$ for off-diagonal elements. We use different $(n,d)=(100,100),(200,200)$ . For different off-diagonal values, we set $v=0.3$ (Low SNR), and $v=0.6$ (High SNR). We show ROC curves to demonstrate model selection performance in Figure 6. For the entire regularization path, our algorithm (denoted as Robust NS) has a better ROC compared to other algorithms.

In particular, Robust NS outperforms the other methods, with a higher true positive rate when the false positive rate is small. This is the regime where we use a smaller hard thresholding sparsity in Algorithm 2, and a larger regularization parameter for $\left\lVert\Theta\right\rVert_{1}$ in the other methods based on GLasso and CLIME. This validates our theory in Corollary 4.6, which guarantees sparsity recovery when the hard thresholding hyper-parameter $k^{\prime}$ is suitably chosen to match ${\bm{\beta}}^{*}$ ’s sparsity $k$ .

D.5 Real data experiments

Here, we present details of the experiment using US equities data [ZLR+12]. We preprocess the data by taking a log-transformation and calculating the corresponding daily returns. Obvious outliers are removed by winsorizing each variable so that all samples are within five times the winsorized standard deviation from the winsorized mean. After preprocessing, we present example histograms and QQ plots from the Information Technology sector. In Figure 7, we show the histograms of two typical companies in this sector. As we can see from Figure 8, even after preprocessing, these stock prices are still highly non-normal and heavy tailed. We do not add any manual outliers, as financial data is already heavy tailed and has many outliers [dP18]. We also compare Algorithm 2 with the baseline NS approach (without consideration for corruptions or outliers).
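The preprocessing can be sketched as follows. This is a simplified, single-pass illustration: it clips each variable at five ordinary standard deviations from the ordinary mean, whereas the text refers to winsorized statistics, whose exact computation is not specified here. The function names and the synthetic price array are hypothetical.

```python
import numpy as np

def log_returns(prices):
    """Daily log returns from a (days x stocks) array of positive prices."""
    return np.diff(np.log(prices), axis=0)

def winsorize(returns, width=5.0):
    """Clip every column to [mean - width * std, mean + width * std].

    Simplification: uses the plain column mean and standard deviation in one
    pass rather than winsorized statistics.
    """
    mu = returns.mean(axis=0)
    sd = returns.std(axis=0)
    return np.clip(returns, mu - width * sd, mu + width * sd)

# Synthetic prices for illustration only: a random walk with one extreme jump.
rng = np.random.default_rng(0)
steps = 0.01 * rng.standard_normal((250, 3))
steps[100, 0] = 2.0                                  # artificial extreme return
prices = 100.0 * np.exp(np.cumsum(steps, axis=0))

raw = log_returns(prices)
clean = winsorize(raw)
print(raw.max(), clean.max())
```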

We limit the number of edges to 2,000 for both methods. The cluster colored by purple denotes the Information Technology sector. In Figure 9, we can easily separate different clusters by using Robust NS. However, the Vanilla NS approach cannot distinguish the sector Information Technology (purple). Furthermore, we can observe that stocks from Information Technology (colored by purple) are much better clustered by Algorithm 2.

Bibliography

[ACG13] Andreas Alfons, Christophe Croux, and Sarah Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, pages 226–248, 2013.
[AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[ANW12] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Statist., 40(5):2452–2482, 2012.
[BD09] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
[BDMS13] Afonso S. Bandeira, Edgar Dobriban, Dustin G. Mixon, and William F. Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.
[BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
[BJK17] Kush Bhatia, Prateek Jain, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
