Robust multivariate mean estimation: the optimality of trimmed mean

Gabor Lugosi; Shahar Mendelson

arXiv:1907.11391·math.ST·February 25, 2020


TL;DR

This paper introduces a multivariate trimmed-mean estimator for robust mean estimation under adversarial contamination, demonstrating its optimality with minimal assumptions.

Contribution

It proposes a new multivariate trimmed-mean estimator and proves its optimality in robust mean estimation with adversarial noise.

Findings

01. Estimator achieves optimal robustness bounds

02. Performs well under minimal assumptions

03. Outperforms existing methods in contaminated settings

Abstract

We consider the problem of estimating the mean of a random vector based on i.i.d. observations and adversarial contamination. We introduce a multivariate extension of the trimmed-mean estimator and show its optimal performance under minimal conditions.

Equations

$\widehat{\mu}=\widehat{\mu}(X_{1},\ldots,X_{N})\in\mathbb{R}^{d}~.$

$\|\widehat{\mu}-\mu\|\leq\varepsilon(N,\delta)$ with probability at least $1-\delta$

$\left|\frac{1}{N}\sum_{i=1}^{N}X_{i}-\mu\right|\geq\varepsilon$ with probability at least $c\frac{\sigma_{X}^{2}}{\varepsilon^{2}N}$

$|\widehat{\mu}-\mu|\leq c\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}$ with probability $1-\delta$

$|\widehat{\mu}-\mu|\leq c\sigma_{X}\left(\sqrt{\eta}+\sqrt{\frac{\log(1/\delta)}{N}}\right)$

$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}}\right)$

$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}}+\sqrt{\lambda_{1}\eta}\right)~?$

$\|\widehat{\mu}-\mu\|\leq c\sqrt{\lambda_{1}\eta}$

$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{d}{N}}+\sqrt{\frac{\log(1/\delta)}{N}}+\eta\right)~.$

$Q_{p}(\overline{X})=\sup\left\{M\in\mathbb{R}:\mathbb{P}\left(\overline{X}\geq M\right)\geq 1-p\right\}~.$

$\widetilde{X}_{i}=\min\left\{X_{i},\mu+Q_{1-\eta/2}(\overline{X})\right\}~.$

$\mathbb{P}\left(\overline{X}\geq Q_{1-\eta/2}(\overline{X})\right)=\frac{\eta}{2}~,$

$\left|\left\{i:X_{i}-\mu\geq Q_{1-\eta/2}(\overline{X})\right\}\right|\leq\frac{3}{4}\eta N~.$

$Z=\min\left\{X,\mu+Q_{1-\eta/2}(\overline{X})\right\}~.$

$|\mathbb{E}Z-\mu|=\mathbb{E}\left[(\overline{X}-M)\mathbbm{1}_{\overline{X}\geq M}\right]~.$

$\overline{\cal E}(\eta,X)$

$\overline{\cal E}(\eta,X)\leq C\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}~.$

$\overline{\cal E}(\eta,X)+C\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}~,$

${\cal E}(\eta,X)\stackrel{\mathrm{def.}}{=}\max\left\{\mathbb{E}\left[|\overline{X}|\mathbbm{1}_{\overline{X}\leq Q_{\eta/2}(\overline{X})}\right],\mathbb{E}\left[|\overline{X}|\mathbbm{1}_{\overline{X}\geq Q_{1-\eta/2}(\overline{X})}\right]\right\}~.$

$c\sigma_{X}\max\left\{\sqrt{\eta},\sqrt{\frac{\log(2/\delta)}{N}}\right\}$

$\phi_{\alpha,\beta}(x)=\begin{cases}\beta&\text{if }x>\beta,\\ x&\text{if }x\in[\alpha,\beta],\\ \alpha&\text{if }x<\alpha,\end{cases}$

$\varepsilon=8\eta+12\frac{\log(4/\delta)}{N}~.$

$\widehat{\mu}=\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha,\beta}(\widetilde{X}_{i})~.$

$|\widehat{\mu}-\mu|\leq 3{\cal E}(4\varepsilon,X)+2\sigma_{X}\sqrt{\frac{\log(4/\delta)}{N}}~.$

$|\widehat{\mu}-\mu|\leq 10\sqrt{\varepsilon}\sigma_{X}~.$

$\frac{\varepsilon}{2}=\mathbb{P}\left(\overline{X}\geq M\right)\leq\frac{\sigma_{X}^{2}}{M^{2}}~,$

$Q_{1-\varepsilon/2}(\overline{X})\leq\sigma_{X}\sqrt{\frac{2}{\varepsilon}}~.$

$\mathbb{E}\left[(\overline{X}-M)\mathbbm{1}_{\overline{X}\geq M}\right]$

${\cal E}(\varepsilon,X)\leq\sigma_{X}\sqrt{8\varepsilon}~.$

$|\widehat{\mu}-\mu|\leq C\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}~,$


Full text

Gábor Lugosi, Shahar Mendelson

ICREA, Pompeu Fabra University, Barcelona GSE, The Australian National University

ICREA

Pg. Lluís Companys 23

08010 Barcelona, Spain

and Department of Economics and Business

Pompeu Fabra University, Barcelona, Spain

Mathematical Sciences Institute,

The Australian National University

Canberra, Australia


MSC classes: 62J02, 62G08, 60G25

Keywords: mean estimation, robust estimation

Gábor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant PGC2018-101643-B-I00; “High-dimensional problems in structured probabilistic models - Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017”; and Google Focused Award “Algorithms and Learning for AI”.

1 Introduction

Estimating the mean of a random vector based on independent and identically distributed samples is one of the most basic statistical problems. In the last few years the problem has attracted a lot of attention and important advances have been made both in terms of statistical performance and computational methodology.

In the simplest form of the mean estimation problem, one wishes to estimate the expectation $\mu=\mathbb{E}X$ of a random vector $X$ taking values in $\mathbb{R}^{d}$ , based on a sample $X_{1},\ldots,X_{N}$ consisting of independent copies of $X$ . An estimator is a (measurable) function of the data

$\widehat{\mu}=\widehat{\mu}(X_{1},\ldots,X_{N})\in\mathbb{R}^{d}~.$

We measure the quality of an estimator by the distribution of its Euclidean distance to the mean vector $\mu$ . More precisely, for a given $\delta>0$ —the confidence parameter—, one would like to ensure that

$\|\widehat{\mu}-\mu\|\leq\varepsilon(N,\delta)\quad\text{with probability at least }1-\delta$

with $\varepsilon(N,\delta)$ as small as possible. Here and in the entire article, $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^{d}$ .

The obvious choice of $\widehat{\mu}$ is the empirical mean $N^{-1}\sum_{i=1}^{N}X_{i}$, which, apart from its computational simplicity, has good statistical properties when the distribution is sufficiently well-behaved. However, it is well known that, even when $X$ is real valued, the empirical mean behaves sub-optimally and much better mean estimators are available (we refer the reader to the recent survey [20] for an extensive discussion). The reason for the suboptimal performance of the empirical mean is the damaging effect of outliers that are inevitably present when the distribution is heavy-tailed.

Informally put, outliers are sample points that are, in some sense, atypical; as a result they cause a significant distortion to the empirical mean. The crucial fact is that when $X$ is a heavy-tailed random variable, a typical sample contains a significant number of outliers, implying the empirical mean is likely to be distorted.

To exhibit the devastating effect that outliers cause, let $\varepsilon>0$ and note that there is a square integrable (univariate) random variable $X$ such that

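$\left|\frac{1}{N}\sum_{i=1}^{N}X_{i}-\mu\right|\geq\varepsilon\quad\text{with probability at least }c\frac{\sigma_{X}^{2}}{\varepsilon^{2}N}$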

for a positive absolute constant $c$ ; $\sigma_{X}^{2}$ is the variance of $X$ . In other words, the best possible error $\varepsilon(N,\delta)$ that can be guaranteed by the empirical mean (when only finite variance is assumed) is of the order of $\sigma_{X}/\sqrt{\delta N}$ . On the other hand, it is well known (see, e.g., the survey [20]) that there are estimators of the mean $\widehat{\mu}$ such that for all square-integrable random variables $X$ ,

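$|\widehat{\mu}-\mu|\leq c\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}\quad\text{with probability }1-\delta$ (1.1)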

where $c$ is a suitable absolute constant. An estimator that performs with an error $\varepsilon(N,\delta)$ of the order of $\sigma_{X}\sqrt{\log(2/\delta)/N}$ is called a sub-Gaussian estimator. Such estimators are optimal in the sense that no estimator can perform with a better error $\varepsilon(N,\delta)$ even if $X$ is known to be a Gaussian random variable.
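To get a feel for the gap between the two rates discussed here, the following sketch evaluates $\sigma_{X}/\sqrt{\delta N}$ and $\sigma_{X}\sqrt{\log(2/\delta)/N}$ side by side. The unspecified absolute constants are set to 1, which is an assumption made only for illustration, and the function names are hypothetical.

```python
import math


def empirical_mean_error(sigma, n, delta):
    """Order of the best guarantee for the empirical mean: sigma / sqrt(delta * N)."""
    return sigma / math.sqrt(delta * n)


def sub_gaussian_error(sigma, n, delta):
    """Order of the sub-Gaussian guarantee: sigma * sqrt(log(2 / delta) / N)."""
    return sigma * math.sqrt(math.log(2.0 / delta) / n)


# Absolute constants are taken to be 1 here (illustrative assumption).
for delta in (1e-1, 1e-3, 1e-6):
    print(delta, empirical_mean_error(1.0, 1000, delta), sub_gaussian_error(1.0, 1000, delta))
```

The dependence on $\delta$ is polynomial in the first expression and only logarithmic in the second, so the gap widens quickly as the required confidence grows.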

Because the empirical mean is so simple, and outliers are the probable cause of its sub-optimality, a natural way to improve its performance for real-valued random variables is to remove possible outliers by truncating $X$. Indeed, the so-called trimmed-mean (or truncated-mean) estimator is defined by removing a fraction of the sample, consisting of the $\gamma N$ largest and smallest points for some parameter $\gamma\in(0,1)$, and then averaging over the rest. This idea is one of the most classical tools in robust statistics and we refer to Tukey and McLaughlin [28], Huber and Ronchetti [15], Bickel [1], Stigler [26] for early work on the theoretical properties of the trimmed-mean estimator. However, the non-asymptotic sub-Gaussian property of the trimmed mean was established only recently, by Oliveira and Orenstein in [23]. They proved that if $\gamma=\kappa\log(1/\delta)/N$ for a constant $\kappa$, then the trimmed mean estimator $\widehat{\mu}$ satisfies (1.1) for all distributions with a finite variance $\sigma_{X}^{2}$ and with a constant $c$ that depends on $\kappa$ only.
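The classical trimmed mean described above can be sketched in a few lines. This is a minimal illustration, not the variant analyzed later in the article; rounding $\gamma N$ down to an integer is an assumption, and the function name is hypothetical.

```python
import math


def trimmed_mean(sample, gamma):
    """Drop the k smallest and k largest points, k = floor(gamma * n), then average the rest."""
    n = len(sample)
    k = math.floor(gamma * n)  # assumed rounding of gamma * N
    if 2 * k >= n:
        raise ValueError("gamma is too large: nothing is left to average")
    kept = sorted(sample)[k:n - k]
    return sum(kept) / len(kept)


data = [1.0, 2.0, 3.0, 4.0, 1000.0]
print("empirical mean:", sum(data) / len(data))
print("trimmed mean  :", trimmed_mean(data, 0.2))
```

With one extreme point in five, trimming a single point from each end is enough to keep the average among the typical values.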

An added value of the trimmed mean is that it seems to be robust to malicious noise, at least intuitively. Indeed, assume that an adversary can corrupt $\eta N$ of the $N$ points for some $\eta<1$ . The trimmed-mean estimator can withstand at least one sort of contamination: the adversary making the corrupted points either very large or very small. This does not rule out other damaging changes to the sample, but at least it gives the trimmed mean another potential edge over other estimators. And, in fact, as we prove in this article, the performance of the trimmed-mean estimator is as good as one can hope for under both heavy-tailed distributions and adversarial corruption. We show that—a simple variant of—the trimmed-mean estimator achieves

$|\widehat{\mu}-\mu|\leq c\sigma_{X}\left(\sqrt{\eta}+\sqrt{\frac{\log(1/\delta)}{N}}\right)$ (1.2)

with probability $1-\delta$ , for an absolute constant $c$ (see Theorem 1 for the detailed statement). The bound (1.2) holds for all univariate distributions with a finite variance, and is minimax optimal in that class of distributions. For distributions with lighter tail, the dependence on the contamination level $\eta$ can be improved. For example, for sub-Gaussian distributions $\sqrt{\eta}$ may be replaced by $\eta\sqrt{\log(1/\eta)}$ and the trimmed-mean estimator achieves that. As we explain in what follows, the parameter $\gamma$ that determines the level of trimming depends on the confidence parameter $\delta$ and contamination level $\eta$ only.

The problem of mean estimation in the multivariate case (i.e., when $X$ takes values in $\mathbb{R}^{d}$ for some $d>1$ ) is considerably more complex. For i.i.d. data without contamination, the best possible statistical performance for square-integrable random vectors is well understood: if $\Sigma=\mathbb{E}\left[(X-\mu)(X-\mu)^{T}\right]$ is the covariance matrix of $X$ whose largest eigenvalue and trace are denoted by $\lambda_{1}$ and $\mathrm{Tr}(\Sigma)$ , respectively, then for every $\delta>0$ , there exists a mean estimator $\widehat{\mu}$ such that, regardless of the distribution, with probability at least $1-\delta$ ,

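$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}}\right)$ (1.3)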

for some absolute constant $c$ . This bound is optimal in the sense that one cannot improve it even when the distribution is known to be Gaussian. The existence of such a “sub-Gaussian” estimator was established by Lugosi and Mendelson [21]. Computationally efficient versions have been subsequently constructed by Hopkins [14] and by Cherapanamjeri, Flammarion, and Bartlett [5], see also Depersin and Lecué [7]. Once again, we refer to the survey [20] for related results.

A natural question is how well one can estimate the mean of a random vector in the presence of adversarial contamination. In particular, one may ask the following:

Let $X$ be a random vector in $\mathbb{R}^{d}$ whose mean and covariance matrix exist. Let $X_{1},\ldots,X_{N}$ be i.i.d. copies of $X$. Then the adversary, acting maliciously and knowing the statistician’s intentions in advance, is free to change at most $\eta N$ of the sample points. How accurately can $\mu=\mathbb{E}X$ be estimated with respect to the Euclidean norm? In particular, given $\delta$ and $\eta$, does there exist an estimator and an absolute constant $c$ such that, regardless of the distribution of $X$, with probability at least $1-\delta$,

$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}}+\sqrt{\lambda_{1}\eta}\right)~?$ (1.4)

The main result of this article, Theorem 2, answers this question in the affirmative. To that end, we construct a procedure, based on the one-dimensional trimmed-mean estimator, that has the desired performance guarantees.

Related work

The model of estimation under adversarial contamination has been extensively addressed in the literature of computational learning theory. Its origins may be traced back to the malicious noise model of Valiant [29] and Kearns and Li [16]. In the context of mean estimation it has been investigated by Diakonikolas, Kamath, Kane, Li, Moitra, and Stewart [9, 10, 11], Steinhardt, Charikar, and Valiant [25], Minsker [22]. In particular, in [10] it is shown that when $N=\Omega((d/\eta)\log d)$ and $\lambda_{1}$ is the largest eigenvalue of the covariance matrix $\Sigma$ of $X$ , then there exists a computationally efficient estimator of the mean that satisfies

$\|\widehat{\mu}-\mu\|\leq c\sqrt{\lambda_{1}\eta}$

with probability at least $9/10$ for all distributions. Although this bound is sub-optimal in terms of the conditions and does not recover the sub-Gaussian bounds, the goal in [10], and in other articles in this direction as well, was mainly on computational efficiency. In contrast, our aim is to construct an estimator with optimal statistical performance, and the multivariate estimator we propose is not computationally feasible—at least in its naive implementation—in the sense that computing the estimator takes time that is exponential in the dimension. It is an intriguing problem to find computationally efficient mean estimators that have optimal statistical performance under the weakest possible assumptions: although such estimators are available for i.i.d. data from the results of Hopkins [14] and Cherapanamjeri, Flammarion, and Bartlett [5], these estimators are not expected to perform well under adversarial contamination.

The sub-Gaussian estimators achieving the bound (1.3) are based on median-of-means estimators. Such estimators have been studied under a (somewhat more restrictive) adversarial contamination model by Lecué and Lerasle [17] and by Minsker [22], see also Rodriguez and Valdora [24]. In particular, Minsker [22] studies estimators that cleverly combine Huber’s robust $M$-estimators with the median-of-means technique. His results imply a performance bound exactly of the form of (1.4). A disadvantage of Minsker’s estimator is that it assumes that the trace and operator norm of the covariance matrix are known up to a constant factor.

In a recent manuscript, Depersin and Lecué [7] study the problem of robust mean estimation in a slightly more restrictive model of contamination. Their main result is a computationally efficient multivariate mean estimator that achieves a performance similar to (1.4), though only when $\eta$ is at most a small constant times $\log(1/\delta)/N$; thus, it is only able to handle low levels of contamination.

Chen, Gao, and Ren [3] develop a general theory of minimax bounds under Huber’s contamination model (i.e., when the contamination is i.i.d.) for parametric families of distributions. In [4] the same authors study robust estimation of the mean vector and covariance matrix under Huber’s contamination model and derive sharp minimax bounds for Gaussian, and more generally elliptical, distributions. In particular, they show that if the uncontaminated data is Gaussian with identity covariance matrix, then Tukey’s median $\widehat{\mu}$ satisfies that, with probability at least $1-\delta$ ,

$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{d}{N}}+\sqrt{\frac{\log(1/\delta)}{N}}+\eta\right)~.$

Moreover, they prove that this estimator is minimax optimal up to constant factors. Note that (1.4) has a similar form except that the term $\eta$ is replaced by the weaker $\sqrt{\eta}$ . It is remarkable that this is the only (necessary) price one has to pay for moving from Gaussian distributions to arbitrary ones whose covariance matrix exists and from Huber’s contamination to adversarial one. Moreover, as we argue below, for sub-Gaussian distributions the term $\sqrt{\eta}$ may be improved to $\eta\sqrt{\log(1/\eta)}$ . We also refer to Dalalyan and Thompson [6] for recent related work.

The rest of the article is organized as follows. In Section 2 we discuss the univariate case and establish a performance bound for a version of the trimmed-mean estimator in Theorem 1. We argue that this bound is best possible up to the value of the absolute constant. In Section 3 we extend the discussion to the multivariate case, and construct a new estimator. The proof of the performance bound of the multivariate estimator is given in Section 4.

2 The real-valued case

Let $X$ be a real-valued random variable that has finite variance $\sigma_{X}^{2}$ . Set $\mu=\mathbb{E}X$ and define $\overline{X}=X-\mu$ . In what follows, $c,C$ denote positive absolute constants whose value may change at each appearance. For $0<p<1$ , define the quantile

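$Q_{p}(\overline{X})=\sup\left\{M\in\mathbb{R}:\mathbb{P}\left(\overline{X}\geq M\right)\geq 1-p\right\}~.$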

For simplicity of presentation, we assume throughout the article that $X$ has an absolutely continuous distribution. Under this assumption, it follows that $\mathbb{P}\left(\overline{X}\geq Q_{p}(\overline{X})\right)=1-p$ . However, we emphasize that this assumption is not restrictive: one may easily adjust the proof to include all distributions with a finite second moment. Another solution is that the statistician can always add a small independent Gaussian noise to the sample points, thus ensuring that the distribution has a density and without affecting statistical performance.

For reasons of comparison, our starting point is a simple lower bound that limits the performance of every mean estimator. Similar arguments appear in [10] and [22].

While the adversary has total freedom to change at most $\eta N$ of the sample points, consider first a rather trivial action: changing the i.i.d. sample $(X_{i})_{i=1}^{N}$ to $(\widetilde{X}_{i})_{i=1}^{N}$ defined by

$\widetilde{X}_{i}=\min\left\{X_{i},\mu+Q_{1-\eta/2}(\overline{X})\right\}~.$ (2.2)

Since

$\mathbb{P}\left(\overline{X}\geq Q_{1-\eta/2}(\overline{X})\right)=\frac{\eta}{2}~,$

by a binomial tail bound, with probability at least $1-2\exp(-c\eta N)$ ,

$\left|\left\{i:X_{i}-\mu\geq Q_{1-\eta/2}(\overline{X})\right\}\right|\leq\frac{3}{4}\eta N~.$

In particular, on this event, the adversary can change all sample points $X_{i}$ that are bigger than $\mu+Q_{1-\eta/2}(\overline{X})$ . As a result, there is no way one can determine whether $(\widetilde{X}_{i})_{i=1}^{N}$ is a corrupted sample, originally selected according to $X$ and then changed as in (2.2), or an uncorrupted sample selected according to the random variable

$Z=\min\left\{X,\mu+Q_{1-\eta/2}(\overline{X})\right\}~.$

Therefore, on this event, no procedure can distinguish between $\mathbb{E}X$ and $\mathbb{E}Z$ , which means that the error caused by this action is at least $|\mathbb{E}Z-\mu|$ . Note that for $M=Q_{1-\eta/2}(\overline{X})$ one has that

$|\mathbb{E}Z-\mu|=\mathbb{E}\left[(\overline{X}-M)\mathbbm{1}_{\overline{X}\geq M}\right]~.$

Since the adversary can target the lower tail of $X$ in exactly the same way, it follows that, with probability at least $1-2\exp(-c\eta N)$ , no estimator can perform with accuracy better than

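$\overline{\cal E}(\eta,X)\stackrel{\mathrm{def.}}{=}\max\left\{\mathbb{E}\left[|\overline{X}-Q_{\eta/2}(\overline{X})|\mathbbm{1}_{\overline{X}\leq Q_{\eta/2}(\overline{X})}\right],\mathbb{E}\left[|\overline{X}-Q_{1-\eta/2}(\overline{X})|\mathbbm{1}_{\overline{X}\geq Q_{1-\eta/2}(\overline{X})}\right]\right\}~.$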

Of course, the adversary has a second trivial action: do nothing. That is a better corruption strategy (in the minimax sense) when

$\overline{\cal E}(\eta,X)\leq C\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}~.$

Therefore, if one wishes to find a procedure that performs with probability at least $1-\delta-2\exp(-c\eta N)$ , the best error one can hope for is

$\overline{\cal E}(\eta,X)+C\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}~,$

where $c$ and $C$ are absolute constants.
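The adversary's first trivial action in the lower-bound argument (clipping every point above $\mu+Q_{1-\eta/2}(\overline{X})$) can be simulated directly. The sketch below assumes a standard Gaussian $X$, so that $\mu=0$ and the quantile is available in closed form from the standard library; the sample size, seed, and function name are illustrative choices, not part of the argument.

```python
import random
from statistics import NormalDist


def clip_upper_tail(sample, mu, q):
    """Replace every point above mu + q by mu + q and count how many points changed."""
    threshold = mu + q
    corrupted = [min(x, threshold) for x in sample]
    changed = sum(1 for x in sample if x > threshold)
    return corrupted, changed


rng = random.Random(0)
eta, n = 0.1, 10_000
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # assumed distribution: standard Gaussian
q = NormalDist().inv_cdf(1 - eta / 2)              # Q_{1 - eta/2} of the centered variable
corrupted, changed = clip_upper_tail(sample, 0.0, q)

print("points changed:", changed, "budget eta * N:", eta * n)
print("shift of the empirical mean:", sum(sample) / n - sum(corrupted) / n)
```

On a typical draw about $\eta N/2$ points are changed, which stays within the adversary's budget of $\eta N$, and the clipped sample is indistinguishable from an uncorrupted sample of the truncated variable $Z$.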

A rather surprising fact is that in the real-valued case, the two trivial actions cause the largest possible damage. Indeed, we show that there is an estimator that is a simple modification of trimmed mean that attains what is almost the optimal error—with $\overline{\cal E}(\eta,X)$ replaced by

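${\cal E}(\eta,X)\stackrel{\mathrm{def.}}{=}\max\left\{\mathbb{E}\left[|\overline{X}|\mathbbm{1}_{\overline{X}\leq Q_{\eta/2}(\overline{X})}\right],\mathbb{E}\left[|\overline{X}|\mathbbm{1}_{\overline{X}\geq Q_{1-\eta/2}(\overline{X})}\right]\right\}~.$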

Remark. It is straightforward to construct a random variable $X$ for which $\overline{\cal E}(\eta,X)\geq c_{1}\sqrt{\eta}\sigma_{X}$. (Take, for example, $X$ that takes value $0$ with probability $1-\eta$ and values $\pm\sigma_{X}/\sqrt{\eta}$ with probability $\eta/2$ each.) Thus, in terms of $\eta,\sigma_{X},\delta$ and $N$, the best minimax error rate that is possible in the corrupted mean estimation problem for real-valued random variables is

$c\sigma_{X}\max\left\{\sqrt{\eta},\sqrt{\frac{\log(2/\delta)}{N}}\right\}$

for a suitable absolute constant $c$ .

Next, let us define the modified trimmed-mean estimator. The estimator splits the data into two equal parts. Half of the data points are used to determine the truncation at the appropriate level. The points from the other half are averaged as is, except for the data points that fall outside of the estimated quantiles, which are truncated prior to averaging. For convenience, assume that the data consists of $2N$ independent copies of the random variable $X$, denoted by $X_{1},\ldots,X_{N},Y_{1},\ldots,Y_{N}$. The statistician has access to the corrupted sample $\widetilde{X}_{1},\ldots,\widetilde{X}_{N},\widetilde{Y}_{1},\ldots,\widetilde{Y}_{N}$, where at most $2\eta N$ of the sample points have been changed by an adversary.

For $\alpha\leq\beta$ , let

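$\phi_{\alpha,\beta}(x)=\begin{cases}\beta&\text{if }x>\beta,\\ x&\text{if }x\in[\alpha,\beta],\\ \alpha&\text{if }x<\alpha,\end{cases}$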

and for $x_{1},\ldots,x_{m}\in\mathbb{R}$ let $x_{1}^{*}\leq x_{2}^{*}\leq\cdots\leq x_{m}^{*}$ be its non-decreasing rearrangement.

With this notation in place, the definition of the estimator is as follows:

Univariate mean estimator.

$(1)$ Consider the corrupted sample $\widetilde{X}_{1},\ldots,\widetilde{X}_{N},\widetilde{Y}_{1},\ldots,\widetilde{Y}_{N}$ as input.

$(2)$ Given the corruption parameter $\eta$ and confidence level $\delta$ , set

$\varepsilon=8\eta+12\frac{\log(4/\delta)}{N}~.$

$(3)$ Let $\alpha=\widetilde{Y}_{\varepsilon N}^{*}$ and $\beta=\widetilde{Y}_{(1-\varepsilon)N}^{*}$ and set

$\widehat{\mu}=\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha,\beta}(\widetilde{X}_{i})~.$
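Steps (1) to (3) can be sketched as follows. The definition leaves open how the ranks $\varepsilon N$ and $(1-\varepsilon)N$ are rounded to integers; rounding the lower rank down and the upper rank up is an assumption of this sketch, as are the function names and the synthetic contaminated data in the usage part.

```python
import math
import random


def clip(x, alpha, beta):
    """phi_{alpha,beta}: truncate x to the interval [alpha, beta]."""
    return min(max(x, alpha), beta)


def trimmed_mean_estimator(x_tilde, y_tilde, eta, delta):
    """Truncation levels come from the Y half; the clipped X half is averaged."""
    n = len(x_tilde)
    if len(y_tilde) != n:
        raise ValueError("both halves must have the same size N")
    eps = 8 * eta + 12 * math.log(4 / delta) / n
    if eps >= 0.5:
        raise ValueError("epsilon >= 1/2: N is too small for this eta and delta")
    y_sorted = sorted(y_tilde)
    # Assumed rounding of the 1-based ranks eps * N and (1 - eps) * N.
    low_rank = max(1, math.floor(eps * n))
    high_rank = min(n, math.ceil((1 - eps) * n))
    alpha, beta = y_sorted[low_rank - 1], y_sorted[high_rank - 1]
    return sum(clip(x, alpha, beta) for x in x_tilde) / n


# Usage: standard Gaussian data (true mean 0) with eta * N points per half replaced by 1e6.
rng = random.Random(1)
N, eta, delta = 2000, 0.02, 0.05
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [rng.gauss(0.0, 1.0) for _ in range(N)]
for i in range(int(eta * N)):
    xs[i] = 1e6
    ys[i] = 1e6
print("empirical mean:", sum(xs) / N)
print("trimmed mean  :", trimmed_mean_estimator(xs, ys, eta, delta))
```

Because every corrupted point is clipped to $[\alpha,\beta]$ before averaging, a single changed point can move the estimate by at most $(\beta-\alpha)/N$, whereas it can move the empirical mean arbitrarily far.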

Theorem 1.

Let $\delta\in(0,1)$ be such that $\delta\geq e^{-N}/4$ . Then, with probability at least $1-\delta$ ,

$|\widehat{\mu}-\mu|\leq 3{\cal E}(4\varepsilon,X)+2\sigma_{X}\sqrt{\frac{\log(4/\delta)}{N}}~.$

Moreover, with probability at least $1-4\exp(-\varepsilon N/12)$ ,

$|\widehat{\mu}-\mu|\leq 10\sqrt{\varepsilon}\sigma_{X}~.$

Remark. The necessity of prior knowledge of the confidence parameter $\delta$ was pointed out (even in the contamination-free case) by Devroye, Lerasle, Lugosi, and Oliveira [8], see [20] for further discussion. The contamination level need not be known exactly. If an upper bound $\overline{\eta}\geq\eta$ is available and one uses the estimator with parameter $\overline{\eta}$ instead of $\eta$ , then the same bound holds with $\eta$ replaced by $\overline{\eta}$ .

To explain the meaning of Theorem 1, observe that for $M=Q_{1-\varepsilon/2}(\overline{X})$ , one has

$\frac{\varepsilon}{2}=\mathbb{P}\left(\overline{X}\geq M\right)\leq\frac{\sigma_{X}^{2}}{M^{2}}~,$

and in particular,

$Q_{1-\varepsilon/2}(\overline{X})\leq\sigma_{X}\sqrt{\frac{2}{\varepsilon}}~.$ (2.4)

Also,

[TABLE]

implying that for every $X$ ,

${\cal E}(\varepsilon,X)\leq\sigma_{X}\sqrt{8\varepsilon}~.$ (2.6)

Hence, Theorem 1 shows that the estimator attains the minimax rate of the corrupted mean-estimation problem, noted previously.

Of course, Theorem 1 actually implies sharper individual bounds: if $\eta N\leq\log(2/\delta)$ , then $\varepsilon\sim N^{-1}\log(2/\delta)$ and the assertion of Theorem 1 is that, with probability at least $1-\delta$ ,

$|\widehat{\mu}-\mu|\leq C\sigma_{X}\sqrt{\frac{\log(2/\delta)}{N}}~,$

which matches the optimal sub-Gaussian error rate. If, on the other hand, $\eta N>\log(2/\delta)$ , then with probability at least $1-\delta$ ,

$|\widehat{\mu}-\mu|\leq C\sigma_{X}\sqrt{\eta}~,$

essentially matching the lower bound (2.3).

Remark. Observe that the upper bound on ${\cal E}(\varepsilon,X)$ in (2.6) is based only on $\sigma_{X}$ , and therefore on the fact that $X$ is square-integrable. Under stronger moment assumptions on $X$ , an improved bound can be easily established. For example, if $X$ is sub-Gaussian, that is, if for every $p\geq 2$ , $\left(\mathbb{E}|\overline{X}|^{p}\right)^{1/p}\leq c\sqrt{p}\sigma_{X}$ , the same argument used in (2.6) for $p=\log(1/\varepsilon)$ shows that

${\cal E}(\varepsilon,X)\leq c\,\varepsilon\sqrt{\log(1/\varepsilon)}\,\sigma_{X}~.$

One may wonder if $\eta\sqrt{\log(1/\eta)}$ is the correct order of dependence on the contamination level for sub-Gaussian distributions. As proved by Chen, Gao, and Ren [4], if $X$ is Gaussian and the contamination comes from Huber’s model, the correct dependence on the contamination level is proportional to $\eta$, suggesting a possible slight improvement. At the same time, as discussed above, $\overline{\mathcal{E}}(\eta,X)$ is a lower bound for any estimator. One may easily check that, if $X$ is Gaussian, $\overline{\mathcal{E}}(\eta,X)$ is of the order of $\eta/\sqrt{\log(1/\eta)}$, so this lower bound is loose in this case. Interestingly, however, there exist sub-Gaussian distributions under which $\overline{\mathcal{E}}(\eta,X)$ is of the order of $\eta\sqrt{\log(1/\eta)}$. (As an example, one may take $X=\mathbbm{1}_{|G|\leq Q}\min(1,|G|)+\mathbbm{1}_{|G|>Q}|G|$ where $G$ is a standard Gaussian random variable and $Q$ is its $1-\eta/2$ quantile.) This means that for sub-Gaussian distributions, the upper bound of Theorem 1 is indeed tight, up to constant factors. Note that our lower bound uses the adversarial nature of the contamination, so it might be the case that under Huber’s model, even for sub-Gaussian distributions, $\eta$ is the correct order.
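The Gaussian claim can be checked numerically. The sketch below assumes that the relevant quantity is the upper-tail excess $\mathbb{E}[(G-M)\mathbbm{1}_{G\geq M}]$ with $M=Q_{1-\eta/2}(G)$ from the lower-bound argument (by symmetry the lower tail gives the same value), and it uses the standard identity $\mathbb{E}[G\mathbbm{1}_{G\geq m}]=\varphi(m)$ for the standard Gaussian density $\varphi$. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def gaussian_upper_tail_excess(eta):
    """E[(G - M) 1{G >= M}] for a standard Gaussian G and M = Q_{1 - eta/2}(G)."""
    std = NormalDist()
    m = std.inv_cdf(1 - eta / 2)
    # E[G 1{G >= m}] = pdf(m) and P(G >= m) = eta / 2.
    return std.pdf(m) - m * eta / 2


for eta in (1e-2, 1e-3, 1e-4):
    ratio = gaussian_upper_tail_excess(eta) / (eta / math.sqrt(math.log(1 / eta)))
    print(eta, ratio)
```

If the ratio stays within a bounded range as $\eta$ decreases, the excess is of the order of $\eta/\sqrt{\log(1/\eta)}$, in line with the statement above.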

2.1 Proof of Theorem 1

Recall that one is given the corrupted sample $\widetilde{X}_{1},\ldots,\widetilde{X}_{N},\widetilde{Y}_{1},\ldots,\widetilde{Y}_{N}$ , out of which at most $2\eta N$ of the sample points have been corrupted. Also, $(z_{i}^{*})_{i=1}^{N}$ denotes a non-decreasing rearrangement of the sequence $(z_{i})_{i=1}^{N}$ .

The first step of the estimation procedure determines the truncation level, which is done using the first half of the corrupted sample.

Consider the corruption-free sample $Y_{1},\ldots,Y_{N}$ and let $U=\mathbbm{1}_{\overline{X}\geq Q_{1-2\varepsilon}(\overline{X})}$ . Since $X$ is absolutely continuous, we have that $\mathbb{P}\left(\overline{X}\geq Q_{1-2\varepsilon}(\overline{X})\right)=2\varepsilon$ and

[TABLE]

A straightforward application of Bernstein’s inequality shows that, with probability at least $1-\exp(-\varepsilon N/12)$ ,

[TABLE]

A similar argument for $U=\mathbbm{1}_{\overline{X}>Q_{1-\varepsilon/2}(\overline{X})}$ implies that, with probability at least $1-\exp(-\varepsilon N/12)$ ,

[TABLE]

Similarly, with probability at least $1-2\exp(-\varepsilon N/12)$ ,

[TABLE]

and, with probability at least $1-2\exp(-\varepsilon N/12)$ ,

[TABLE]

Thus, with probability at least $1-4\exp(-\varepsilon N/12)\geq 1-\delta/2$ , (2.7)–(2.10) hold simultaneously on an event we denote by $E$ . Importantly, the event $E$ only depends on the uncorrupted sample $Y_{1},\ldots,Y_{N}$ .

Since $\eta\leq\varepsilon/8$, following any corruption of at most $2\eta N$ points, on the event $E$

[TABLE]

and

[TABLE]

in other words,

[TABLE]

Similarly, on the event $E$ , we also have

[TABLE]

Recall that the truncation levels are

$\alpha=\widetilde{Y}_{\varepsilon N}^{*}\quad\text{and}\quad\beta=\widetilde{Y}_{(1-\varepsilon)N}^{*}~.$

To prove Theorem 1, first we show that $(1/N)\sum_{i=1}^{N}\phi_{\alpha,\beta}(X_{i})$ satisfies an inequality of the wanted form, and then we prove that corruption does not change the empirical mean of $\phi_{\alpha,\beta}$ by too much; that is, that

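$\left|\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha,\beta}(X_{i})-\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha,\beta}(\widetilde{X}_{i})\right|$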

is also small enough.

For the first step, note that on the event $E$ ,

[TABLE]

The first term on the right-hand side of (2.1) is bounded by

[TABLE]

On the other hand, since

[TABLE]

the second term on the right-hand side of (2.1) is a sum of centered i.i.d. random variables (independent of $E$) that are upper bounded by $Q_{1-\varepsilon/2}(\overline{X})+{\cal E}(4\varepsilon,X)$ and whose variance is at most $\sigma_{X}^{2}$. Therefore, by Bernstein’s inequality, conditioned on $Y_{1},\ldots,Y_{N}$, with probability at least $1-\delta/4$,

[TABLE]

where we used the fact that by (2.4), $Q_{1-\varepsilon/2}(\overline{X})\log(4/\delta)/N\leq\sigma_{X}\sqrt{\frac{\log(4/\delta)}{6N}}$ and that ${\cal E}(4\varepsilon,X)\log(4/\delta)/N\leq{\cal E}(4\varepsilon,X)$ by the assumption that $\delta\geq e^{-N}/4$ .

An identical argument for the lower tail shows that, on the event $E$ , with probability at least $1-\delta/2$ ,

[TABLE]

It remains to show that, on the event $E$ ,

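$\left|\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha,\beta}(X_{i})-\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha,\beta}(\widetilde{X}_{i})\right|$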

is small. Since $\phi_{\alpha,\beta}(X_{i})\neq\phi_{\alpha,\beta}(\widetilde{X}_{i})$ for at most $2\eta N$ indices, and for such points the maximal gap is

[TABLE]

it follows that

[TABLE]

since $\eta\leq\varepsilon/8$ . Finally, note that

[TABLE]

and therefore, on the event $E$ , we have

[TABLE]

The second statement of the theorem now follows by (2.6).

3 Robust multivariate mean estimation

In this section we present the main findings of the article: we construct a multivariate version of the robust mean estimator and establish the corresponding performance bound announced in the introduction.

As one may expect, the procedure in the multi-dimensional case is significantly more involved than in dimension one. In what follows, $X$ is a random vector taking values in $\mathbb{R}^{d}$ with mean $\mu=\mathbb{E}X$ and covariance matrix $\Sigma$. As before, we write $\overline{X}=X-\mu$, $\lambda_{1}$ denotes the largest eigenvalue of $\Sigma$, and $\mathrm{Tr}(\Sigma)=\mathbb{E}\left\|\overline{X}\right\|^{2}$ is its trace.

Recall that a mean estimator receives as data a sample $(\widetilde{X}_{i})_{i=1}^{N}$ that an adversary fabricates by corrupting at most $\eta N$ points of a sample $X_{1},\ldots,X_{N}$ of independent, identically distributed copies of the random vector $X$ . As in the univariate case, the estimator requires knowledge of the contamination level $\eta$ and the confidence parameter $\delta$ . Once again, for clarity of the presentation, we assume that $X$ has an absolutely continuous distribution with respect to the Lebesgue measure.

Theorem 2.

Assume that $X$ is a random vector in $\mathbb{R}^{d}$ that has a mean and covariance matrix. There exists a mean estimator $\widehat{\mu}$ that takes the parameters $\delta\in(0,1),\eta\in[0,1)$ and the contaminated data $(\widetilde{X}_{i})_{i=1}^{N}$ as input, and satisfies that, with probability at least $1-\delta$ ,

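$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}}+\sqrt{\lambda_{1}\eta}\right)~,$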

where $c>0$ is a numerical constant.

A value of the numerical constant is explicitly given in the proof. However, no attempt has been made to optimize its value.

The same remark as in the univariate case on the previous knowledge of $\eta$ and $\delta$ , mentioned after Theorem 1, applies here as well.

As it is pointed out in the introduction, the bound of Theorem 2 coincides with the best possible bound in the corruption-free case up to the term $\sqrt{\lambda_{1}\eta}$ that is the price one has to pay for adversarial corruption. The fact that the term $\sqrt{\lambda_{1}\eta}$ is inevitable in the upper bound follows from the fact that for any upper bound for the norm of difference $\|\widehat{\mu}-\mu\|$ , the same upper bound holds for any one-dimensional marginal. Hence, the necessity of this term follows from our arguments in the univariate case. At the same time, similarly to the univariate case, under higher moment assumptions, the term $\sqrt{\lambda_{1}\eta}$ may be improved. For instance, if the distribution is sub-Gaussian (in the sense that all one-dimensional projections are sub-Gaussian), then this term may be replaced by $\eta\sqrt{\log(1/\eta)}\sqrt{\lambda_{1}}$ . This may be seen by a straightforward modification of the proof.

Remarkably, the malicious sample corruption affects only the “weak” term of the bound, that is, it scales with the square root of the operator norm of the covariance matrix. Indeed, if the corruption parameter $\eta$ is such that $\eta N\leq\log(2/\delta)$ , then, with probability at least $1-\delta$ , $\widehat{\mu}$ satisfies

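$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}}\right)~{},$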

matching the optimal bound for multivariate mean estimation from [21] in the corruption-free case. If, on the other hand, the corruption parameter is larger, then Theorem 2 implies that with probability at least $1-2\exp(-\eta N/c)$,

$\|\widehat{\mu}-\mu\|\leq c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\lambda_{1}\eta}\right)$

for a numerical constant $c>0$ .
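The probability $1-2\exp(-\eta N/c)$ comes from a particular choice of the confidence parameter. The short derivation below spells out the substitution. It assumes that the bound of Theorem 2 consists of a trace term, a confidence term and the contamination term $\sqrt{\lambda_{1}\eta}$, and $c'$ denotes a generic constant that is not taken from the paper.

```latex
% Choose the confidence parameter so that \log(2/\delta) = \eta N / c.
\delta = 2\exp(-\eta N/c)
\quad\Longrightarrow\quad
\frac{\log(1/\delta)}{N} \le \frac{\log(2/\delta)}{N} = \frac{\eta}{c},
% hence the confidence term is absorbed by the contamination term:
\sqrt{\frac{\lambda_{1}\log(1/\delta)}{N}} \le \sqrt{\frac{\lambda_{1}\eta}{c}}
\quad\Longrightarrow\quad
\|\widehat{\mu}-\mu\| \le c'\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}}+\sqrt{\lambda_{1}\eta}\right).
```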

In what follows we describe the construction of the mean estimator $\widehat{\mu}$ that satisfies the announced performance bound.

3.1 The multivariate mean estimator

The main component is a mean estimation procedure that, in order to perform well, requires information on $\mathrm{Tr}(\Sigma)$ and $\lambda_{1}$ . Since such information is not assumed to be available, we produce an estimator depending on a tuning parameter $Q$ . Then we use a simple mechanism of choosing the appropriate value of $Q$ .

Just like in the univariate case, for simplicity of notation, assume that the estimator receives $2N$ data points $\widetilde{X}_{1},\ldots,\widetilde{X}_{N},\widetilde{Y}_{1},\ldots,\widetilde{Y}_{N}$ , and that at most $2\eta N$ points of the original independent sample $X_{1},\ldots,X_{N},Y_{1},\ldots,Y_{N}$ have been changed by the adversary. The procedure computes, for each unit vector $v$ and tuning parameter $Q>0$ , the trimmed mean estimate of the expectation of the projection of $X$ to the line spanned by $v$ with a minor difference: the truncation level is widened depending on the parameter $Q$ . Each one of these estimators defines a slab in $\mathbb{R}^{d}$ . The details are as follows:

Multivariate mean estimator.

$(1)$ Set

$\varepsilon=\max\left(10\eta,2560\frac{\log(2/\delta)}{N}\right)~{}.$

$(2)$ Let $S^{d-1}$ be the Euclidean unit sphere in $\mathbb{R}^{d}$ and for every $v\in S^{d-1}$ define

$\alpha_{v}=\left(\left\langle\widetilde{Y}_{i},v\right\rangle\right)_{(\varepsilon/2)N}^{*}\quad{\rm and}\quad\beta_{v}=\left(\left\langle\widetilde{Y}_{i},v\right\rangle\right)_{(1-\varepsilon/2)N}^{*}~{}.$

$(3)$ For every $v\in S^{d-1}$ and $Q>0$ , set

$U_{Q}(v)=\frac{1}{N}\sum_{i=1}^{N}\phi_{\alpha_{v}-Q,\beta_{v}+Q}\left(\left\langle\widetilde{X}_{i},v\right\rangle\right)~{},$

and let

$\Gamma(v,Q)=\left\{x\in\mathbb{R}^{d}:|\left\langle x,v\right\rangle-U_{Q}(v)|\leq 2\varepsilon Q\right\}~{}.$

$(4)$ For each $Q>0$ , set

$\Gamma(Q)=\bigcap_{v\in S^{d-1}}\Gamma(v,Q)~{}.$

$(5)$ Let $i^{*}\in\mathbb{Z}$ be the smallest such that $\bigcap_{i\geq i^{*}}\Gamma(2^{i})\neq\emptyset$ . Define $\widehat{\mu}$ to be any point in

$\bigcap_{i\in\mathbb{Z}:i\geq i^{*}}\Gamma(2^{i})~{}.$
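Steps $(1)$ to $(4)$ can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: the sphere $S^{d-1}$ is replaced by a finite set of directions, $\phi_{a,b}$ is taken to be clipping to $[a,b]$, and the rounding of the ranks $(\varepsilon/2)N$ and $(1-\varepsilon/2)N$ is a choice made here. The names `epsilon_level`, `slab_centers` and `in_gamma` are hypothetical.

```python
import math
import numpy as np

def epsilon_level(eta, delta, n):
    """Step (1): epsilon = max(10*eta, 2560*log(2/delta)/N)."""
    return max(10.0 * eta, 2560.0 * math.log(2.0 / delta) / n)

def slab_centers(x_tilde, y_tilde, directions, q, eps):
    """Steps (2)-(3) on a finite set of unit directions (rows of `directions`).

    Returns the vector of U_Q(v), one entry per direction.
    """
    n = y_tilde.shape[0]
    proj_y = np.sort(y_tilde @ directions.T, axis=0)  # column j: sorted <Y_i, v_j>
    # Order statistics at ranks (eps/2)N and (1 - eps/2)N; the rounding used
    # here is an assumption of this sketch.
    lo = min(int(eps / 2.0 * n), n - 1)
    hi = max(n - 1 - lo, lo)
    alpha, beta = proj_y[lo], proj_y[hi]
    # phi_{a,b} is taken to be clipping to [a, b]; the window is widened by Q.
    clipped = np.clip(x_tilde @ directions.T, alpha - q, beta + q)
    return clipped.mean(axis=0)

def in_gamma(point, centers, directions, q, eps):
    """Step (4) on the finite direction set: does `point` lie in every slab?"""
    return bool(np.all(np.abs(directions @ point - centers) <= 2.0 * eps * q))

# Toy run: Gaussian data in R^3 with about 1% of each half replaced by outliers.
rng = np.random.default_rng(0)
n, d, eta = 5000, 3, 0.01
mu = np.array([1.0, -2.0, 0.5])
x = rng.standard_normal((n, d)) + mu
y = rng.standard_normal((n, d)) + mu
x[: int(eta * n)] = 1e6
y[: int(eta * n)] = 1e6
dirs = rng.standard_normal((30, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
eps, q = 10.0 * eta, 1.0  # eps set by hand; epsilon_level needs a much larger N
centers = slab_centers(x, y, dirs, q, eps)
max_err = float(np.max(np.abs(centers - dirs @ mu)))
print(f"largest |U_Q(v) - <mu, v>| over the sampled directions: {max_err:.4f}")
```

The constant $2560$ in step $(1)$ makes $\varepsilon<1$ only for fairly large $N$, which is why the toy run fixes `eps` directly. Restricting to finitely many directions yields a superset of $\Gamma(Q)$, so `in_gamma` is a necessary condition for membership rather than a sufficient one.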

Each set $\Gamma(Q)$ is an intersection of random slabs, one for each direction in the sphere $S^{d-1}$ . The “center” of the slab associated with the direction $v$ is $U_{Q}(v)$ and its width is proportional to $\varepsilon Q$ . As we show in what follows, there is some $i_{0}\in\mathbb{Z}$ such that with probability at least $1-\delta$ , the sets $\Gamma(2^{i}),\ i\geq i_{0}$ are nested, implying that $\widehat{\mu}$ is well-defined. Note that the last step of selecting the value of $Q$ is reminiscent of Lepski’s method [19] or the related method “intersection of confidence intervals” by Goldenshluger and Nemirovski [12].
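The selection rule of step $(5)$ is easiest to visualize in dimension one, where each $\Gamma(2^{i})$ is an interval. The sketch below is a simplification under stated assumptions: it works with a finite list of intervals instead of all $i\in\mathbb{Z}$, and the function name `select_by_intersection` and the example intervals are hypothetical.

```python
def select_by_intersection(intervals):
    """Lepski-type selection on a finite list of intervals.

    intervals: list of (lo, hi) pairs ordered by increasing scale Q = 2**i.
    Returns (k, point), where k is the smallest index such that the intervals
    k, k+1, ..., last share a common point, and point lies in that intersection.
    Returns None if even the last interval is empty.
    """
    best = None
    run_lo, run_hi = float("-inf"), float("inf")
    # Scan from the largest scale downwards, keeping the running intersection.
    for k in range(len(intervals) - 1, -1, -1):
        lo, hi = intervals[k]
        run_lo, run_hi = max(run_lo, lo), min(run_hi, hi)
        if run_lo > run_hi:
            break  # interval k empties the intersection, so stop here
        best = (k, 0.5 * (run_lo + run_hi))
    return best

# Made-up intervals: the two smallest scales disagree with each other.
intervals = [(0.85, 0.95), (1.2, 1.4), (0.82, 1.22), (0.6, 1.4), (0.2, 1.8)]
result = select_by_intersection(intervals)
print(result)
```

Scanning downwards is enough because the intersection over indices at least $k$ can only shrink as $k$ decreases, so once it becomes empty it stays empty.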

4 Proof of Theorem 2

The heart of the proof of Theorem 2 is the following proposition that describes the performance of an estimator with the correct tuning parameter $Q$ .

The role of $Q$ is to incorporate the “global complexity” of $S^{d-1}$ . In particular, if $Q$ is selected properly, that is enough to ensure that $\Gamma(Q)$ is nonempty and contains a good estimator of $\mu$ . This is formalized in the next proposition.

Proposition 1.

Let

[TABLE]

and consider $Q\in[2Q_{0},4Q_{0}]$ . Then, with probability at least $1-2\exp(-\varepsilon N/2560)\geq 1-\delta$ , $\Gamma(Q)\not=\emptyset$ and for every $z\in\Gamma(Q)$ ,

[TABLE]

Observe that for every $Q$ , the diameter of $\Gamma(Q)$ is at most $4\varepsilon Q$ . Indeed, if $x_{1},x_{2}\in\Gamma(Q)$ then for every $v\in S^{d-1}$ ,

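$|\left\langle x_{1}-x_{2},v\right\rangle|\leq|\left\langle x_{1},v\right\rangle-U_{Q}(v)|+|\left\langle x_{2},v\right\rangle-U_{Q}(v)|\leq 4\varepsilon Q~{},$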

implying that $\|x_{1}-x_{2}\|\leq 4\varepsilon Q$ .

The key component in the proof of Proposition 1 is the next lemma.

Lemma 1.

For each $i\in\{1,\ldots,N\}$ and $v\in S^{d-1}$, define $\overline{Y}_{i}(v)=\left\langle Y_{i}-\mu,v\right\rangle$. With probability at least $1-\exp(-\varepsilon N/2560)\geq 1-\delta/2$,

[TABLE]

Lemma 1 is a uniform version of the analogous claim used in the univariate case.

Proof. Let us prove the first inequality; the second is proved by an identical argument and is omitted. Consider the function $\chi:\mathbb{R}\to\mathbb{R}$ , defined by

[TABLE]

Observe that $\mathbbm{1}_{\{\overline{Y}(v)\geq Q_{0}\}}\leq\chi(\overline{Y}(v))\leq\mathbbm{1}_{\{\overline{Y}(v)\geq Q_{0}/2\}}$ , and that $\chi$ is Lipschitz with constant $2/Q_{0}$ . Therefore, if $\varepsilon_{1},\ldots,\varepsilon_{N}$ are independent, symmetric $\{-1,1\}$ -valued random variables that are independent of the $(Y_{i})_{i=1}^{N}$ , then

[TABLE]

where in the second step one uses the standard contraction lemma for Rademacher averages, see Ledoux and Talagrand [18].

To bound the second term on the right-hand side, recall that $Q_{0}\geq 16\sqrt{\lambda_{1}/\varepsilon}$ , and thus, for every $v\in S^{d-1}$ ,

[TABLE]

To bound the first term, note that

[TABLE]

Hence, by the definition of $Q_{0}$ ,

[TABLE]

By Talagrand’s concentration inequality for empirical processes indexed by a class of uniformly bounded functions [27], with probability at least $1-\exp(-x)$ ,

[TABLE]

(see [2, Exercise 12.15] for the value of the numerical constant).

With the choice of $x=\varepsilon N/2560$ one has that, with probability at least $1-\exp(-\varepsilon N/2560)$ ,

[TABLE]

as required.
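The Rademacher average of linear functionals that enters the proof above can be checked numerically. The sketch below relies on two standard facts rather than on the authors' exact display: the supremum over $v\in S^{d-1}$ of $\frac{1}{N}\sum_{i}\varepsilon_{i}\left\langle Y_{i}-\mu,v\right\rangle$ equals the Euclidean norm of $\frac{1}{N}\sum_{i}\varepsilon_{i}(Y_{i}-\mu)$, and by Jensen's inequality its expectation is at most $\sqrt{\mathrm{Tr}(\Sigma)/N}$. The Gaussian data and the diagonal covariance are assumptions made for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 2000
variances = np.array([4.0, 2.0, 1.0, 0.5, 0.5])  # eigenvalues of a diagonal Sigma

norms = np.empty(trials)
for t in range(trials):
    # Centered sample Y_i - mu with covariance diag(variances).
    y_bar = rng.standard_normal((n, variances.size)) * np.sqrt(variances)
    signs = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
    # The supremum over unit vectors v of the signed average is its norm.
    norms[t] = np.linalg.norm((signs[:, None] * y_bar).mean(axis=0))

estimate = float(norms.mean())
bound = float(np.sqrt(variances.sum() / n))
print(f"Monte Carlo estimate: {estimate:.4f}, sqrt(Tr(Sigma)/N): {bound:.4f}")
```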

Note that, when (4.2) holds, we have, for every $v\in S^{d-1}$ ,

[TABLE]

Indeed, this follows from the fact that for every $v\in S^{d-1}$ there are at most $(\varepsilon/8)N$ of the $\overline{Y}_{i}(v)$ that are larger than $Q_{0}$ . If, in addition, the adversary corrupts at most $(\varepsilon/8)N$ of the points $Y_{i}$ , then there are still no more than $(\varepsilon/4)N$ values $\left\langle\widetilde{Y}_{i},v\right\rangle$ that are larger than $\left\langle\mu,v\right\rangle+Q_{0}$ , which suffices for our purposes. And, by the definition of $\varepsilon$ , one has that $\varepsilon/8\geq\eta$ , as required.

Now consider some $Q$ that satisfies $2Q_{0}<Q\leq 4Q_{0}$ , and from here we condition on an event $E$ such that the inequalities (4.2) both hold. By Lemma 1, $E$ occurs with probability at least $1-\exp(-\varepsilon N/2560)$ ; importantly, this event only depends on $Y_{1},\ldots,Y_{N}$ , the first half of the uncontaminated sample.

In particular, on the event $E$ , for every $v\in S^{d-1}$ ,

[TABLE]

and

[TABLE]

By a similar argument one may obtain lower and upper bounds for $\alpha_{v}-\left\langle\mu,v\right\rangle$ . Hence, on $E$ , for every $v\in S^{d-1}$ ,

[TABLE]

Finally, recall that

[TABLE]

and in order to complete the proof of Proposition 1, it suffices to show that $U_{Q}(v)$ is uniformly close to $\left\langle\mu,v\right\rangle$ , with high probability. In particular, the next lemma implies Proposition 1.

Lemma 2.

Let $2Q_{0}\leq Q\leq 4Q_{0}$ . Conditioned on the event $E$ , with probability at least $1-2\exp(-\varepsilon N/2560)$ ,

[TABLE]

Proof. We prove that

[TABLE]

holds with the wanted probability; the proof that

[TABLE]

follows an identical argument and is omitted.

As a first step, note that, in the expression of $U_{Q}(v)$ , the corrupted samples $\widetilde{X}_{i}$ may be harmlessly replaced by their uncorrupted counterparts $X_{i}$ . Indeed, by (4.4), on the event $E$ , the range of the function $\phi_{\alpha_{v}-Q,\beta_{v}+Q}$ is an interval of length at most $10Q$ and therefore, deterministically, for all $v\in S^{d-1}$ ,

[TABLE]
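The replacement step rests on a deterministic fact: after clipping to an interval, changing $k$ of the $N$ sample points moves the average by at most $k/N$ times the length of the interval. The sketch below checks this on made-up data. The interval $[-3,3]$, the corruption value and the name `clipped_mean` are illustrative assumptions, whereas in the proof the interval has length at most $10Q$.

```python
import numpy as np

def clipped_mean(values, low, high):
    """Average of the values after clipping each one to [low, high]."""
    return float(np.clip(values, low, high).mean())

rng = np.random.default_rng(1)
n, n_corrupt = 1000, 30
clean = rng.standard_normal(n)
corrupted = clean.copy()
corrupted[:n_corrupt] = 1e9  # arbitrary adversarial values

low, high = -3.0, 3.0
gap = abs(clipped_mean(corrupted, low, high) - clipped_mean(clean, low, high))
worst_case = (high - low) * n_corrupt / n
print(f"observed gap: {gap:.4f}, worst-case bound: {worst_case:.4f}")
```

Without clipping, the same corruption would shift the plain average by roughly $3\times 10^{7}$, which is why truncation is essential before averaging.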

Once again, recalling that on $E$ (4.4) holds, it follows that

[TABLE]

Since the event $E$ only depends on the uncorrupted sample $Y_{1},\ldots,Y_{N}$ , the right-hand side of the above inequality is independent of $E$ . Thus, writing

[TABLE]

it suffices to prove that, with probability at least $1-2e^{-\varepsilon N/2560}$ ,

[TABLE]

To that end, consider the decomposition

[TABLE]

First, let us bound the term $(1)$ in several steps.

Set

[TABLE]

and note that

[TABLE]

To bound term $(a)$ , recall that $2Q_{0}\leq Q\leq 4Q_{0}$ , implying that $\phi_{-Q_{0},5Q_{0}}(x)\not=\phi_{-3Q,3Q}(x)$ only if

[TABLE]

In both cases

[TABLE]

By Lemma 1, with probability at least $1-\exp(-\varepsilon N/2560)$ ,

[TABLE]

hence, on this event,

[TABLE]

One may control term $(c)$ similarly. For each $v\in S^{d-1}$ ,

[TABLE]

by recalling (4).

The term $(b)$ is controlled using Talagrand’s concentration inequality for the supremum of empirical processes. Note that for every $v\in S^{d-1}$ ,

[TABLE]

Also, since $\phi_{-3Q,3Q}(x)$ is a $1$-Lipschitz function that passes through $0$, by a contraction argument (see Ledoux and Talagrand [18]),

[TABLE]

Hence, by Talagrand’s inequality, with probability at least $1-2\exp(-x)$ ,

[TABLE]

with the choice of $x=\varepsilon N/2560$, recalling the definition of $Q_{0}$, and using that $Q\geq 2Q_{0}$. This concludes the proof that $(1)\leq(1/2+1/32+1/400)\varepsilon Q$ with probability at least $1-e^{-\varepsilon N/2560}$.

Finally, it remains to estimate term $(2)$ :

[TABLE]

Clearly $X_{v}=\left\langle\overline{X},v\right\rangle$ is centered and $\phi_{-Q_{0},5Q_{0}}(X_{v})\not=X_{v}$ only when either $X_{v}\geq 5Q_{0}$ or $X_{v}\leq-Q_{0}$ . Hence,

[TABLE]

by an argument analogous to (2.5) and using (4).

With Proposition 1 proved, let us complete the proof of Theorem 2. Let $i_{0}$ be such that $Q\stackrel{\mathrm{def.}}{=}2^{i_{0}}\in[2Q_{0},4Q_{0})$ and let $E$ be the “good” event that both (4.2) and

[TABLE]

hold. Recall that

[TABLE]

$E$ holds with probability at least $1-\delta$ ; and on $E$ , any point in $\Gamma(2^{i_{0}})$ is within distance $4\varepsilon Q_{0}$ of the mean $\mu$ . Hence, it suffices to show that on the event $E$ , the sets $\Gamma(2^{i})$ for $i\geq i_{0}$ are nested. Indeed, by the definition of $i^{*}$ ,

$\widehat{\mu}\in\bigcap_{i\geq i^{*}}\Gamma(2^{i})\subset\Gamma(2^{i_{0}})~{},$

and thus $\|\widehat{\mu}-\mu\|\leq 4\varepsilon Q_{0}$ .

To see that $\Gamma(2^{i_{0}})\subset\Gamma(2^{i_{0}+1})$ it is enough to show that, for all $v\in S^{d-1}$ , $|\left\langle x,v\right\rangle-U_{2Q}(v)|\leq 4\varepsilon Q$ . But if $x\in\Gamma(v,Q)$ for some $v\in S^{d-1}$ , it follows that

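$|\left\langle x,v\right\rangle-U_{2Q}(v)|\leq|\left\langle x,v\right\rangle-U_{Q}(v)|+|U_{Q}(v)-U_{2Q}(v)|\leq 2\varepsilon Q+|U_{Q}(v)-U_{2Q}(v)|~{};$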

therefore, it suffices to show that $|U_{Q}(v)-U_{2Q}(v)|\leq 2\varepsilon Q$ .

Note that on the event $E$, there are at most $\varepsilon N/4$ sample points $\widetilde{X}_{i}$ such that $\left\langle\widetilde{X}_{i},v\right\rangle$ lies below the level $\alpha_{v}-2^{i_{0}}$ or above the level $\beta_{v}+2^{i_{0}}$. Hence, at most $\varepsilon N/4$ of the summands in $U_{Q}(v)$ and $U_{2Q}(v)$ differ, and so $|U_{Q}(v)-U_{2Q}(v)|$ is at most $(2Q\varepsilon N/4)/N=\varepsilon Q/2$.

By induction, the same argument shows that, on the event $E$ , $\Gamma(2^{i})\subset\Gamma(2^{i+1})$ for every $i\geq i_{0}$ , completing the proof of Theorem 2.

Acknowledgements

We thank the referees and the associate editor for insightful comments and pointing out relevant connections to previous work.

Bibliography


[1] P.J. Bickel. On some robust estimates of location. The Annals of Mathematical Statistics, 36:847–858, 1965.
[2] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[3] M. Chen, C. Gao, and Z. Ren. A general decision theory for Huber’s $\epsilon$-contamination model. Electronic Journal of Statistics, 10(2):3752–3774, 2016.
[4] M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. The Annals of Statistics, 46(5):1932–1960, 2018.
[5] Y. Cherapanamjeri, N. Flammarion, and P. Bartlett. Fast mean estimation with sub-gaussian rates. arXiv preprint arXiv:1902.01998, 2019.
[6] A. Dalalyan and P. Thompson. Outlier-robust estimation of a sparse linear model using $\ell_{1}$-penalized Huber’s $M$-estimator. In Advances in Neural Information Processing Systems, 13188–13198. 2019.
[7] J. Depersin and G. Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. arXiv preprint arXiv:1906.03058, 2019.
[8] L. Devroye, M. Lerasle, G. Lugosi, and R.I. Oliveira. Sub-Gaussian mean estimators. Annals of Statistics, 2016.
