Distributed Statistical Estimation and Rates of Convergence in Normal Approximation

Stanislav Minsker; Nate Strawn

arXiv:1704.02658·math.ST·August 29, 2018

TL;DR

This paper introduces new distributed statistical estimation algorithms leveraging divide-and-conquer strategies, establishing their convergence rates and robustness, with applications to median-of-means and maximum likelihood estimators.

Contribution

It develops novel algorithms for distributed estimation, linking their performance to normal approximation rates, and provides non-asymptotic deviation bounds and limit theorems.

Findings

01. New bounds for median-of-means estimator in distributed settings

02. Performance guarantees for distributed maximum likelihood estimation

03. Robustness of divide-and-conquer algorithms in large systems

Abstract

This paper presents a class of new algorithms for distributed statistical estimation that exploit the divide-and-conquer approach. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important characteristic for large distributed systems. We establish connections between the performance of these distributed algorithms and the rates of convergence in normal approximation, and prove non-asymptotic deviation guarantees, as well as limit theorems, for the resulting estimators. Our techniques are illustrated through several examples: in particular, we obtain new results for the median-of-means estimator, as well as provide performance guarantees for distributed maximum likelihood estimation.

Equations

$$\widehat{\theta}^{(k)}=\mbox{argmin}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(\left|\bar{\theta}_{j}-z\right|\right)$$

$$\widehat{\theta}^{(k)}=\mbox{argmin}_{y\in\mathbb{R}^{m}}\sum_{j=1}^{k}\rho\left(\left\|\bar{\theta}_{j}-y\right\|_{\circ}\right)$$

$$\widehat{\theta}^{(k)}=\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right),$$

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq 2\sigma\sqrt{6e}\sqrt{\frac{k}{N}}$$

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq 3\sigma\left(\frac{\mathbb{E}|X-\theta_{\ast}|^{3}}{\sigma^{3}}\frac{k}{N-k}+\sqrt{\frac{s}{N-k}}\right)$$

$$\mbox{argmin}_{z\in\mathbb{R}^{d}}f(z)=\left\{z\in\mathbb{R}^{d}:\ f(z)\leq f(x)\ \text{for all}\ x\in\mathbb{R}^{d}\right\},$$

$$g_{j}(n_{j}):=\sup_{t\in\mathbb{R}}\left|\mathbb{P}\left(\frac{\bar{\theta}_{j}-\theta_{\ast}}{\sigma_{n_{j}}^{(j)}}\leq t\right)-\Phi(t)\right|\to 0\ \text{as}\ n_{j}\to\infty.$$

$$H_{k}:=\left(\frac{1}{k}\sum_{j=1}^{k}\frac{1}{\sigma_{n_{j}}^{(j)}}\right)^{-1}$$

$$\widehat{\theta}^{(k)}=\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right).$$

$$\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right)\cdot\max_{j=1,\ldots,k}\alpha_{j}<\frac{1}{2}.$$

$$\Phi\left(\zeta_{j}(n_{j},s)/\sigma_{n_{j}}^{(j)}\right)-\frac{1}{2}=\alpha_{j}\cdot\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right).$$

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq\zeta(s):=\max_{j=1,\ldots,k}\zeta_{j}(n_{j},s)$$

$$\zeta(s)\leq 3H_{k}\cdot\frac{1}{k}\sum_{j=1}^{k}\left(g_{j}(n_{j})+\sqrt{\frac{s}{k}}\right).$$

$$H_{k}\leq\frac{k}{\lfloor k/m\rfloor}\cdot\frac{1}{\lfloor k/m\rfloor}\sum_{j=1}^{\lfloor k/m\rfloor}\bar{\sigma}^{(j)}$$

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq\sigma\left(1.43\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}/\sigma^{3}}{n}+3\sqrt{\frac{s}{kn}}\right)$$

$$g_{j}(n)\leq c_{n}=0.4748\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}}{\sigma^{3}\sqrt{n}}$$

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq c_{2}\sigma\left(\frac{\mathbb{E}|X-\theta_{\ast}|^{2+\delta}/\sigma^{2+\delta}}{n^{\frac{1+\delta}{2}}}+\sqrt{\frac{s}{N}}\right).$$

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq\frac{3}{\sqrt{I(\theta_{\ast})}}\left(\frac{\mathfrak{C}}{n}+\frac{c}{\sqrt{n}}\gamma^{n}+\sqrt{\frac{s}{kn}}\right)$$

$$\max_{j=1,\ldots,k}\zeta_{j}(n,s)\leq\frac{3}{\sqrt{nI(\theta_{\ast})}}\left(\frac{\mathfrak{C}}{\sqrt{n}}+c\gamma^{n}+\sqrt{\frac{s}{k}}\right),$$

$$\widehat{\theta}_{\rho}^{(k)}:=\mbox{argmin}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(z-\bar{\theta}_{j}\right).$$

$$\rho_{M}(z)=\begin{cases}z^{2}/2,&|z|\leq M,\\ M|z|-M^{2}/2,&|z|>M,\end{cases}$$

$$H_{k}:=\left(\frac{1}{k}\sum_{j=1}^{k}\frac{1}{\sigma_{n_{j}}^{(j)}}\right)^{-1}$$

$$\max_{j=1,\ldots,k}\alpha_{j}e^{\left(C_{\rho}/\sigma_{n_{j}}^{(j)}\right)^{2}}\frac{1}{k}\sum_{i=1}^{k}\left(\sqrt{\frac{s}{k}}+2g_{i}(n_{i})\right)\leq 0.33.$$

$$\left|\widehat{\theta}_{\rho}^{(k)}-\theta_{\ast}\right|\leq 3H_{k}\max_{j=1,\ldots,k}e^{\left(C_{\rho}/\sigma_{n_{j}}^{(j)}\right)^{2}}\cdot\frac{1}{k}\sum_{i=1}^{k}\left(\sqrt{\frac{s}{k}}+2g_{i}(n_{i})\right)$$

$$\widehat{\sigma}_{n_{1}}=\frac{1}{\Phi^{-1}(0.75)}\,\mbox{med}\left(\left|\bar{\theta}_{1}-\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right)\right|,\ldots,\left|\bar{\theta}_{k}-\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right)\right|\right),$$

$$\widehat{\theta}_{\rho}^{(k)}:=\mbox{argmin}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(\frac{z-\bar{\theta}_{j}}{\widehat{\sigma}_{n}}\right),$$

$$L(z):=\mathbb{E}\rho^{\prime}(z+Z),$$

$$\sqrt{k}\,\frac{\widehat{\theta}_{\rho}^{(k)}-\theta_{\ast}}{\sigma_{n}}\xrightarrow{d}N\left(0,\Delta^{2}\right),$$

$$\sqrt{N}\left(\widehat{\theta}^{(k)}-\theta_{\ast}\right)\xrightarrow{d}N\left(0,\frac{\pi}{2}\sigma^{2}\right).$$

$$\rho^{\prime}(x)=\begin{cases}-1,&x<0,\\ 0,&x=0,\\ 1,&x>0,\end{cases}$$

Full text

Distributed Statistical Estimation and Rates of Convergence in Normal Approximation

Stanislav Minsker (S. Minsker was partially supported by the National Science Foundation grant DMS-1712956.)

Nate Strawn

1 Introduction.

According to (IBM, 2015), “Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos ... to name a few. This data is big data.” Novel scalable and robust algorithms are required to successfully address the challenges posed by big data problems. This paper develops and analyzes techniques that exhibit scalability, a necessary characteristic of modern methods designed to perform statistical analysis of large datasets, as well as robustness that guarantees stable performance of distributed systems when some of the nodes exhibit abnormal behavior.

The computational power of a single computer is often insufficient to store and process modern data sets; instead, data are stored and analyzed in a distributed way by a cluster consisting of several machines. We consider a distributed estimation framework wherein data are assumed to be randomly assigned to computational nodes that produce intermediate results. We assume that no communication between the nodes is allowed at this first stage. At the second stage, these intermediate results are used to compute some statistic on the whole dataset; see figure 1 for a graphical illustration.

Often, such a distributed setting is unavoidable in applications, whence interactions between subsamples stored on different machines are inevitably lost. Most previous research focused on the following question: how significantly does this loss affect the quality of statistical estimation when compared to an “oracle” that has access to the whole sample? The question that we ask in this paper is different: what can be gained from randomly splitting the data across several subsamples? What are the statistical advantages of the divide-and-conquer framework? Our work indicates that one of the key benefits of an appropriate merging strategy is robustness. In particular, the quality of estimation attained by the distributed estimation algorithm is preserved even if a subset of machines stops working properly. At the same time, the resulting estimators admit tight probabilistic guarantees (expressed in the form of exponential concentration inequalities) even when the distribution of the data has heavy tails – a viable model of real-world samples contaminated by outliers.

We establish connections between a class of randomized divide-and-conquer strategies and the rates of convergence in normal approximation. Using these connections, we provide a new analysis of the “median-of-means” estimator which often yields significant improvements over the previously available results. We further illustrate the implications of our results by constructing novel algorithms for distributed Maximum Likelihood Estimation that admit strong performance guarantees under weak assumptions on the underlying distribution.

1.1 Background and related work.

We begin by introducing a simple model for distributed statistical estimation. Let $X_{1},\ldots,X_{N}$ be a sequence of independent random variables with values in a measurable space $(S,\mathcal{S})$ that represent the data available to a statistician. We will assume that $N$ is large, and that the sample $\mathcal{X}=(X_{1},\ldots,X_{N})$ is partitioned into $k$ disjoint subsets $G_{1},\ldots,G_{k}$ of cardinalities $n_{j}:=\mathrm{card}(G_{j})$ respectively, where the partitioning scheme is independent of the data. Let $P_{j}$ be the distribution of $X_{j}$, $j=1,\ldots,N$. The goal is to estimate an unknown parameter $\theta_{\ast}=\theta_{\ast}(P_{j}),\ j=1,\ldots,N$, shared by $P_{1},\ldots,P_{N}$ and taking values in a separable Hilbert space $(\mathbb{H},\|\cdot\|_{\mathbb{H}})$; for example, if $S=\mathbb{H}$, $\theta_{\ast}$ could be the common mean of $X_{1},\ldots,X_{N}$. The distributed estimation protocol proceeds by performing “local” computations on each subset $G_{j},\ j\leq k$, and the local estimators $\bar{\theta}_{j}:=\bar{\theta}_{j}(G_{j}),\ j\leq k$, are then pieced together to produce the final “global” estimator $\widehat{\theta}^{(k)}=\widehat{\theta}^{(k)}(\bar{\theta}_{1},\ldots,\bar{\theta}_{k})$. We are interested in the statistical properties of such distributed estimation protocols, and our main focus is on the final step that combines the local estimators. Let us mention that the condition requiring the sets $G_{j},\ 1\leq j\leq k$, to be disjoint can be relaxed; we discuss the extensions related to U-quantiles in section 2.6 below.

The problem of distributed and communication-efficient statistical estimation has recently received significant attention from the research community. While our review provides only a subsample of the abundant literature in this field, it is important to acknowledge the works by McDonald et al. (2009); Zhang et al. (2012); Fan et al. (2014); Battey et al. (2015); Duchi et al. (2014); Shafieezadeh-Abadeh et al. (2015); Lee et al. (2015); Cheng and Shang (2015); Rosenblatt and Nadler (2016); Zinkevich et al. (2010). Li et al. (2016); Scott et al. (2016); Shang and Cheng (2015); Minsker et al. (2014) have investigated closely related problems for distributed Bayesian inference. Applications to important algorithms such as Principal Component Analysis were investigated in (Fan et al., 2017; Liang et al., 2014), among others. Jordan (2013) provides an overview of recent trends at the intersection of the statistics and computer science communities, and describes popular existing strategies such as the “bag of little bootstraps”, as well as successful applications of the divide-and-conquer paradigm to problems such as matrix factorization.

The majority of the aforementioned works propose averaging of local estimators as a final merging step. Indeed, averaging reduces variance; hence, if the bias of each local estimator is sufficiently small, their average often attains optimal rates of convergence to the unknown parameter $\theta_{\ast}$. For example, when $\theta_{\ast}(P)=\mathbb{E}_{P}X$ is the mean of $X$ and $\bar{\theta}_{j}$ is the sample mean evaluated over the subsample $G_{j},\ j=1,\ldots,k$, then the average of local estimators $\tilde{\theta}=\frac{1}{k}\sum_{j=1}^{k}\bar{\theta}_{j}$ is just the empirical mean evaluated over the whole sample (provided that the subsamples have equal sizes). More generally, it has been shown by Battey et al. (2015); Zhang et al. (2013) that in many problems (for instance, linear regression), $k$ can be taken as large as $O(\sqrt{N})$ without negatively affecting the estimation rates; similar guarantees hold for a variety of M-estimators (see Rosenblatt and Nadler, 2016). However, if the number of nodes $k$ itself is large (the case we are mainly interested in), then the averaging scheme has a drawback: if one or more of the local estimators $\bar{\theta}_{j}$ is anomalous (for example, due to data corruption or a computer system malfunctioning), then the statistical properties of the average will be negatively affected as well. For large distributed systems, this drawback can be costly.
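The drawback of averaging can be seen on a toy example. The sketch below is only an illustration: the five local estimates and the corrupted value `1000.0` are made-up numbers, and the helper names `merge_by_average` and `merge_by_median` are hypothetical.

```python
from statistics import mean, median


def merge_by_average(local_estimates):
    """Merge local estimators by averaging (sensitive to a single bad node)."""
    return mean(local_estimates)


def merge_by_median(local_estimates):
    """Merge local estimators by the sample median (robust alternative)."""
    return median(local_estimates)


# Hypothetical local estimators from k = 5 nodes; the target is near 1.0.
healthy = [0.9, 0.95, 1.0, 1.05, 1.1]
# The same system after one node malfunctions and reports a wild value.
corrupted = [0.9, 0.95, 1000.0, 1.05, 1.1]

# How far each merged estimate moves because of the single corrupted node.
avg_shift = abs(merge_by_average(corrupted) - merge_by_average(healthy))
med_shift = abs(merge_by_median(corrupted) - merge_by_median(healthy))
print(avg_shift, med_shift)
```

In this example the average moves by roughly 200, while the median moves only to a neighboring healthy value.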

One way to address this issue is to replace averaging by a more robust procedure, such as the median or a robust M-estimator; this approach is investigated in the present work. In the univariate case ($\theta_{\ast}\in\mathbb{R}$), the merging strategies we study can be described as solutions of the optimization problem

$$\widehat{\theta}^{(k)}=\mbox{argmin}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(\left|\bar{\theta}_{j}-z\right|\right)\tag{1}$$

for an appropriately defined convex function $\rho$ ; we investigate this class of estimators in detail. A natural extension to the case $\theta_{\ast}\in\mathbb{R}^{m}$ is to consider

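$$\widehat{\theta}^{(k)}=\mbox{argmin}_{y\in\mathbb{R}^{m}}\sum_{j=1}^{k}\rho\left(\left\|\bar{\theta}_{j}-y\right\|_{\circ}\right)$$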

for some convex function $\rho$ and norm $\|\cdot\|_{\circ}$. For example, if $\rho(x)=x$ and $\|\cdot\|_{\circ}$ is the Euclidean norm, then $\widehat{\theta}^{(k)}$ becomes the spatial (also known as geometric or Haldane’s) median (Haldane, 1948; Small, 1990) of $\bar{\theta}_{1},\ldots,\bar{\theta}_{k}$. Since the median remains stable as long as at least half of the nodes in the system perform as expected, such a model for distributed estimation is robust. The merging approach based on the various notions of the multivariate median has been previously considered by Minsker (2015) and Hsu and Sabato (2016); here, we analyze the setting when $\rho(x)=x$ and $\|\cdot\|_{\circ}$ is the $L_{1}$-norm using a novel approach.
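For $\rho(x)=x$ and the $L_{1}$-norm, the objective $\sum_{j}\|\bar{\theta}_{j}-y\|_{1}$ separates across coordinates, so one minimizer is the coordinatewise median. The following minimal sketch illustrates this; the function names `l1_median_merge` and `l1_objective` and the sample vectors are hypothetical.

```python
from statistics import median


def l1_median_merge(local_estimates):
    """Coordinatewise median of equal-length vectors.

    This minimizes sum_j ||theta_j - y||_1 because the objective is a sum of
    independent one-dimensional absolute-deviation problems, one per coordinate.
    """
    return [median(coords) for coords in zip(*local_estimates)]


def l1_objective(y, local_estimates):
    """Value of sum_j ||theta_j - y||_1 at the point y."""
    return sum(
        sum(abs(t - yi) for t, yi in zip(theta, y)) for theta in local_estimates
    )


# Three hypothetical local estimators in R^2; the third has a corrupted first coordinate.
estimates = [[0.0, 1.0], [1.0, 3.0], [10.0, 2.0]]
merged = l1_median_merge(estimates)
print(merged, l1_objective(merged, estimates))
```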

Existing results for the median-based merging strategies have several pitfalls related to the deviation rates, and in most cases known guarantees are suboptimal. In particular, these guarantees suggest that estimators obtained via the median-based approach are very sensitive to the choice of $k$ , the number of partitions. For instance, consider the problem of univariate mean estimation, where $X_{1},\ldots,X_{N}$ are i.i.d. copies of $X\in\mathbb{R}$ , and $\theta_{\ast}=\mathbb{E}X$ is the expectation of $X$ . Assume that $\mathrm{card}(G_{j})\geq n:=\lfloor N/k\rfloor$ for all $j$ , let $\bar{\theta}_{j}=\frac{1}{|G_{j}|}\sum_{i:X_{i}\in G_{j}}X_{i}$ be the empirical mean evaluated over the subsample $G_{j}$ , and define the “median-of-means” estimator via

$$\widehat{\theta}^{(k)}=\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right),\tag{2}$$

where $\mbox{med}\left(\cdot\right)$ is the usual univariate median. This estimator has been introduced by Nemirovski and Yudin (1983) in the context of stochastic optimization, and later appeared in (Jerrum et al., 1986) and (Alon et al., 1996). If $\mbox{Var}(X)=\sigma^{2}<\infty,$ it has been shown (for example, by Lerasle and Oliveira, 2011) that the median-of-means estimator $\widehat{\theta}^{(k)}$ satisfies

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq 2\sigma\sqrt{6e}\sqrt{\frac{k}{N}}\tag{3}$$

with probability $\geq 1-e^{-k}$ . However, this bound, while being the current state of the art, does not tell us what happens at the confidence levels other than $1-e^{-k}$ . For example, if $k=\lfloor\sqrt{N}\rfloor$ , the only conclusion we can make is that $\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\lesssim N^{-1/4}$ with high probability, which is far from the optimal rate $N^{-1/2}$ . And if we want the bound to hold with confidence 99% instead of $1-e^{-\sqrt{N}}$ , then, according to (3), we should take $k=\lfloor\log 100\rfloor+1=5$ , in which case the beneficial effect of parallel computation is very limited. The natural question to ask is the following: is the median-based merging step indeed suboptimal for large values of $k$ (e.g., $k=\lfloor\sqrt{N}\rfloor$ ), or is the problem related to the suboptimality of existing bounds? We claim that in many situations the latter is the case, and that previously known results can be strengthened: for instance, the statement of Corollary 2.6 below implies that whenever $\mathbb{E}|X-\theta_{\ast}|^{3}<\infty$ , the median-of-means estimator satisfies

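$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq 3\sigma\left(\frac{\mathbb{E}|X-\theta_{\ast}|^{3}}{\sigma^{3}}\frac{k}{N-k}+\sqrt{\frac{s}{N-k}}\right)$$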

with probability $\geq 1-4e^{-2s}$ , for all $s\lesssim k$ . In particular, this inequality shows that the estimator (2) has “typical” deviations of order $N^{-1/2}$ whenever $k=O(\sqrt{N})$ , hence the “statistical cost” of employing a large number of computational nodes is minor. Moreover, we will prove that $\sqrt{N}\left(\widehat{\theta}^{(k)}-\theta_{\ast}\right)\xrightarrow{d}N\left(0,\frac{\pi}{2}\sigma^{2}\right)$ if $k\to\infty$ and $k=o(\sqrt{N})$ as $N\to\infty$ . It will also be demonstrated that improved bounds hold in other important scenarios, such as maximum likelihood estimation, even when the subgroups have different sizes and the observations are not identically distributed.
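A minimal sketch of the median-of-means estimator (2) is given below. It assumes a random partition into $k$ groups whose sizes differ by at most one, so that each group has at least $\lfloor N/k\rfloor$ elements; the function name `median_of_means` and the `seed` parameter are illustrative choices, not part of the paper.

```python
import random
from statistics import mean, median


def median_of_means(sample, k, seed=0):
    """Randomly split `sample` into k disjoint groups and return the median of group means."""
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and len(sample)")
    shuffled = list(sample)
    # The shuffle uses its own generator, so the partition does not depend on the data values.
    random.Random(seed).shuffle(shuffled)
    # Strided slices give k disjoint groups, each of size at least floor(N / k).
    groups = [shuffled[j::k] for j in range(k)]
    return median(mean(g) for g in groups)


# Hypothetical sample: 99 clean observations and a single gross outlier.
sample = [1.0] * 99 + [1e6]
print(mean(sample), median_of_means(sample, k=11))
```

The single outlier can spoil at most one of the 11 group means, so the median of the group means stays at the clean value while the overall mean does not.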

1.2 Organization of the paper.

Section 1.3 describes notation used throughout the paper. Sections 2 and 3 present main results and examples for the cases of univariate and multivariate parameter respectively. Outcomes of numerical simulation are discussed in section 4, and proofs of the main results are contained in section 5.

1.3 Notation.

Everywhere below, $\|\cdot\|_{1}$ and $\|\cdot\|_{2}$ stand for the $L_{1}$ and $L_{2}$ norms of a vector, and $\|\cdot\|$ stands for the operator norm of a matrix (its largest singular value).

Given a probability measure $P$ , $\mathbb{E}_{P}(\cdot)$ will stand for the expectation with respect to $P$ , and we will write $\mathbb{E}(\cdot)$ when $P$ is clear from the context. Convergence in distribution will be denoted by $\xrightarrow{d}$ .

For two sequences $\left\{a_{j}\right\}_{j\geq 1}\subset\mathbb{R}$ and $\left\{b_{j}\right\}_{j\geq 1}\subset\mathbb{R}$ for $j\in\mathbb{N}$ , the expression $a_{j}\lesssim b_{j}$ means that there exists a constant $c>0$ such that $a_{j}\leq cb_{j}$ for all $j\in\mathbb{N}$ . Absolute constants will be denoted $c,C,c_{1}$ , etc., and may take different values in different parts of the paper. For a function $f:\mathbb{R}^{d}\mapsto\mathbb{R}$ , we define

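$$\mbox{argmin}_{z\in\mathbb{R}^{d}}f(z)=\left\{z\in\mathbb{R}^{d}:\ f(z)\leq f(x)\ \text{for all}\ x\in\mathbb{R}^{d}\right\},$$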

and $\|f\|_{\infty}:=\mathrm{ess\,sup}\{|f(x)|:\,x\in\mathbb{R}^{d}\}$ . Finally, $f_{+}(x)=\lim_{t\searrow 0}\frac{f(x+t)-f(x)}{t}$ and $f_{-}(x)=\lim_{t\nearrow 0}\frac{f(x+t)-f(x)}{t}$ will denote the right and left derivatives of $f$ respectively (whenever these limits exist). Additional notation and auxiliary results are introduced on demand for the proofs in section 5.

1.4 Main results.

As we have argued above, existing guarantees for the estimator (2) are sensitive to the choice of $k$ , the number of partitions. In the following sections, we demonstrate that these bounds are often suboptimal, and show that large values of $k$ often do not have a significant negative effect on the statistical performance of resulting algorithms.

The key observation underlying the subsequent exposition is the following: assume that the “local estimators” $\bar{\theta}_{j},\ 1\leq j\leq k$, are asymptotically normal with asymptotic mean equal to $\theta_{\ast}$. In particular, the distributions of the $\bar{\theta}_{j}$’s are approximately symmetric, with $\theta_{\ast}$ being the center of symmetry. The location parameter of a symmetric distribution admits many robust estimators of the form (1), the sample median being a notable example.

This intuition allows us to establish a parallel between the non-asymptotic deviation guarantees for distributed estimation procedures of the form (1) and the degree of symmetry of “local” estimators quantified by the rates of convergence to normal approximation. Results for the univariate case are presented in section 2, and extensions to the multivariate case are presented in section 3.

2 The univariate case.

We assume that $X_{1},\ldots,X_{N}$ is a collection of independent (but not necessarily identically distributed) $S$-valued random variables with distributions $P_{1},\ldots,P_{N}$ respectively. The data are partitioned into disjoint groups $G_{1},\ldots,G_{k}$ of cardinalities $n_{j}:=\mathrm{card}(G_{j})$, $j=1,\ldots,k$, such that $\sum_{j=1}^{k}n_{j}=N$. Let $\bar{\theta}_{j}:=\bar{\theta}_{j}(G_{j}),\ 1\leq j\leq k$, be a sequence of independent estimators of the parameter $\theta_{\ast}\in\mathbb{R}$ shared by $P_{1},\ldots,P_{N}$. Our main assumption will be that $\bar{\theta}_{1},\ldots,\bar{\theta}_{k}$ are asymptotically normal as quantified by the following condition.

Assumption 1

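Let $\Phi(t)$ be the cumulative distribution function of the standard normal random variable $Z\sim N(0,1)$. For each $j=1,\ldots,k$, there exists a sequence $\{\sigma^{(j)}_{n}\}_{n\in\mathbb{N}}\subset\mathbb{R}_{+}$ such that

$$g_{j}(n_{j}):=\sup_{t\in\mathbb{R}}\left|\mathbb{P}\left(\frac{\bar{\theta}_{j}-\theta_{\ast}}{\sigma_{n_{j}}^{(j)}}\leq t\right)-\Phi(t)\right|\to 0\ \text{as}\ n_{j}\to\infty.$$

Clearly, the functions $g_{j}(n_{j})$ control the rate of convergence of the estimators $\bar{\theta}_{1},\ldots,\bar{\theta}_{k}$ to the normal law. Furthermore, let

$$H_{k}:=\left(\frac{1}{k}\sum_{j=1}^{k}\frac{1}{\sigma_{n_{j}}^{(j)}}\right)^{-1}$$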

be the harmonic mean of $\sigma_{n_{j}}^{(j)}$ ’s, and set $\alpha_{j}=\frac{H_{k}}{\sigma_{n_{j}}^{(j)}}$ . Note that $\sum_{j=1}^{k}\alpha_{j}=k$ , and that $\alpha_{1}=\ldots=\alpha_{k}=1$ if $\sigma_{n_{1}}^{(1)}=\ldots=\sigma_{n_{k}}^{(k)}$ .
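The quantities $H_{k}$ and $\alpha_{j}$ are straightforward to compute. The short sketch below (the function name `harmonic_mean_and_weights` and the input scales are hypothetical) also checks the identity $\sum_{j}\alpha_{j}=k$ numerically.

```python
def harmonic_mean_and_weights(sigmas):
    """Return H_k, the harmonic mean of the scales, and alpha_j = H_k / sigma_j."""
    k = len(sigmas)
    h_k = k / sum(1.0 / s for s in sigmas)
    alphas = [h_k / s for s in sigmas]
    return h_k, alphas


# Hypothetical normalizing scales sigma_{n_j}^{(j)} for k = 3 nodes.
h_k, alphas = harmonic_mean_and_weights([1.0, 2.0, 4.0])
print(h_k, alphas, sum(alphas))
```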

2.1 Merging procedure based on the median.

In this subsection, we establish guarantees for the merging procedure based on the sample median, namely,

$$\widehat{\theta}^{(k)}=\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right).$$

This case is treated separately due to its practical importance, the fact that we can obtain better numerical constants, and a conceptually simpler proof.

Theorem 1

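Assume that $s>0$ and $n_{j}=\mathrm{card}(G_{j}),\ j=1,\ldots,k$, are such that

$$\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right)\cdot\max_{j=1,\ldots,k}\alpha_{j}<\frac{1}{2}.\tag{4}$$

Moreover, let assumption 1 be satisfied, and let $\zeta_{j}(n_{j},s)$ solve the equation

$$\Phi\left(\zeta_{j}(n_{j},s)/\sigma_{n_{j}}^{(j)}\right)-\frac{1}{2}=\alpha_{j}\cdot\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right).$$

Then for all $s$ satisfying (4),

$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq\zeta(s):=\max_{j=1,\ldots,k}\zeta_{j}(n_{j},s)$$

with probability at least $1-4e^{-2s}$.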

Proof 2.2.

See section 5.2.

The following lemma yields a more explicit form of the bound and numerical constants.

Lemma 2.3.

Assume that $\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right)\cdot\max\limits_{j=1,\ldots,k}\alpha_{j}\leq 0.33$ . Then

$$\zeta(s)\leq 3H_{k}\cdot\frac{1}{k}\sum_{j=1}^{k}\left(g_{j}(n_{j})+\sqrt{\frac{s}{k}}\right).$$

Proof 2.4.

See section 5.7.
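To make the roles of Theorem 1 and Lemma 2.3 concrete, the sketch below solves the equation for $\zeta_{j}(n_{j},s)$ through the normal quantile function, $\zeta_{j}=\sigma_{n_{j}}^{(j)}\,\Phi^{-1}\left(\frac{1}{2}+\alpha_{j}\cdot\frac{1}{k}\sum_{i}\left(g_{i}(n_{i})+\sqrt{s/k}\right)\right)$, and compares $\zeta(s)$ with the explicit bound of the lemma. The function name `zeta_bound` and the numerical inputs are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def zeta_bound(sigmas, g_values, s):
    """Return (zeta(s), 3 * H_k * average) for given scales and normal-approximation errors."""
    k = len(sigmas)
    h_k = k / sum(1.0 / sig for sig in sigmas)
    alphas = [h_k / sig for sig in sigmas]
    # Average of g_i(n_i) + sqrt(s / k) over the k groups.
    avg = sum(g + sqrt(s / k) for g in g_values) / k
    if avg * max(alphas) >= 0.5:
        raise ValueError("condition (4) of Theorem 1 is not satisfied")
    inv_cdf = NormalDist().inv_cdf  # standard normal quantile function
    zeta = max(sig * inv_cdf(0.5 + a * avg) for sig, a in zip(sigmas, alphas))
    return zeta, 3.0 * h_k * avg


# Hypothetical setting: k = 100 groups, equal scales 0.1, g_j = 0.02, s = 1.
zeta, explicit = zeta_bound([0.1] * 100, [0.02] * 100, s=1.0)
print(zeta, explicit)
```

The explicit bound is slightly larger than the exact solution, which is the price paid for the closed form.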

Remark 2.5.

Let $\bar{\sigma}^{(1)}\leq\ldots\leq\bar{\sigma}^{(k)}$ be the non-decreasing rearrangement of $\sigma_{n_{1}}^{(1)},\ldots,\sigma_{n_{k}}^{(k)}$ . It is easy to see that the harmonic mean $H_{k}$ of $\sigma_{n_{1}}^{(1)},\ldots,\sigma_{n_{k}}^{(k)}$ satisfies

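$$H_{k}\leq\frac{k}{\lfloor k/m\rfloor}\cdot\frac{1}{\lfloor k/m\rfloor}\sum_{j=1}^{\lfloor k/m\rfloor}\bar{\sigma}^{(j)}$$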

for any integer $1\leq m\leq k$ , hence, informally speaking, the deviations of $\widehat{\theta}^{(k)}$ are controlled by the smallest $\sigma_{n_{j}}^{(j)}$ ’s rather than the largest.
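A quick numerical check of the inequality in Remark 2.5 is sketched below, with hypothetical scales in which half of the nodes are very noisy; the helper names are illustrative.

```python
def harmonic_mean(sigmas):
    """Harmonic mean H_k of the scales."""
    return len(sigmas) / sum(1.0 / s for s in sigmas)


def remark_upper_bound(sigmas, m):
    """Right-hand side of the inequality: (k / L) times the mean of the L smallest scales, L = floor(k / m)."""
    k = len(sigmas)
    num_smallest = k // m
    smallest = sorted(sigmas)[:num_smallest]
    return (k / num_smallest) * (sum(smallest) / num_smallest)


# Hypothetical scales: three accurate nodes and three very noisy ones.
sigmas = [0.1, 0.1, 0.1, 5.0, 50.0, 500.0]
h = harmonic_mean(sigmas)
bound = remark_upper_bound(sigmas, m=2)
print(h, bound)
```

Both quantities are close to 0.2 here, far below the largest scale 500, in line with the remark.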

2.2 Example: new bounds for the median-of-means estimator.

The univariate mean estimation problem is pervasive in statistics, and serves as a building block of more advanced methods such as empirical risk minimization. Early works on robust mean estimation include Tukey’s “trimmed mean” (Tukey and Harris, 1946), as well as “winsorized mean” (Bickel et al., 1965); also see discussion in (Bubeck et al., 2013). These techniques often produce estimators with significant bias. A different approach based on M-estimation was suggested by O. Catoni (Catoni, 2012); Catoni’s estimator yields almost optimal constants, however, its construction requires additional information about the variance or the kurtosis of the underlying distribution; moreover, its computation is not easily parallelizable, therefore this technique cannot be easily employed in the distributed setting.

Here, we will focus on a fruitful idea that is commonly referred to as the “median-of-means” estimator, formally defined in equation (2) above. Several refinements and extensions of this estimator to higher dimensions have been recently introduced by Minsker (2015); Hsu and Sabato (2013); Devroye et al. (2016); Joly et al. (2016); Lugosi and Mendelson (2017). Advantages of this method include the fact that it can be implemented in parallel and does not require prior knowledge of any parameters of the distribution (e.g., its variance). The following result for the median-of-means estimator is a corollary of Theorem 1; for brevity, we treat only the i.i.d. case. Recall that $n=\lfloor N/k\rfloor$ and $\mathrm{card}(G_{j})\geq n,\ j=1,\ldots,k$.

Corollary 2.6.

Let $X_{1},\ldots,X_{N}$ be a sequence of i.i.d. copies of a random variable $X\in\mathbb{R}$ such that $\mathbb{E}X=\theta_{\ast}$ , $\mbox{Var}(X)=\sigma^{2}$ , $\mathbb{E}|X-\theta_{\ast}|^{3}<\infty$ , and set $c_{n}=0.4748\frac{\mathbb{E}|X-\theta_{\ast}|^{3}}{\sigma^{3}\sqrt{n}}$ . Then for all $s>0$ such that $c_{n}+\sqrt{\frac{s}{k}}\leq 0.33$ , the estimator $\widehat{\theta}^{(k)}$ defined in (2) satisfies

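$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq\sigma\left(1.43\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}/\sigma^{3}}{n}+3\sqrt{\frac{s}{kn}}\right)$$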

with probability at least $1-4e^{-2s}$ .

Remark 2.7.

The term $1.43\sigma\frac{\mathbb{E}\left|X-\theta_{\ast}\right|^{3}/\sigma^{3}}{n}$ can be thought of as the “bias” due to asymmetry of the distribution of the sample mean. Note that whenever $k\lesssim\sqrt{N}$ (so that $n\gtrsim\sqrt{N}$ ), the right-hand side of the inequality above is of order $(kn)^{-1/2}\simeq N^{-1/2}$ .
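The bound of Corollary 2.6 can be evaluated numerically. The sketch below is illustrative: the function name `mom_deviation_bound` and the inputs (including the value 1.6 used as a rough stand-in for the third absolute moment of a standard normal variable) are assumptions.

```python
from math import exp, sqrt


def mom_deviation_bound(N, k, s, sigma, third_abs_moment):
    """Return (deviation bound, confidence level) from Corollary 2.6."""
    n = N // k  # n = floor(N / k)
    ratio = third_abs_moment / sigma**3
    c_n = 0.4748 * ratio / sqrt(n)
    if c_n + sqrt(s / k) > 0.33:
        raise ValueError("the condition c_n + sqrt(s / k) <= 0.33 fails")
    bound = sigma * (1.43 * ratio / n + 3.0 * sqrt(s / (k * n)))
    confidence = 1.0 - 4.0 * exp(-2.0 * s)
    return bound, confidence


# Hypothetical setting: N = 10^6 observations split across k = 1000 nodes.
bound, confidence = mom_deviation_bound(
    N=10**6, k=1000, s=3.0, sigma=1.0, third_abs_moment=1.6
)
print(bound, confidence)
```

With these inputs the bias term is about 0.0023 and the stochastic term about 0.0052, so the total is of order $N^{-1/2}=10^{-3}$ up to constants.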

Proof 2.8.

It follows from the Berry-Esseen Theorem (fact 1 in section 5.1) that assumption 1 is satisfied with $\sigma_{n}^{(1)}=\ldots=\sigma_{n}^{(k)}=\frac{\sigma}{\sqrt{n}}$ , and

$$g_{j}(n)\leq c_{n}=0.4748\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}}{\sigma^{3}\sqrt{n}}$$

for all $j$ . Lemma 2.3 implies that $\max_{j}\zeta_{j}(n,s)\leq 3\frac{\sigma}{\sqrt{n}}\left(c_{n}+\sqrt{s/k}\right)$ , and the claim follows from Theorem 1.
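For completeness, the constant 1.43 in Corollary 2.6 can be traced by substituting the definition of $c_{n}$ into the bound from Lemma 2.3; the display below is a worked expansion of that step.

```latex
\begin{align*}
3\,\frac{\sigma}{\sqrt{n}}\left(c_{n}+\sqrt{\frac{s}{k}}\right)
  &= 3\cdot 0.4748\,\sigma\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}/\sigma^{3}}{n}
     + 3\sigma\sqrt{\frac{s}{kn}} \\
  &= 1.4244\,\sigma\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}/\sigma^{3}}{n}
     + 3\sigma\sqrt{\frac{s}{kn}} \\
  &\leq \sigma\left(1.43\,\frac{\mathbb{E}|X-\theta_{\ast}|^{3}/\sigma^{3}}{n}
     + 3\sqrt{\frac{s}{kn}}\right).
\end{align*}
```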

For distributions with infinite third moment, the rate of convergence in the Berry-Esseen type bound is slower, and the following result holds instead.

Corollary 2.9.

Let $X_{1},\ldots,X_{N}$ be a sequence of i.i.d. copies of a random variable $X\in\mathbb{R}$ such that $\mathbb{E}X=\theta_{\ast}$ , $\mbox{Var}(X)=\sigma^{2}$ , $\mathbb{E}|X-\theta_{\ast}|^{2+\delta}<\infty$ for some $\delta\in(0,1]$ . Then there exist absolute constants $c_{1},c_{2}>0$ such that for all $s>0$ and $k$ satisfying $\frac{\mathbb{E}|X-\theta_{\ast}|^{2+\delta}}{\sigma^{2+\delta}n^{\delta/2}}+\sqrt{\frac{s}{k}}\leq c_{1}$ , the following inequality holds with probability at least $1-4e^{-2s}$ :

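$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq c_{2}\sigma\left(\frac{\mathbb{E}|X-\theta_{\ast}|^{2+\delta}/\sigma^{2+\delta}}{n^{\frac{1+\delta}{2}}}+\sqrt{\frac{s}{N}}\right).$$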

In this case, typical deviations of $\widehat{\theta}^{(k)}$ are still of order $N^{-1/2}$ as long as $k\lesssim N^{\delta/(1+\delta)}$ . The proof of this result follows from fact 2 in section 5.1 in the same way as Corollary 2.6 was deduced from the Berry-Esseen bound.
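The threshold $k\lesssim N^{\delta/(1+\delta)}$ can be checked directly. Treating $n$ as $N/k$ (ignoring rounding), the first term of the bound in Corollary 2.9 is of order $N^{-1/2}$ exactly when:

```latex
\begin{align*}
n^{-\frac{1+\delta}{2}} \lesssim N^{-1/2}
  &\iff \left(\frac{k}{N}\right)^{\frac{1+\delta}{2}} \lesssim N^{-1/2} \\
  &\iff k^{\frac{1+\delta}{2}} \lesssim N^{\frac{\delta}{2}} \\
  &\iff k \lesssim N^{\frac{\delta}{1+\delta}}.
\end{align*}
```

For $\delta=1$ this recovers the condition $k\lesssim\sqrt{N}$ of Remark 2.7.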

2.3 Example: distributed maximum likelihood estimation.

Let $X_{1},\ldots,X_{N}$ be i.i.d. copies of a random vector $X\in\mathbb{R}^{d}$ with distribution $P_{\theta_{\ast}}$ , where $\theta_{\ast}\in\Theta\subseteq\mathbb{R}$ . Assume that for each $\theta\in\Theta$ , $P_{\theta}$ is absolutely continuous with respect to a $\sigma$ -finite measure $\mu$ , and let $p_{\theta}=\frac{dP_{\theta}}{d\mu}$ be the corresponding density. In this section, we state sufficient conditions for assumption 1 to be satisfied when $\bar{\theta}_{1},\ldots,\bar{\theta}_{k}$ are the maximum likelihood estimators (van der Vaart, 1998) of $\theta_{\ast}$ . Conditions stated below were obtained by Pinelis (2016). All derivatives below (denoted by ′) are taken with respect to $\theta$ , unless noted otherwise.

Assume that the log-likelihood function $\ell_{x}(\theta)=\log p_{\theta}(x)$ satisfies the following:

(1) $[\theta_{\ast}-\delta,\theta_{\ast}+\delta]\subseteq\Theta$ for some $\delta>0$;

(2) “standard regularity conditions” that allow differentiation under the expectation: assume that $\mathbb{E}\ell^{\prime}_{X}(\theta_{\ast})=0$, and that the Fisher information $\mathbb{E}\ell^{\prime}_{X}(\theta_{\ast})^{2}=-\mathbb{E}\ell^{\prime\prime}_{X}(\theta_{\ast}):=I(\theta_{\ast})$ is finite;

(3) $\mathbb{E}\left|\ell^{\prime}_{X}(\theta_{\ast})\right|^{3}+\mathbb{E}\left|\ell^{\prime\prime}_{X}(\theta_{\ast})\right|^{3}<\infty$;

(4) for $\mu$-almost all $x$, $\ell_{x}(\theta)$ is three times differentiable for $\theta\in[\theta_{\ast}-\delta,\theta_{\ast}+\delta]$, and $\mathbb{E}\sup_{|\theta-\theta_{\ast}|\leq\delta}\left|\ell_{X}^{\prime\prime\prime}(\theta)\right|^{3}<\infty$;

(5) $\mathbb{P}{\left(|\bar{\theta}_{1}-\theta_{\ast}|\geq\delta\right)}\leq c\gamma^{n}$ for some positive constants $c$ and $\gamma\in[0,1)$.

In turn, condition (5) above is implied by the following two inequalities (see Pinelis, 2016, section 6.2, for detailed discussion and examples):

1. $H^{2}(\theta,\theta_{\ast})\geq 2-\frac{2}{\left(1+c_{0}(\theta-\theta_{\ast})^{2}\right)^{\gamma}}$, where $H(\theta_{1},\theta_{2})=\sqrt{\int_{\mathbb{R}^{d}}\left(\sqrt{p_{\theta_{1}}}-\sqrt{p_{\theta_{2}}}\right)^{2}d\mu}$ is the Hellinger distance, and $c_{0},\gamma$ are positive constants;

2. $I(\theta)\leq c_{1}+c_{2}\left|\theta\right|^{\alpha}$ for some positive constants $c_{1},c_{2}$ and $\alpha$ and all $\theta\in\Theta$.

Corollary 2.10.

Assume that conditions (1)-(5) are satisfied, and that $\mathrm{card}(G_{j})\geq n=\lfloor N/k\rfloor,\ j=1,\ldots,k$ . Then for all $s>0$ such that $\frac{\mathfrak{C}}{\sqrt{n}}+c\gamma^{n}+\sqrt{\frac{s}{k}}\leq 0.33$ ,

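$$\left|\widehat{\theta}^{(k)}-\theta_{\ast}\right|\leq\frac{3}{\sqrt{I(\theta_{\ast})}}\left(\frac{\mathfrak{C}}{n}+\frac{c}{\sqrt{n}}\gamma^{n}+\sqrt{\frac{s}{kn}}\right)$$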

with probability at least $1-4e^{-2s}$ , where $\mathfrak{C}$ is a positive constant that depends only on $\{P_{\theta}\}_{\theta\in[\theta_{\ast}-\delta,\theta_{\ast}+\delta]}$ .

Proof 2.11.

It follows from results in (Pinelis, 2016), in particular equation (5.5), that whenever conditions (1)-(5) hold, assumption 1 is satisfied for all $j$ with $\sigma_{n}^{(j)}=\left(nI(\theta_{\ast})\right)^{-1/2}$ , where $I(\theta_{\ast})$ is the Fisher information, and $g_{j}(n)\leq\frac{\mathfrak{C}}{\sqrt{n}}+c\gamma^{n}$ , where $\mathfrak{C}$ is a constant that depends only on $\{P_{\theta}\}_{\theta\in[\theta_{\ast}-\delta,\theta_{\ast}+\delta]}$ . Lemma 2.3 implies that

[TABLE]

and the claim follows from Theorem 1.

Remark 2.12.

The results of this section can be extended to M-estimators other than MLEs, as Bentkus et al. (1997) have shown that M-estimators satisfy a variant of the Berry-Esseen bound under rather general conditions.

2.4 Merging procedures based on robust M-estimators.

In this subsection, we study the family of merging procedures based on the M-estimators

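$$\widehat{\theta}^{(k)}_{\rho}:=\mathop{\mbox{argmin}}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(\left|\bar{\theta}_{j}-z\right|\right).$$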

The sample median $\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right)$ corresponds to the choice of (non-smooth) $\rho(x)=|x|$ and was treated separately above; here, it will be assumed that $\rho$ is a convex, even, differentiable function such that $\rho(z)\to\infty$ as $|z|\to\infty$ and $\|\rho^{\prime}\|_{\infty}<\infty$ . A particular example of such a function is Huber’s loss

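$$\rho_{M}(z)=\frac{z^{2}}{2}\,I\left\{|z|\leq M\right\}+M\left(|z|-\frac{M}{2}\right)I\left\{|z|>M\right\},\tag{6}$$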

where $M$ is a positive constant. The following result quantifies non-asymptotic performance of the estimator $\widehat{\theta}^{(k)}_{\rho}$ . As before, we set

[TABLE]

where $\sigma_{n}^{(j)}$ ’s are defined in assumption 1. Moreover, given the loss $\rho$ as above, let $C_{\rho}>0$ be such that $|\rho^{\prime}(x)|\geq\frac{\|\rho^{\prime}\|_{\infty}}{2}$ for $|x|>C_{\rho}$ .

Theorem 2.13.

Let assumption 1 be satisfied, and suppose that $s>0$ and $n_{1},\ldots,n_{k}$ are such that

[TABLE]

Then for all $s$ satisfying (8),

[TABLE]

with probability at least $1-4e^{-2s}$ .

Proof 2.14.

See section 5.3.

Note that the bound depends on $\rho$ only through $\max_{j=1,\ldots,k}e^{\left(C_{\rho}/\sigma_{n_{j}}^{(j)}\right)^{2}}$ . Assume for concreteness that $n_{1}=\ldots=n_{k}=\lfloor N/k\rfloor$ , and that $\rho=\rho_{M}$ is Huber’s loss defined in (6), so that $C_{\rho}=M/2$ . For $\max_{j=1,\ldots,k}e^{\left(C_{\rho}/\sigma_{n_{j}}^{(j)}\right)^{2}}$ to be bounded above by an absolute constant, one should choose $M$ to be of order $\min_{j=1,\ldots,k}\sigma_{n_{j}}^{(j)}$ . While the latter quantity is typically unknown, it can be estimated in some cases. For example, if the data are i.i.d. then $\sigma_{n_{j}}^{(j)}=\sqrt{\mbox{Var}\left(\bar{\theta}_{1}\right)}$ for all $j$ . Since $\bar{\theta}_{j}$ ’s are approximately normal, their standard deviation can be estimated by the median absolute deviation as

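$$\widehat{\sigma}:=\frac{1}{\Phi^{-1}(0.75)}\,\mbox{med}\left(\left|\bar{\theta}_{j}-\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right)\right|,\ j=1,\ldots,k\right),$$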

where the factor $1/\Phi^{-1}(0.75)$ is introduced to make the estimator consistent (Hampel et al., 2011); another possibility is to use bootstrap (Ghosh et al., 1984).
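The data-driven choice of $M$ discussed above can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the function names are hypothetical, the groups are taken as contiguous blocks (which matches a random partition only when the data are exchangeable), the subgroup estimators are sample means, and the M-estimator is found by bisection on the monotone score $z\mapsto\sum_{j}\rho_{M}^{\prime}(z-\bar{\theta}_{j})$, which is one of several possible numerical methods.

```python
import statistics
from statistics import NormalDist


def block_means(data, k):
    """Means over k disjoint contiguous groups of size n = len(data) // k."""
    n = len(data) // k
    return [statistics.fmean(data[j * n:(j + 1) * n]) for j in range(k)]


def mad_scale(values):
    """Median absolute deviation, rescaled by 1 / Phi^{-1}(0.75)."""
    values = list(values)
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    return mad / NormalDist().inv_cdf(0.75)


def huber_psi(x, m):
    """Derivative of Huber's loss: the identity clipped to [-m, m]."""
    return max(-m, min(m, x))


def huber_merge(estimates, m, iterations=200):
    """Bisection for a root of z -> sum_j psi(z - estimate_j), a nondecreasing function."""
    lo, hi = min(estimates), max(estimates)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(huber_psi(mid - e, m) for e in estimates) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def huber_median_of_means(data, k):
    """Merge the group means with Huber's loss, with M set to the MAD-based scale."""
    means = block_means(data, k)
    return huber_merge(means, mad_scale(means))
```

A fixed number of bisection steps is used instead of a tolerance so that the loop terminates regardless of the magnitude of the inputs. If the rescaled MAD equals zero the score vanishes identically and the routine degenerates, so a practical implementation would need to guard against that case.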

2.5 Asymptotic results.

In this section, we complement the previously discussed non-asymptotic deviation bounds for $\widehat{\theta}^{(k)}_{\rho}$ with asymptotic results. For the sake of clarity, we state the complete list of assumptions below:

(a) $X_{1},\ldots,X_{N}$ are i.i.d., $n=\lfloor N/k\rfloor$ and $\mathrm{card}(G_{j})=n,\ j=1,\ldots,k$ ; a result for non-identically distributed data is presented in Appendix A;

(b) Assumption 1 is satisfied for some function $g(n)$ (note that there is no dependence on index $j$ due to the i.i.d. assumption);

(c) $k$ and $n$ are such that $k\to\infty$ and $\sqrt{k}\cdot g(n)\to 0$ as $N\to\infty$ ;

(d) $\rho$ is a convex, even function, such that $\rho(z)\to\infty$ as $|z|\to\infty$ and $\|\rho^{\prime}\|_{\infty}<\infty$ (here, $\rho^{\prime}(x)$ is defined as the average of the right and left derivatives of $\rho$ at $x$ );

(e) $\widehat{\theta}^{(k)}_{\rho}$ is defined as

[TABLE]

where $\sigma_{n}^{(1)}=\ldots=\sigma_{n}^{(k)}\equiv\sigma_{n}$ is a normalizing sequence from assumption 1 (our definition of the estimator is slightly different from the one in section 2.4, which allows us to keep $\rho$ fixed as $k$ and $n$ change).

For $z\in\mathbb{R}$ , define

$$L(z):=\mathbb{E}\rho^{\prime}(z+Z),$$

where $Z\sim N(0,1)$ . Note that, since $\rho$ is differentiable almost everywhere, $L(z)=\mathbb{E}\rho^{\prime}_{-}(z+Z)=\mathbb{E}\rho^{\prime}_{+}(z+Z)$ .

Theorem 2.15.

Under assumptions (a)-(e) above,

[TABLE]

where $\Delta^{2}=\frac{\mathbb{E}\left(\rho^{\prime}(Z)\right)^{2}}{\left(L^{\prime}(0)\right)^{2}}$ .

Proof 2.16.

See section 5.4.

For example, if $\rho(x)=|x|$ , Theorem 2.15 implies that under appropriate assumptions, the median-of-means estimator $\widehat{\theta}^{(k)}$ defined in (2) satisfies

[TABLE]

Indeed, in this case $\sigma_{n}=\sigma/\sqrt{n}$ , where $\sigma^{2}=\mbox{Var}(X_{1})$ , and

[TABLE]

hence a simple calculation yields $\Delta^{2}=1/(L^{\prime}(0))^{2}=\pi/2$ .

If we consider the mean estimation problem with Huber’s loss $\rho_{M}(x)$ (6) instead of $\rho(x)=|x|$ , we similarly deduce that

[TABLE]

and we get the well-known (Huber, 1964) expression $\Delta^{2}=\frac{\int_{-M}^{M}x^{2}d\Phi(x)+2M^{2}(1-\Phi(M))}{\left(2\Phi(M)-1\right)^{2}}$ ; in particular, $\Delta^{2}\to 1$ as $M\to\infty$ , and the convergence is fast. For instance, $\Delta^{2}\simeq 1.01$ for $M=2$ and $\Delta^{2}\simeq 1.0004$ for $M=3$ .
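The closed-form expression for $\Delta^{2}$ can be evaluated directly. The sketch below (with hypothetical function names) relies on the identity $\int_{-M}^{M}x^{2}d\Phi(x)=\left(2\Phi(M)-1\right)-2M\phi(M)$, which follows from integration by parts and is an added step rather than part of the original text; $\phi$ denotes the standard normal density.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def huber_delta_sq(m):
    """Asymptotic variance Delta^2 for Huber's loss with threshold m > 0."""
    central_mass = 2.0 * std_normal_cdf(m) - 1.0  # equals L'(0) = P(|Z| <= m)
    # Integration by parts: int_{-m}^{m} x^2 dPhi(x) = P(|Z| <= m) - 2 m phi(m).
    truncated_second_moment = central_mass - 2.0 * m * std_normal_pdf(m)
    tail_term = 2.0 * m * m * (1.0 - std_normal_cdf(m))
    return (truncated_second_moment + tail_term) / central_mass ** 2
```

Evaluating `huber_delta_sq` on a grid of thresholds shows the two regimes discussed in this section: as $M\to 0$ the value approaches $\pi/2$, the constant obtained for $\rho(x)=|x|$, and as $M\to\infty$ it decreases to $1$.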

Remark 2.17.

The key assumptions in the list (a)-(e) governing the regime of growth of $k$ and $n$ are (b) and (c). For instance, if the random variables possess finite moments of order $(2+\delta)$ for some $\delta\in(0,1]$ , then it follows from fact 2 in section 5.1 that $\sqrt{k}\,g(n)\to 0\text{ if }k=o\left(N^{\frac{\delta}{1+\delta}}\right)$ as $N\to\infty$ .

2.6 Connections to U-quantiles.

In this section, we discuss connections of proposed algorithms to U-quantiles and the assumption requiring the groups $G_{1},\ldots,G_{k}$ to be disjoint. We assume that the data $X_{1},\ldots,X_{N}$ are i.i.d. with common distribution $P$ , and let $\theta_{\ast}=\theta_{\ast}(P)\in\mathbb{R}$ be a real-valued parameter of interest. It is clear that the estimators produced by distributed algorithms considered above depend on the random partition of the sample. A natural way to avoid such dependence is to consider the U-quantile (in this case, the median)

[TABLE]

where $\mathcal{A}_{N}^{(n)}:=\left\{J:\ J\subseteq\{1,\ldots,N\},\mathrm{card}(J)=n:=\lfloor N/k\rfloor\right\}$ is the collection of all distinct subsets of $\{1,\ldots,N\}$ of cardinality $n$ , and $\bar{\theta}_{J}:=\bar{\theta}(X_{j},\ j\in J)$ is an estimator of $\theta_{\ast}$ based on $\{X_{j},\ j\in J\}$ . For instance, when $\mathrm{card}(J)=2$ and $\bar{\theta}_{J}=\frac{1}{\mathrm{card}(J)}\sum_{j\in J}X_{j}$ , $\widetilde{\theta}^{(k)}$ is the well-known Hodges-Lehmann estimator of the location parameter, see (Hodges and Lehmann, 1963; Lehmann and D’Abrera, 2006); for a comprehensive study of U-quantiles, see (Arcones, 1996). The main result of this section is an analogue of Theorem 1 for the estimator $\widetilde{\theta}^{(k)}$ ; it implies that theoretical guarantees for the performance of $\widetilde{\theta}^{(k)}$ are at least as good as for the estimator $\widehat{\theta}^{(k)}$ . Since the data are i.i.d., it is enough to impose assumption 1 on $\bar{\theta}\left(X_{1},\ldots,X_{n}\right)$ only, hence we drop the index $j$ and denote the normalizing sequence by $\{\sigma_{n}\}_{n\in\mathbb{N}}$ and the corresponding error function by $g(n)$ .

Theorem 2.18.

Assume that $s>0$ and $n=\lfloor N/k\rfloor$ are such that

[TABLE]

Moreover, let assumption 1 be satisfied, and let $\zeta(n,s)$ solve the equation

[TABLE]

Then for any $s$ satisfying (10),

[TABLE]

with probability at least $1-4e^{-2s}$ .

Proof 2.19.

See section 5.5. As before, a more explicit form of the bound immediately follows from Lemma 2.3.

A drawback of the estimator $\widetilde{\theta}^{(k)}$ is the fact that its exact computation requires evaluation of $N\choose n$ estimators $\bar{\theta}_{J}$ over subsamples $\left\{\{X_{j},\ j\in J\},\ J\in\mathcal{A}_{N}^{(n)}\right\}$ . For large $N$ and $n$ , this task becomes intractable. However, an approximate result can be obtained by choosing $\ell$ subsets $J_{1},\ldots,J_{\ell}$ from $\mathcal{A}_{N}^{(n)}$ uniformly at random, and setting $\widetilde{\theta}_{\ell}^{(k)}:=\mbox{med}\left(\bar{\theta}_{J_{1}},\ldots,\bar{\theta}_{J_{\ell}}\right)$ . Typically, the error $\left|\widetilde{\theta}_{\ell}^{(k)}-\widetilde{\theta}^{(k)}\right|$ is of order $\ell^{-1/2}$ with high probability over the random draw of $J_{1},\ldots,J_{\ell}$ .
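Both the exact U-quantile and its randomized approximation are easy to express in code. The following sketch uses hypothetical function names and assumes that the subsample estimator is the sample mean by default and that the $\ell$ random subsets are drawn independently of each other (so the same subset may appear more than once); neither choice is prescribed by the text.

```python
import itertools
import random
import statistics


def u_median_exact(data, n, estimator=statistics.fmean):
    """Median of estimator(subsample) over all subsets of size n; small inputs only."""
    return statistics.median(
        [estimator(subset) for subset in itertools.combinations(data, n)]
    )


def u_median_random(data, n, num_subsets, rng, estimator=statistics.fmean):
    """Median of estimator(subsample) over num_subsets uniformly random subsets of size n."""
    return statistics.median(
        [estimator(rng.sample(data, n)) for _ in range(num_subsets)]
    )


# Usage: with n = 2 and the sample mean, the exact version is the median of
# pairwise averages over distinct pairs, as in the Hodges-Lehmann example.
sample = [1.0, 2.0, 3.0, 10.0]
hodges_lehmann = u_median_exact(sample, 2)
approximation = u_median_random(sample, 2, 200, random.Random(0))
```

The exact version enumerates all $N\choose n$ subsets and is therefore practical only for very small $N$ or $n$; the randomized version costs $\ell$ evaluations of the subsample estimator.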

We note that Theorem 2.13 admits a similar extension for the estimator defined as

[TABLE]

Namely, if the data are i.i.d., then under the assumptions of section 2.4,

[TABLE]

with probability at least $1-4e^{-2s}$ , whenever $s>0$ and $n=\lfloor N/k\rfloor$ are such that

[TABLE]

We omit the proof of (11) since the required modifications in the argument of Theorem 2.13 are exactly the same as those explained in the proof of Theorem 2.18.

3 Estimation in higher dimensions.

In this section, it will be assumed that $\theta_{\ast}\in\mathbb{R}^{m},\ m\geq 2$ , is a vector-valued parameter of interest. Let $X_{1},\ldots,X_{N}$ be independent $S$ -valued random variables that are randomly partitioned into disjoint groups $G_{1},\ldots,G_{k}$ of cardinality $n=\lfloor N/k\rfloor$ each. Let $\bar{\theta}_{j}:=\bar{\theta}_{j}(G_{j})\in\mathbb{R}^{m},\ 1\leq j\leq k$ be a sequence of estimators of $\theta_{\ast}$ , the common parameter of the distributions of $X_{j}$ ’s. Assume that $\rho_{1},\ldots,\rho_{m}$ are convex, even functions such that $\rho_{i}(z)\to\infty$ as $|z|\to\infty$ and $\|\rho_{i}^{\prime}\|_{\infty}<\infty$ , with $\rho_{i}^{\prime}(x)$ defined as the average of the right and left derivatives of $\rho_{i}$ , $i=1,\ldots,m$ , and let

[TABLE]

where $z=(z_{1},\ldots,z_{m})$ and $\bar{\theta}_{j}=(\bar{\theta}_{j,1},\ldots,\bar{\theta}_{j,m})$ for $1\leq j\leq k$ .

For the sake of clarity, we will assume below that $X_{1},\ldots,X_{N}$ are i.i.d. However, results can be easily extended to the case of non-identically distributed data in a manner described in section 2.4. Assumption 1 will be required to hold coordinatewise, namely, we will assume that there exist sequences $\{\sigma_{n,i}\}_{n\in\mathbb{N}}\subset\mathbb{R}_{+},\ i=1,\ldots,m$ , such that

[TABLE]

Note that the maximum over the second index $j$ disappears due to the i.i.d. assumption.

Theorem 3.20.

Let $C_{\rho_{i}}>0$ be such that $|\rho_{+,i}^{\prime}(x)|\geq\frac{\|\rho_{+,i}^{\prime}\|_{\infty}}{2}$ and $|\rho_{-,i}^{\prime}(x)|\geq\frac{\|\rho_{-,i}^{\prime}\|_{\infty}}{2}$ for $|x|>C_{\rho_{i}},\ i=1,\ldots,m$ . Let assumption 1 hold for each coordinate of $\bar{\theta}_{1}$ , and suppose that $s>0$ and $n=\lfloor N/k\rfloor$ are such that

[TABLE]

Then for all $s$ satisfying (13) and all $1\leq i\leq m$ simultaneously,

[TABLE]

with probability at least $1-4me^{-2s}$ .

Proof 3.21.

See section 5.6.

3.1 Example: multivariate median-of-means estimator.

Consider the special case of Theorem 3.20 when $\theta_{\ast}=\mathbb{E}X$ is the mean of $X\in\mathbb{R}^{m}$ , $\bar{\theta}_{j}(X):=\frac{1}{|G_{j}|}\sum_{X_{i}\in G_{j}}X_{i}$ is the sample mean evaluated over the subsample $G_{j}$ , and $\rho_{i}(x)=|x|$ for all $i$ . In this case, $\widehat{\theta}^{(k)}$ becomes the spatial median with respect to the $L_{1}$ -norm, namely,

[TABLE]
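Because the $L_{1}$ objective $\sum_{j}\|z-\bar{\theta}_{j}\|_{1}=\sum_{i}\sum_{j}|z_{i}-\bar{\theta}_{j,i}|$ separates across coordinates, a minimizer can be computed as the coordinate-wise median of the group means. The sketch below illustrates this; the function name and the use of a shuffled list to produce the random partition are assumptions made for the example.

```python
import random
import statistics


def multivariate_median_of_means(points, k, rng):
    """Coordinate-wise median of k group means after a random partition into equal groups."""
    shuffled = list(points)
    rng.shuffle(shuffled)  # random partition into disjoint groups
    n = len(shuffled) // k
    dim = len(shuffled[0])
    group_means = [
        [statistics.fmean([p[i] for p in shuffled[j * n:(j + 1) * n]]) for i in range(dim)]
        for j in range(k)
    ]
    # The L1 objective separates across coordinates, so each one is handled independently.
    return [statistics.median([gm[i] for gm in group_means]) for i in range(dim)]
```

When $k$ is even, `statistics.median` returns the midpoint of the two central values, which is one particular choice among the minimizers of the $L_{1}$ objective.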

The problem of finding the mean estimator that admits sub-Gaussian concentration around $\mathbb{E}X$ under weak moment assumptions on the underlying distribution has recently been investigated in several works. For instance, Joly et al. (2016) construct an estimator that admits “almost optimal” behavior under the assumption that the entries of $X$ possess 4 moments. Recently, Lugosi and Mendelson (2017, 2018) proposed new estimators that attain optimal bounds and require existence of only 2 moments. More specifically, the aforementioned papers show that, for any $s$ such that $\frac{2}{N}<e^{-s}<1$ , there exists an estimator $\hat{\theta}_{(s)}$ such that with probability at least $1-C_{1}e^{-s}$ ,

[TABLE]

where $C_{1},C_{2}>0$ are numerical constants, $\Sigma$ is the covariance matrix of $X$ , $\mbox{tr}\,(\Sigma)$ is its trace, and $\lambda_{\mathrm{max}}(\Sigma)$ is its largest eigenvalue. However, the construction of these estimators depends explicitly on the desired confidence level $s$ , and (more importantly) they are numerically difficult to compute.

On the other hand, Theorem 3.20 demonstrates that performance of the multivariate median-of-means estimator is robust with respect to the choice of the number of subgroups $k$ , and the resulting deviation bounds hold simultaneously over the range of confidence parameter $s$ whenever the coordinates of $X$ possess $2+\delta$ moments for some $\delta>0$ . The following corollary summarizes these claims.

Corollary 3.22.

Let $X_{1},\ldots,X_{N}$ be i.i.d. random vectors such that $\theta_{\ast}=\mathbb{E}X_{1}$ is the unknown mean, $\Sigma=\mathbb{E}\left[(X_{1}-\theta_{\ast})(X_{1}-\theta_{\ast})^{T}\right]$ is the covariance matrix, $\sigma_{i}^{2}=\Sigma_{i,i}$ , and $\max_{i=1,\ldots,m}\mathbb{E}|X_{1,i}|^{2+\delta}<\infty$ for some $\delta\in(0,1]$ . Then there exist absolute constants $c_{1},c_{2}>0$ such that for all $s>0$ and $k$ satisfying

[TABLE]

with probability at least $1-4me^{-2s}$ for all $i=1,\ldots,m$ simultaneously,

[TABLE]

Proof 3.23.

It follows from fact 2 in section 5.1 that $g_{m}(n)$ can be bounded as

[TABLE]

for an absolute constant $A>0$ . Moreover, it is easy to see that $C_{\rho_{i}}=0$ for all $i$ and that assumption 1 holds with $\sigma_{n,i}=\frac{\sigma_{i}}{\sqrt{n}}$ . Now the claim immediately follows from Theorem 3.20.

Remark 3.24.

Estimator (15) admits a natural generalization of the form

[TABLE]

where $\|\cdot\|_{\circ}$ is a norm in $\mathbb{R}^{m}$ and $\rho$ is a convex, non-decreasing function. For example, if $\|\cdot\|_{\circ}$ is the Euclidean norm, the resulting estimator is invariant with respect to orthogonal transformations. However, the available performance guarantees for this estimator hold under stronger assumptions (such as joint asymptotic normality of the coordinates of $\bar{\theta}_{j}$ ’s instead of coordinate-wise asymptotic normality), and exhibit suboptimal dependence on the dimension; these results, along with the discussion of relevant numerical methods, are presented in Appendix C. A complete characterization of the effect of the norm $\|\cdot\|_{\circ}$ on the geometry of the problem and performance of the corresponding estimator (16) warrants further study.

4 Simulation results.

We illustrate results of the previous sections with numerical simulations that compare performance of the median-of-means estimator with the usual sample mean, see figure 2 below.

Moreover, we compared the theoretical guarantees for the median-of-means estimator (described in section 2.2) against the empirical outcomes for the Lomax distribution with shape parameter $\alpha=4$ and scale parameter $\lambda=1$ ; the corresponding probability density function is

$$f(x)=\frac{\alpha}{\lambda}\left(1+\frac{x}{\lambda}\right)^{-(\alpha+1)},\quad x\geq 0.$$

In particular, the Lomax distribution with $\alpha=4$ and $\lambda=1$ has mean $1/3$ and median $\sqrt[4]{2}-1\approx 0.1892$ . Since the mean and median do not coincide, the error of the median-of-means estimator has a significant bias component for large values of $k$ . Figure 3 depicts the impact of the bias beyond $k=\sqrt{N}$ (equivalently, $\log_{N}k=1/2$ ), and also the fact that the median error is mostly flat for $k<\sqrt{N}$ .
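The source of the bias can be reproduced with a short experiment. The sketch below uses the standard Lomax distribution function $F(x)=1-(1+x/\lambda)^{-\alpha}$ to obtain the mean, the median and an inverse-transform sampler; the function names are hypothetical and this is not the code used to produce the figures. At the extreme $k=N$ every group has size one, so the median-of-means estimator reduces to the sample median and targets $\sqrt[4]{2}-1$ rather than $1/3$.

```python
import random
import statistics


def lomax_mean(alpha, lam):
    """Mean lam / (alpha - 1), finite for alpha > 1."""
    return lam / (alpha - 1.0)


def lomax_median(alpha, lam):
    """Median lam * (2^(1/alpha) - 1), obtained by solving F(x) = 1/2."""
    return lam * (2.0 ** (1.0 / alpha) - 1.0)


def lomax_sample(alpha, lam, size, rng):
    """Inverse-transform sampling: F^{-1}(u) = lam * ((1 - u)^(-1/alpha) - 1)."""
    return [lam * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0) for _ in range(size)]


def median_of_means(data, k):
    """Median of the means over k disjoint contiguous groups of equal size."""
    n = len(data) // k
    return statistics.median(
        [statistics.fmean(data[j * n:(j + 1) * n]) for j in range(k)]
    )


# Usage: compare the estimator across several numbers of groups k.
rng = random.Random(0)
draws = lomax_sample(4.0, 1.0, 10000, rng)
estimates = {k: median_of_means(draws, k) for k in (1, 10, 100, 10000)}
```

Comparing the entries of `estimates` with `lomax_mean(4.0, 1.0)` and `lomax_median(4.0, 1.0)` gives a quick view of how the estimator moves away from the mean as $k$ grows.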

Finally, we assessed empirical coverage of the confidence intervals constructed using Theorem 2.15 and centered at the median-of-means estimator; results are presented in figure 4. The sample of size $N=10^{4}$ was generated from the half-t distribution with $3$ degrees of freedom; recall that a random variable $\xi$ has half-t distribution with $\nu$ degrees of freedom if $\xi\stackrel{{\scriptstyle\mathrm{d}}}{{=}}|\eta|$ where $\eta$ has the usual t-distribution with $\nu$ degrees of freedom. It is clear that the half-t distribution is both asymmetric and heavy-tailed. Each sample was further corrupted by outliers sampled from the normal distribution with mean [math] and standard deviation $10^{5}$ ; the number of outliers ranged from [math] to $\sqrt{N}=100$ with increments of $20$ . The median-of-means estimator was constructed for $k=\sqrt{N}=100$ . For comparison, we present empirical coverage levels attained by the sample mean in the same framework.
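For $\rho(x)=|x|$, Theorem 2.15 suggests an interval of the form $\widehat{\theta}^{(k)}\pm z_{1-\alpha/2}\sqrt{\pi/2}\,\sigma_{n}/\sqrt{k}$. The text does not specify how $\sigma_{n}$ was estimated in the experiment, so the sketch below makes an explicit assumption: it plugs in the rescaled median absolute deviation of the group means, as suggested in section 2.4. The function name is hypothetical and the groups are contiguous blocks.

```python
import math
import statistics
from statistics import NormalDist


def mom_confidence_interval(data, k, level=0.95):
    """Return (lower, center, upper) for an asymptotic interval around the median of means."""
    n = len(data) // k
    means = [statistics.fmean(data[j * n:(j + 1) * n]) for j in range(k)]
    center = statistics.median(means)
    # Assumption: sigma_n is estimated by the rescaled MAD of the group means.
    mad = statistics.median([abs(m - center) for m in means])
    sigma_n_hat = mad / NormalDist().inv_cdf(0.75)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Limiting variance pi/2 corresponds to rho(x) = |x| in Theorem 2.15.
    half_width = z * math.sqrt(math.pi / 2.0) * sigma_n_hat / math.sqrt(k)
    return center - half_width, center, center + half_width
```

Repeating this computation over many simulated samples and recording how often the interval contains the true mean gives an empirical coverage level; the result depends on the plug-in scale estimate chosen here.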

5 Proofs

In this section, we present the proofs of the main results.

5.1 Preliminaries.

We recall several facts that are used in the proofs below. The following bound has been established by A. Berry (Berry, 1941) and C.-G. Esseen (Esseen, 1942). A version with an explicit constant given below is due to Shevtsova (2011).

Fact 1 (Berry-Esseen bound).

Assume that $Y_{1},\ldots,Y_{n}$ is a sequence of i.i.d. copies of a random variable $Y$ with mean $\mu$ , variance $\sigma^{2}$ and such that $\mathbb{E}|Y|^{3}<\infty$ . Then

[TABLE]

where $\bar{Y}=\frac{1}{n}\sum_{j=1}^{n}Y_{j}$ and $\Phi(s)$ is the cumulative distribution function of the standard normal random variable.

The following generalization of the Berry-Esseen bound is due to Petrov (1995).

Fact 2 (Generalization of Berry-Esseen bound).

Assume that $Y_{1},\ldots,Y_{n}$ is a sequence of i.i.d. copies of a random variable $Y$ with mean $\mu$ , variance $\sigma^{2}$ and such that $\mathbb{E}|Y|^{2+\delta}<\infty$ for some $\delta\in(0,1]$ . Then there exists an absolute constant $A>0$ such that

[TABLE]

Next, we recall a well-known concentration inequality.

Fact 3 (Bounded difference inequality).

Let $X_{1},\ldots,X_{k}$ be i.i.d. random variables, and assume that $Z=g(X_{1},\ldots,X_{k})$ , where $g$ is such that for all $j=1,\ldots,k$ and all $x_{1},x_{2},\ldots,x_{j},x_{j}^{\prime},\ldots,x_{k}$ ,

[TABLE]

Then

[TABLE]

and

[TABLE]

Finally, we recall the definition of a U-statistic. Let $h:\mathbb{R}^{n}\mapsto\mathbb{R}$ be a measurable function of $n$ variables, and

[TABLE]

A U-statistic of order $n$ with kernel $h$ based on the i.i.d. sample $X_{1},\ldots,X_{N}$ is defined as (Hoeffding, 1948)

[TABLE]

Clearly, $\mathbb{E}U_{N}(h)=\mathbb{E}h(X_{1},\ldots,X_{n})$ ; moreover, $U_{N}(h)$ has the smallest variance among all unbiased estimators. The following analogue of fact 3 holds for U-statistics:

Fact 4 (Concentration inequality for U-statistics, (Hoeffding, 1963)).

Assume that the kernel $h$ satisfies $\left|h(x_{1},\ldots,x_{n})\right|\leq M$ for all $x_{1},\ldots,x_{n}$ . Then for all $s>0$ ,

[TABLE]

5.2 Proof of Theorem 1.

Observe that

[TABLE]

Let $\Phi^{(n_{j},j)}(\cdot)$ be the distribution function of $\bar{\theta}_{j}-\theta_{\ast},\ j=1,\ldots,k,$ and let $\widehat{\Phi}_{k}(\cdot)$ be the empirical distribution function corresponding to the sample $W_{1}=\bar{\theta}_{1}-\theta_{\ast},\ldots,W_{k}=\bar{\theta}_{k}-\theta_{\ast}$ , that is,

[TABLE]

Suppose that $z\in\mathbb{R}$ is fixed, and note that $\widehat{\Phi}_{k}(z)$ is a function of the random variables $W_{1},\ldots,W_{k}$ , and $\mathbb{E}\widehat{\Phi}_{k}(z)=\frac{1}{k}\sum_{j=1}^{k}\Phi^{(n_{j},j)}(z)$ . Moreover, the hypothesis of the bounded difference inequality (fact 3) is satisfied with $c_{j}=1/k$ for $j=1,\ldots,k$ , and therefore it implies that

[TABLE]

on the draw of $W_{1},\ldots,W_{k}$ with probability $\geq 1-2e^{-2s}$ .

Let $z_{1}\geq z_{2}$ be such that $\frac{1}{k}\sum_{j=1}^{k}\Phi^{(n_{j},j)}(z_{1})\geq\frac{1}{2}+\sqrt{\frac{s}{k}}$ and $\frac{1}{k}\sum_{j=1}^{k}\Phi^{(n_{j},j)}(z_{2})\leq\frac{1}{2}-\sqrt{\frac{s}{k}}$ . Applying (17) for $z=z_{1}$ and $z=z_{2}$ together with the union bound, we see that for $j=1,2$ ,

[TABLE]

on an event $\mathcal{E}$ of probability $\geq 1-4e^{-2s}$ . It follows that on $\mathcal{E}$ , $\widehat{\Phi}_{k}(z_{1})\geq 1/2$ and $1-\widehat{\Phi}_{k}(z_{2})\geq 1/2$ simultaneously, hence

[TABLE]

by the definition of the median. It remains to estimate $z_{1}$ and $z_{2}$ . Assumption 1 implies that

[TABLE]

Hence, it suffices to find $z_{1}$ such that $\frac{1}{k}\sum_{j=1}^{k}\Phi\left(\frac{z_{1}}{\sigma_{n_{j}}^{(j)}}\right)\geq\frac{1}{2}+\sqrt{\frac{s}{k}}+\frac{1}{k}\sum_{j=1}^{k}g_{j}(n_{j})$ . Recall that $\alpha_{j}=\frac{1/\sigma_{n_{j}}^{(j)}}{\frac{1}{k}\sum_{i=1}^{k}1/\sigma_{n_{i}}^{(i)}},\ j=1,\ldots,k$ , and let $\zeta_{j}(n_{j},s)$ be the solution of the equation

[TABLE]

Note that $\zeta_{j}(n_{j},s)$ always exists since $\alpha_{j}\cdot\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right)<\frac{1}{2}$ by assumption. Finally, since $\sum_{j=1}^{k}\alpha_{j}=k$ , it is clear that any

[TABLE]

satisfies the requirements. Similarly,

[TABLE]

by assumption 1, hence it is sufficient to choose $z_{2}$ such that $z_{2}\leq\min_{j=1,\ldots,k}\tilde{\zeta}_{j}(n_{j},s)$ , where $\tilde{\zeta}_{j}(n_{j},s)$ satisfies $\Phi\left(\tilde{\zeta}_{j}(n_{j},s)/\sigma_{n_{j}}^{(j)}\right)-\frac{1}{2}=-\alpha_{j}\cdot\frac{1}{k}\sum_{i=1}^{k}\left(g_{i}(n_{i})+\sqrt{\frac{s}{k}}\right)$ . Noting that $\tilde{\zeta}_{j}(n_{j},s)=-\zeta_{j}(n_{j},s)$ and recalling (18), we conclude that

[TABLE]

with probability at least $1-4e^{-2s}$ .

5.3 Proof of Theorem 2.13.

We will use notation as in the proof of Theorem 1. Clearly, $\widehat{\theta}_{\rho}^{(k)}$ satisfies the equation $G(\widehat{\theta}_{\rho}^{(k)})=0$ , where

[TABLE]

Suppose $z_{1},z_{2}$ are such that $G(z_{1})>0$ and $G(z_{2})<0$ . Since $G$ is increasing, it is easy to see that $\widehat{\theta}_{\rho}^{(k)}\in(z_{2},z_{1})$ . To find such $z_{1}$ and $z_{2}$ , we proceed in 3 steps.

(a) First, observe that the bounded difference inequality (fact 3) implies that for any fixed $z\in\mathbb{R}$ ,

[TABLE]

with probability $\geq 1-2e^{-2s}$ .

(b) Next, we will find an upper bound for

[TABLE]

where $Z_{j}\sim N\left(\theta_{\ast},\left(\sigma_{n_{j}}^{(j)}\right)^{2}\right),\ j=1,\ldots,k$ are independent. Note that for any bounded non-negative function $f:\mathbb{R}\mapsto\mathbb{R}_{+}$ and a signed measure $Q$ ,

[TABLE]

Since any bounded function $f$ can be written as $f=\max(f,0)-\max(-f,0)$ , we deduce that

[TABLE]

Moreover, if $f$ is monotone, the sets $\{x:\,f(x)\geq t\}$ and $\{x:\,f(x)\leq t\}$ are half-intervals. Applying this to $f=\rho^{\prime}$ and $Q=\Phi^{(n_{j},j)}-\Phi$ , we deduce that

[TABLE]

by assumption 1.

(c) It remains to find $z_{1}$ satisfying

[TABLE]

Let $\tilde{z}_{1}:=z_{1}-\theta_{\ast}$ and $\tilde{Z}_{j}:=Z_{j}-\theta_{\ast}$ . Since $\sum_{j=1}^{k}\alpha_{j}=k$ (where $\alpha_{j}$ ’s were defined in (7)), it suffices to find $z_{1}$ such that $\mathbb{E}\rho^{\prime}\left(\tilde{z}_{1}-\tilde{Z}_{j}\right)>\alpha_{j}\left\|\rho^{\prime}\right\|_{\infty}\left(\sqrt{\frac{s}{k}}+\frac{2}{k}\sum_{i=1}^{k}g_{i}(n_{i})\right)$ for all $j$ . For any bounded function $h$ such that $h(-x)=-h(x)$ and $h(x)\geq 0$ for $x\geq 0$ , and any $z\geq 0$ ,

[TABLE]

where $\phi(x)=(2\pi)^{-1/2}e^{-x^{2}/2}$ . Recall that $C_{\rho}>0$ is such that $|\rho^{\prime}(x)|\geq\|\rho^{\prime}\|_{\infty}/2$ for $|x|\geq C_{\rho}$ . It follows that

[TABLE]

where $Z\sim N(0,1)$ . Next, Lemma B.29 implies that

[TABLE]

Combining the previous two bounds, we deduce that it suffices to find $\tilde{z}_{1}>0$ such that

[TABLE]

By our assumptions, $\max_{j=1,\ldots,k}\alpha_{j}\,e^{\left(C_{\rho}/\sigma_{n_{j}}^{(j)}\right)^{2}}\left(\sqrt{\frac{s}{k}}+\frac{2}{k}\sum_{i=1}^{k}g_{i}(n_{i})\right)\leq 0.33$ . Lemma 2.3 yields that it suffices to take

[TABLE]

The estimate for $z_{2}$ follows the same pattern, and yields that one can take $z_{2}$ as

[TABLE]

implying the claim.

5.4 Proof of Theorem 2.15.

Recall that $L(z)=\mathbb{E}\rho^{\prime}(z+Z)$ for $Z\sim N(0,1)$ , and note that under our assumptions, equation $L(z)=0$ has a unique solution $z=0$ (even if $\rho$ is not strictly convex). Next, observe that

[TABLE]

hence it suffices to show that both the left-hand side and the right-hand side of the inequality above converge to $1-\Phi(t)$ for all $t$ . We will outline the argument for the left-hand side, and the remaining part is proven in a similar fashion. Note that

[TABLE]

where $Y_{n,j}=\rho^{\prime}_{-}\left(\frac{\theta_{\ast}-\bar{\theta}_{j}+\frac{t\Delta\sigma_{n}}{\sqrt{k}}}{\sigma_{n}}\right)$ .

Lemma 5.25.

Under the assumptions of Theorem 2.15, $\sqrt{k}\mathbb{E}Y_{n,1}\to t\,\Delta\,L^{\prime}(0)$ and

[TABLE]

where $Z\sim N(0,1)$ .

Proof 5.26 (of Lemma 5.25).

Let $Z\sim N(0,1)$ . Since $\rho$ is convex, its derivative $\rho^{\prime}:=(\rho^{\prime}_{+}+\rho^{\prime}_{-})/2$ is monotone and continuous almost everywhere (with respect to Lebesgue measure). Together with the assumption that $\|\rho^{\prime}\|_{\infty}<\infty$ , the Lebesgue dominated convergence theorem implies that

[TABLE]

Next, we will prove the assertion that $\sqrt{k}\mathbb{E}Y_{n,1}\to t\,\Delta\,L^{\prime}(0)$ . It is easy to see that

[TABLE]

Reasoning as in the proof of Theorem 2.13 (see step (b) in section 5.3), we deduce that

[TABLE]

where $g(n)$ is the function from assumption 1. Hence, recalling that $g(n)\sqrt{k}\to 0$ as $N\to\infty$ , we obtain that

[TABLE]

On the other hand, it follows from (20) that for $t\neq 0$

[TABLE]

For $t=0$ , it is also clear that $\mathbb{E}\rho^{\prime}\left(Z\right)=0$ . To establish the fact that $\sqrt{\mbox{Var}\left(Y_{n,1}\right)}\to\sqrt{\mathbb{E}\left(\rho^{\prime}(Z)\right)^{2}}$ , note that weak convergence of $\frac{\bar{\theta}_{1}-\theta_{\ast}}{\sigma_{n}}$ to the normal law (assumption 1) together with the Lebesgue dominated convergence theorem implies that

[TABLE]

Since $L^{\prime}(0)>0$ , we deduce that

[TABLE]

and the claim follows.

Lemma 5.25 implies that $-\frac{\sqrt{k}\,\mathbb{E}Y_{n,1}}{\sqrt{\mbox{Var}\left(Y_{n,1}\right)}}\xrightarrow{N\to\infty}t$ . It remains to apply Lindeberg’s Central Limit Theorem (Serfling, 1981, Theorem 1.9.3) to $Y_{n,j}$ ’s to deduce the result from equation (19). To this end, we only need to verify the Lindeberg condition requiring that for any $\varepsilon>0$ ,

[TABLE]

However, since $\rho^{\prime}(\cdot)$ (and hence $Y_{n,1}$ ) is bounded, (21) easily follows.

5.5 Proof of Theorem 2.18.

The argument is similar to the proof of Theorem 1. Let $\Phi^{(n)}(\cdot)$ be the distribution function of $\frac{\bar{\theta}_{1}-\theta_{\ast}}{\sigma_{n}}$ , and let $\widehat{\Phi}_{N\choose n}(\cdot)$ be the empirical distribution function corresponding to the sample $\left\{W_{J}=\frac{\bar{\theta}_{J}-\theta_{\ast}}{\sigma_{n}},\ J\in\mathcal{A}_{N}^{(n)}\right\}$ of size $N\choose n$ .

Suppose that $z\in\mathbb{R}$ is fixed, and note that $\widehat{\Phi}_{N\choose n}(z)$ is a U-statistic with mean $\Phi^{(n)}(z)$ . We will apply the concentration inequality for U-statistics (fact 4) with $M=1$ to get that

[TABLE]

with probability $\geq 1-2e^{-2s}$ ; here, we also used the fact that $n=\lfloor N/k\rfloor$ .

Let $z_{1}\geq z_{2}$ be such that $\Phi^{(n)}(z_{1})\geq\frac{1}{2}+\sqrt{\frac{s}{k}}$ and $\Phi^{(n)}(z_{2})\leq\frac{1}{2}-\sqrt{\frac{s}{k}}$ . Applying (22) for $z=z_{1}$ and $z=z_{2}$ together with the union bound, we see that for $j=1,2$ ,

[TABLE]

on an event $\mathcal{E}$ of probability $\geq 1-4e^{-2s}$ . It follows that on $\mathcal{E}$ , $\mbox{med}\left(W_{J},J\in\mathcal{A}_{N}^{(n)}\right)\in[z_{2},z_{1}]$ . The rest of the proof repeats the argument of section 5.2.

5.6 Proof of Theorem 3.20.

Set $F(z):=\sum_{j=1}^{k}\sum_{i=1}^{m}\rho_{i}(z_{i}-\bar{\theta}_{j,i})$ . Then $\widehat{\theta}^{(k)}=\mathop{\mbox{argmin}}_{z\in\mathbb{R}^{m}}F(z)$ by definition. Since $F(z)$ is convex, a necessary and sufficient condition for $\widehat{\theta}^{(k)}$ to be its minimizer is that $0\in\partial F(\widehat{\theta}^{(k)})$ , the subdifferential of $F$ at the point $\widehat{\theta}^{(k)}$ . It is easy to see that

[TABLE]

where $\rho^{\prime}_{+,i}(x):=\lim_{t\searrow 0}\frac{\rho_{i}(x+t)-\rho_{i}(x)}{t}$ and $\rho^{\prime}_{-,i}(x):=\lim_{t\nearrow 0}\frac{\rho_{i}(x+t)-\rho_{i}(x)}{t}$ are the right and left derivatives of $\rho_{i}$ at the point $x$ , respectively.

Since the subdifferential is convex, it suffices to find points $z_{i,1},z_{i,2},\ i=1,\ldots,m$ such that for all $i$ ,

[TABLE]

This task has already been accomplished in the proof of Theorem 2.13: since $\rho^{\prime}_{+,i},\ \rho^{\prime}_{-,i},\ i=1,\ldots,m$ are nondecreasing functions, repeating the argument of section 5.3 yields that, on an event of probability $\geq 1-4e^{-2s}$ , inequalities (23) hold with

[TABLE]

We have thus shown that for each $i=1,\ldots,m$ ,

[TABLE]

with probability $\geq 1-4e^{-2s}$ . Applying the union bound over all $i$ , we obtain the result.

5.7 Proof of Lemma 2.3.

It is a simple numerical fact that whenever

[TABLE]

$\zeta_{j}(n_{j},s)/\sigma_{n_{j}}^{(j)}\leq 1$ (indeed, this follows since $\Phi(1)\simeq 0.8413>1/2+0.33$ ). Set $B(s):=\frac{1}{k}\sum_{j=1}^{k}\left(g_{j}(n_{j})+\sqrt{\frac{s}{k}}\right)$ for brevity. Since $e^{-y^{2}/2}\geq 1-\frac{y^{2}}{2}$ , we have

[TABLE]

where the last inequality follows since $\zeta_{j}(n_{j},s)/\sigma_{n_{j}}^{(j)}\leq 1$ . Equation (25) implies that $\frac{\zeta_{j}(n_{j},s)}{\sigma_{n_{j}}^{(j)}}\leq\frac{6}{5}\alpha_{j}\sqrt{2\pi}B(s)$ . Proceeding again as in (25), we see that

[TABLE]

hence $\frac{\zeta_{j}(n_{j},s)}{\sigma_{n_{j}}^{(j)}}\leq\frac{\sqrt{2\pi}}{1-1.51\,\alpha_{j}^{2}\left(B(s)\right)^{2}}\,\alpha_{j}B(s).$ The claim follows since $\alpha_{j}B(s)\leq 0.33$ for all $j$ by assumption, and $\sigma_{n_{j}}^{(j)}\alpha_{j}\equiv H_{k}$ .
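The two numerical claims used in this proof can be checked directly. In the sketch below, $b$ stands for $\alpha_{j}B(s)$ and $x=\zeta_{j}(n_{j},s)/\sigma_{n_{j}}^{(j)}$ is the exact solution of $\Phi(x)-\frac{1}{2}=b$; the function names are hypothetical, and the grid check is a sanity test rather than a proof.

```python
import math
from statistics import NormalDist


def zeta_over_sigma(b):
    """Exact solution x of Phi(x) - 1/2 = b, for 0 < b < 1/2."""
    return NormalDist().inv_cdf(0.5 + b)


def lemma_bound(b):
    """Upper bound sqrt(2 pi) * b / (1 - 1.51 b^2) derived in the proof."""
    return math.sqrt(2.0 * math.pi) * b / (1.0 - 1.51 * b * b)


def check_on_grid(num_points=33):
    """Verify x <= 1 and x <= bound for b = 0.01, 0.02, ..., num_points / 100."""
    for i in range(1, num_points + 1):
        b = i / 100.0
        x = zeta_over_sigma(b)
        if x > 1.0 or x > lemma_bound(b):
            return False
    return True
```

Calling `check_on_grid()` covers the range $0<b\leq 0.33$ required by the assumption of the lemma.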

Acknowledgements

The authors would like to thank Anatoli Juditsky for many insightful comments and suggestions.

Appendix A Central limit theorem for the non-i.i.d. data.

We present an extension of Theorem 2.15 to non-i.i.d. data for the estimator $\widehat{\theta}^{(k)}=\mbox{med}\left(\bar{\theta}_{1},\ldots,\bar{\theta}_{k}\right)$ that holds under the following assumptions:

(a) $X_{1},\ldots,X_{N}$ are independent, $\mathrm{card}(G_{j})=n_{j}$ , and $\sum_{j=1}^{k}n_{j}=N$ ;

(b) Assumption 1 is satisfied with some $\{\sigma_{n}^{(j)}\}_{n\geq 1}$ and $g_{j}(n)$ , $j=1,\ldots,k$ ;

(c) $k\to\infty$ and $\max_{j=1,\ldots,k}\sqrt{k}\cdot g_{j}(n_{j})\to 0$ as $N\to\infty$ ;

(d) $\max_{j\leq k}\frac{H_{k}}{\sigma_{n_{j}}^{(j)}\sqrt{k}}\xrightarrow{N\to\infty}0$ , where $H_{k}:=\left(\frac{1}{k}\sum_{j=1}^{k}\frac{1}{\sigma_{n_{j}}^{(j)}}\right)^{-1}$ is the harmonic mean of $\sigma_{n_{j}}^{(j)}$ ’s.

Theorem A.27.

Under assumptions (a)-(d) above,

[TABLE]

Proof A.28.

Define $d_{-}(x):=I\left\{x>0\right\}-I\left\{x\leq 0\right\}$ , and $Y_{n_{j},j}=d_{-}\left(\theta_{\ast}-\bar{\theta}_{j}+t\sqrt{\frac{\pi}{2}}\frac{H_{k}}{\sqrt{k}}\right)$ . We will show that

(a) $\frac{1}{k}\sum_{j=1}^{k}\sqrt{k}\mathbb{E}Y_{n_{j},j}\to t$ as $N\to\infty$ ;

(b) $\frac{1}{k}\sum_{j=1}^{k}\mbox{Var}(Y_{n_{j},j})\to 1$ as $N\to\infty$ .

To prove the first claim, first assume that $t\neq 0$ (for $t=0$ the argument follows the same line with simplifications), and observe that

[TABLE]

Moreover,

[TABLE]

while under assumption (d),

[TABLE]

It then follows from assumption (c) that

[TABLE]

Claim (b) follows since $\mathbb{E}\left(Y_{n_{j},j}\right)^{2}=1$ and $\max_{j\leq k}\mathbb{E}Y_{n_{j},j}\to 0$ under assumption (d).

The rest of the argument repeats the proof of Theorem 2.15 for $\rho(x)=|x|$ .

Appendix B Supplementary results.

Lemma B.29.

Let $\mathcal{A}\subset\mathbb{R}$ be symmetric, meaning that $\mathcal{A}=-\mathcal{A}$ , and let $Z\sim N(0,1)$ . Then for all $x\in\mathbb{R}$ ,

[TABLE]

Proof B.30.

Observe that

[TABLE]

and the claim follows.

Lemma B.31.

Inequality $\tanh(x)\geq x\left(\frac{1+x}{1+x+x^{2}}\right)$ holds for all $x\geq 0$ . Moreover, if $\tanh(x)\leq 1/2$ and $x\geq 0$ , then $\tanh(x)\geq 0.83x$ .

Proof B.32.

Since $e^{x}\geq 1+x+\frac{x^{2}}{2}$ for all $x\geq 0$ ,

[TABLE]

Note that $f(x)=\frac{1+x}{1+x+x^{2}}$ is decreasing on $[0,\infty)$ . Whenever $\tanh(x)\leq 1/2$ , $x\leq\frac{\log 3}{2}\leq 0.55$ , hence $\tanh(x)\geq 0.83x$ .
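The two inequalities of Lemma B.31 are easy to sanity-check numerically. The following Python sketch evaluates them on a finite grid; the grid and the floating-point tolerance are arbitrary choices made for this illustration, and the check is of course not a substitute for the proof.

```python
import math


def lower_bound(x):
    """Right-hand side x * (1 + x) / (1 + x + x^2) of the first inequality."""
    return x * (1.0 + x) / (1.0 + x + x * x)


# Grid on [0, 20]; the small tolerance absorbs floating-point rounding near x = 0.
grid = [i / 1000.0 for i in range(20001)]
first_ok = all(math.tanh(x) >= lower_bound(x) - 1e-12 for x in grid)

# tanh(x) <= 1/2 is equivalent to x <= atanh(1/2) = log(3) / 2.
x_max = math.atanh(0.5)
second_ok = all(math.tanh(x) >= 0.83 * x for x in grid if x <= x_max)

# Since f(x) = (1 + x) / (1 + x + x^2) is decreasing, f(0.55) bounds f on [0, 0.55].
f_at_055 = lower_bound(0.55) / 0.55

print(first_ok, second_ok, x_max, f_at_055)
```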

Appendix C Results for the spatial median with respect to the $\|\cdot\|_{2}$ norm.

In this section, we discuss estimation of the multivariate parameter $\theta_{\ast}\in\mathbb{R}^{m}$ based on the $L_{2}$ -median. Let $X_{1},\ldots,X_{N}\in\mathbb{R}^{d}$ be i.i.d. copies of $X$ randomly partitioned into disjoint groups $G_{1},\ldots,G_{k}$ of cardinality $n\geq\lfloor N/k\rfloor$ each, and let $\bar{\theta}_{j}:=\bar{\theta}_{j}(G_{j})\in\mathbb{R}^{m},\ 1\leq j\leq k$ be a sequence of i.i.d. estimators of $\theta_{\ast}$ . We define

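$$\widehat{\theta}^{(k)}:=\mathop{\mbox{argmin}}_{z\in\mathbb{R}^{m}}\sum_{j=1}^{k}\left\|z-\bar{\theta}_{j}\right\|_{2}$$

to be the $L_{2}$ -median of $\bar{\theta}_{1},\ldots,\bar{\theta}_{k}$ .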

Let $Z\in\mathbb{R}^{m}$ have multivariate normal distribution $N(0,\Sigma)$ , and define $\Phi_{\Sigma}(A):=\mathbb{P}{\left(Z\in A\right)}$ for a Borel measurable set $A\subseteq\mathbb{R}^{m}$ . Moreover, define $\mathcal{S}$ to be the set of closed cones,

[TABLE]

We will assume that $\bar{\theta}_{1}$ is “asymptotically normal on cones”:

Assumption 2

There exist a sequence $\{\sigma_{n}\}_{n\in\mathbb{N}}\subset\mathbb{R}_{+}$ and a positive-definite matrix $\Sigma$ such that $\left\|\Sigma\right\|\leq 1$ and

[TABLE]

Theorem C.33.

Let assumption 2 be satisfied. Then with probability $\geq 1-e^{-2s}$ ,

[TABLE]

where

[TABLE]

and $C_{2}(m)=\sqrt{m+2\sqrt{(m-1)\ln 4}}$ .

Remark C.34.

It follows from Lemma B.31 that whenever the right-hand side of the inequality (28) is bounded by $1/2$ , $\tanh\left(\frac{1}{\sigma_{n}}\left\|\widehat{\theta}^{(k)}-\theta_{\ast}\right\|_{2}\right)\geq\frac{0.83}{\sigma_{n}}\left\|\widehat{\theta}^{(k)}-\theta_{\ast}\right\|_{2}$ , which leads to a more explicit bound for $\left\|\widehat{\theta}^{(k)}-\theta_{\ast}\right\|_{2}$ .

As an example, we consider the problem of the multivariate mean estimation. Recall that the condition number $\mathrm{cond}(A)$ of a non-singular matrix $A$ is defined as $\mathrm{cond}(A)=\|A\|\,\|A^{-1}\|$ .

Corollary C.35.

Let $X_{1},\ldots,X_{N}$ be a sequence of i.i.d. copies of a random vector $X\in\mathbb{R}^{d}$ such that $\mathbb{E}X=\theta_{\ast}$ , $\mathbb{E}\left[(X-\theta_{\ast})(X-\theta_{\ast})^{T}\right]=\widetilde{\Sigma}$ , and $\mathbb{E}\|X-\theta_{\ast}\|_{2}^{3}<\infty$ . Define

[TABLE]

Assume that $s>0$ and $k\leq N/2$ are such that

[TABLE]

Then

[TABLE]

with probability $\geq 1-e^{-2s}$ , where $C_{1}(d)$ and $C_{2}(d)$ are the same as in Theorem C.33.

Proof C.36.

It follows from the multivariate Berry-Esseen bound (fact 5) that assumption 2 is satisfied with $\sigma_{n}=\sqrt{\frac{\|\widetilde{\Sigma}\|}{n}}$ , $\Sigma=\frac{\widetilde{\Sigma}}{\|\widetilde{\Sigma}\|}$ and $g_{\mathcal{S}_{d}}(n)=\frac{400d^{1/4}\mathbb{E}\left\|\widetilde{\Sigma}^{-1/2}X\right\|_{2}^{3}}{\sqrt{n}}$ . Noting that $\|\Sigma^{-1/2}\|=\|\widetilde{\Sigma}^{1/2}\|\,\|\widetilde{\Sigma}^{-1/2}\|=\mathrm{cond}(\widetilde{\Sigma}^{1/2})$ , it is easy to deduce the bound from (28) and remark C.34.

Remark C.37.

Note that, similarly to the case $d=1$ , whenever $k\lesssim\sqrt{N}$ (hence, $n\gtrsim\sqrt{N}$ ), the bound of Corollary C.35 is of order $N^{-1/2}$ with respect to the sample size $N$ . However, dependence of the bound on the dimension factor $d$ is suboptimal.

C.1 Overview of numerical algorithms.

Given $x_{1},\ldots,x_{k}\in\mathbb{R}^{d}$ , the function $F(z):=\sum\limits_{j=1}^{k}\|z-x_{j}\|$ is convex and achieves its minimum at a unique point (unless $x_{1},\ldots,x_{k}$ all lie on the same line (Kemperman, 1987)) that belongs to the convex hull of $x_{1},\ldots,x_{k}$ .

The classical algorithm that approximates $\mathop{\mbox{argmin}}_{z\in\mathbb{H}}F(z)$ is the famous Weiszfeld’s algorithm (Weiszfeld, 1936): starting from some $z_{0}$ in the affine hull of $\{x_{1},\ldots,x_{k}\}$ , iterate

$$z_{m+1}=\sum_{j=1}^{k}\alpha^{(j)}_{m+1}x_{j},\qquad(29)$$

where $\alpha^{(j)}_{m+1}=\frac{\|x_{j}-z_{m}\|_{\mathbb{H}}^{-1}}{\sum\limits_{j=1}^{k}\|x_{j}-z_{m}\|_{\mathbb{H}}^{-1}}$ . H. W. Kuhn (Kuhn, 1973) showed that Weiszfeld’s algorithm converges to the geometric median for all but countably many initial points. It is easy to check that (29) is a gradient descent scheme: indeed, it is equivalent to

$$z_{m+1}=z_{m}-\beta_{m+1}g_{m+1},$$

where $\beta_{m+1}=\frac{1}{\sum\limits_{j=1}^{k}\|x_{j}-z_{m}\|_{\mathbb{H}}^{-1}}$ and $g_{m+1}=\sum\limits_{j=1}^{k}\frac{z_{m}-x_{j}}{\|z_{m}-x_{j}\|_{\mathbb{H}}}$ is the gradient of $F$ (we assume here that $z_{m}\notin\{x_{1},\ldots,x_{k}\}$ ).
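A minimal Python sketch of Weiszfeld's iteration for points in Euclidean space is given below. It implements the weighted-average update with weights $\alpha^{(j)}_{m+1}$ and, separately, the gradient step with step size $\beta_{m+1}$ , so that the equivalence of the two forms can be checked numerically. The function names, the starting point (the coordinate-wise mean), the stopping rule and the test data are assumptions of this illustration; in particular, the loop simply stops if an iterate coincides with a data point, which is a simplification rather than one of the modified schemes discussed in the literature.

```python
import math


def weiszfeld_step(points, z):
    """Weighted average with weights proportional to 1 / ||x_j - z||.
    Assumes z is not one of the data points."""
    inv = [1.0 / math.dist(x, z) for x in points]
    total = sum(inv)
    return [sum(w * x[i] for w, x in zip(inv, points)) / total
            for i in range(len(z))]


def gradient_step(points, z):
    """The same update written as z - beta * g, where g is the gradient of F at z."""
    inv = [1.0 / math.dist(x, z) for x in points]
    beta = 1.0 / sum(inv)
    g = [sum(w * (z[i] - x[i]) for w, x in zip(inv, points))
         for i in range(len(z))]
    return [z[i] - beta * g[i] for i in range(len(z))]


def F(points, z):
    """Objective: sum of Euclidean distances from z to the points."""
    return sum(math.dist(x, z) for x in points)


def geometric_median(points, iters=200, tol=1e-12):
    """Run Weiszfeld's iteration from the coordinate-wise mean."""
    k, dim = len(points), len(points[0])
    z = [sum(x[i] for x in points) / k for i in range(dim)]
    for _ in range(iters):
        if any(math.dist(x, z) == 0.0 for x in points):
            break  # update undefined at a data point; stop (simplification)
        z_new = weiszfeld_step(points, z)
        if math.dist(z_new, z) < tol:
            return z_new
        z = z_new
    return z


points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (10.0, 1.0)]
mean = [sum(x[i] for x in points) / len(points) for i in range(2)]
z_hat = geometric_median(points)
print(z_hat, F(points, z_hat), F(points, mean))
```

Because each Weiszfeld update does not increase $F$ , the value at the returned point is no larger than the value at the starting point.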

Various improvements and accelerated versions of Weiszfeld’s algorithm have been proposed and analyzed. Ostresh (1978) provides a modified version of Weiszfeld’s algorithm that converges to the geometric median under reasonable initialization conditions, but the rate of convergence is not specified. Kärkkäinen and Ayrämö (2005) consider the empirical behavior of several modifications of Weiszfeld’s algorithm, and obtain convergence for an SOR method. Vardi and Zhang (2000) demonstrate convergence of another modified Weiszfeld algorithm, but only provide empirical convergence rates. Overton (1983) provides an algorithm that exhibits quadratic convergence under some assumptions, but a quantitative rate is not expressed. Cardot et al. (2013) develop an online stochastic descent algorithm and provide an asymptotic convergence rate. Quantitative error bounds are not available for any of the algorithms discussed so far.

The computer science literature considers the computational complexity of algorithms for computing $\widetilde{\theta}^{(k)}$ such that $F(\widetilde{\theta}^{(k)})$ is close to the minimum value $F(\widehat{\theta}^{(k)})$ . A thorough comparison of such results is provided by Cohen et al. (2016). The results from this work are fully quantitative, but they need to be adapted to our setting. In our statistical estimation setting, we are using $\widehat{\theta}^{(k)}$ to estimate the true parameter $\theta_{\ast}$ , so we want bounds on the proximity $\|\widetilde{\theta}^{(k)}-\widehat{\theta}^{(k)}\|$ instead of bounds on $F(\widetilde{\theta}^{(k)})$ . The following theorem (proven in Section D.3) provides a “local lower bound.”

Theorem C.38.

Suppose $\{x_{i}\}_{i=1}^{k}\subset\mathbb{R}^{d}$ , let $\overline{x}=\frac{1}{k}\sum_{i=1}^{k}x_{i}$ , set $m_{t}=\frac{1}{k}\sum_{i=1}^{k}\|x_{i}-\overline{x}\|^{t}$ for $t=1,2,3$ , and assume that the empirical covariance matrix $\widehat{\Sigma}=\frac{1}{k}\sum_{i=1}^{k}(x_{i}-\overline{x})(x_{i}-\overline{x})^{T}$ satisfies

[TABLE]

where $\lambda_{j}(\widehat{\Sigma})$ are the eigenvalues of $\widehat{\Sigma}$ listed with multiplicity and in non-increasing order. Then, for all $\theta\in\mathbb{R}^{d}$ ,

[TABLE]

where

[TABLE]

Theorem C.38 allows us to infer proximity bounds from all the computer science literature that discusses value bounds. Moreover, this bound is asymptotically stable in the i.i.d. sampling setting assuming the existence of three moments. For small $\|\theta-\widehat{\theta}^{(k)}\|$ , the lower bound is approximately quadratic, and hence the proximity bound behaves like $\sqrt{F(\theta)-F(\widehat{\theta}^{(k)})}$ . On the other hand, this local lower bound fits in well with the theory of Restarted Gradient Descent (Yang and Lin, 2015).

Appendix D Proofs of results in Appendix C.

D.1 Technical background.

Everywhere below, $\Phi_{\Sigma}$ stands for the distribution of the normal vector with mean 0 and covariance matrix $\Sigma$ . The following multivariate version of the Berry-Esseen Theorem for convex sets has been established by Bentkus (2003).

Fact 5 (Multivariate Berry-Esseen bound).

Assume that $Y_{1},\ldots,Y_{n}$ is a sequence of i.i.d. copies of a random vector $Y\in\mathbb{R}^{d}$ with mean $\mu$ , covariance matrix $\Sigma\succ 0$ and such that $\mathbb{E}\|Y\|_{2}^{3}<\infty$ . Let $Z$ have normal distribution $N(0,\Sigma)$ , and $\mathcal{A}$ be the class of all convex subsets of $\mathbb{R}^{d}$ . Then

[TABLE]

where $\bar{Y}=\frac{1}{n}\sum_{j=1}^{n}Y_{j}$ .

Given a metric space $(T,\rho)$ , the covering number $N(T,\rho,\varepsilon)$ is defined as the smallest $N\in\mathbb{N}$ such that there exists a subset $F\subseteq T$ of cardinality $N$ with the property that for all $z\in T$ , $\rho(z,F)\leq\varepsilon$ . When metric $\rho$ is clear from the context, we will simply write $N(T,\varepsilon)$ .
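To make the definition concrete, the following Python sketch builds an $\varepsilon$ -net of a finite metric space greedily. Every point that is not chosen as a center lies within $\varepsilon$ of an earlier center, so the size of the resulting net is an upper bound for $N(T,\rho,\varepsilon)$ ; it is generally not the smallest possible net. The function names and the integer example are assumptions of this illustration.

```python
def greedy_net(points, eps, metric):
    """Return a subset F of `points` such that every point is within
    eps of some element of F (a greedy, not necessarily minimal, net)."""
    centers = []
    for p in points:
        # p becomes a new center only if no existing center covers it.
        if all(metric(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def is_net(points, centers, eps, metric):
    """Check the covering property rho(z, F) <= eps for all z in T."""
    return all(min(metric(p, c) for c in centers) <= eps for p in points)


def abs_metric(a, b):
    return abs(a - b)


T = list(range(101))   # T = {0, 1, ..., 100} with rho(a, b) = |a - b|
eps = 10
net = greedy_net(T, eps, abs_metric)
smaller_net = [10, 31, 52, 73, 94]   # a hand-picked net with 5 elements

print(len(net), is_net(T, net, eps, abs_metric))
print(len(smaller_net), is_net(T, smaller_net, eps, abs_metric))
```

In this example the greedy net has more elements than the hand-picked one, which shows that the greedy construction only bounds the covering number from above.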

Let $\left\{Y(t),\ t\in T\right\}$ be a stochastic process indexed by $T$ . We will say that it has sub-Gaussian increments with respect to metric $\rho$ if for all $t_{1},t_{2}\in T$ and $s>0$ ,

[TABLE]

Fact 6 (Dudley’s entropy bound).

Let $\{Y(t),\ t\in T\}$ be a centered stochastic process with sub-Gaussian increments. Then the following inequality holds:

[TABLE]

where $D(T)$ is the diameter of the space $T$ with respect to $\rho$ .

Proof D.39.

See (Talagrand, 2005).

Finally, we recall two useful facts related to Vapnik-Chervonenkis (VC) combinatorics (see van der Vaart and Wellner, 1996, for the definition of VC dimension and related theory). Let $\mathcal{F}$ be a finite-dimensional vector space of real functions on $S$ .

Fact 7.

Let $\mathcal{C}=\left\{\{f\geq 0\}:\ f\in\mathcal{F}\right\}$ and $\mathcal{C}_{+}=\left\{\{f>0\}:\ f\in\mathcal{F}\right\}$ . Then

[TABLE]

Proof D.40.

See Proposition 3.6.6 in (Giné and Nickl, 2015).

Fact 8.

Let $\mathcal{C}$ be a class of sets of VC-dimension $V$ . Then, for any probability measure $Q$ ,

[TABLE]

for all $0<\varepsilon\leq 1$ .

Proof D.41.

This bound follows from results of R. Dudley (Dudley, 1978) and D. Haussler (Haussler, 1995). The bound with explicit constants as stated above is given in (Pollard, 2000).

D.2 Proof of Theorem C.33.

By the definition of the geometric median,

[TABLE]

hence

[TABLE]

Set $F_{k}(z):=\sum_{j=1}^{k}\left\|z-\frac{1}{\sigma_{n}}\left(\bar{\theta}_{j}-\theta_{\ast}\right)\right\|_{2}$ . Then (32) is equivalent to

[TABLE]

Denote by $\Phi^{(n)}$ the distribution of $\frac{1}{\sigma_{n}}\left(\bar{\theta}_{1}-\theta_{\ast}\right)$ , and by $\Phi^{(n)}_{k}$ the empirical distribution corresponding to the sample

[TABLE]

Let $DF_{k}(\widehat{\mu}^{(k)};u):=\lim_{t\searrow 0}\frac{F_{k}(\widehat{\mu}^{(k)}+tu)-F_{k}(\widehat{\mu}^{(k)})}{t}$ be the directional derivative of $F_{k}$ at point $\widehat{\mu}^{(k)}$ in direction $u$ . Clearly, $DF_{k}(\widehat{\mu}^{(k)};u)\geq 0$ for any $u$ such that $\|u\|_{2}=1$ . On the other hand, it is easy to check that $DF_{k}(\widehat{\mu}^{(k)};u)=\Phi^{(n)}_{k}f_{u,\widehat{\mu}^{(k)}}$ , where

[TABLE]
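The non-negativity of the directional derivative at a minimizer can be illustrated numerically. The Python sketch below uses the standard formula for the directional derivative of a sum of Euclidean distances, in which every data point coinciding with the base point contributes $\|u\|_{2}$ ; this formula, the function names and the one-dimensional example are assumptions of the illustration and are not taken from the displays of this proof.

```python
import math


def F(points, z):
    """Sum of Euclidean distances from z to the points."""
    return sum(math.dist(z, y) for y in points)


def directional_derivative(points, z, u):
    """One-sided directional derivative of F at z in direction u."""
    total = 0.0
    for y in points:
        d = math.dist(z, y)
        if d > 0.0:
            # Smooth term: <u, (z - y) / ||z - y||>.
            total += sum(ui * (zi - yi) for ui, zi, yi in zip(u, z, y)) / d
        else:
            # Non-smooth term: the norm has a kink at z = y.
            total += math.sqrt(sum(ui * ui for ui in u))
    return total


def one_sided_difference(points, z, u, t=1e-6):
    """Finite-difference approximation (F(z + t u) - F(z)) / t."""
    shifted = [zi + t * ui for zi, ui in zip(z, u)]
    return (F(points, shifted) - F(points, z)) / t


points = [(0.0,), (1.0,), (5.0,)]
z_min = (1.0,)   # the median of 0, 1, 5 minimizes F in one dimension

for u in [(1.0,), (-1.0,)]:
    print(u, directional_derivative(points, z_min, u),
          one_sided_difference(points, z_min, u))
```

Here the minimizer is itself a data point, so $F$ is not differentiable there, yet the one-sided derivative is non-negative in both directions.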

Let $\mathcal{S}_{m}$ be the set of closed cones defined in (27), and note that for any unit vector $u\in\mathbb{R}^{m}$ and $t\in[0,1]$ ,

[TABLE]

Next, observe that

[TABLE]

We will assume that $u$ is chosen such that $\Phi_{\Sigma}\,f_{u,\widehat{\mu}^{(k)}}\leq 0$ (if not, simply replace $u$ by $-u$ ). Then (34) implies that

[TABLE]

It remains to estimate the left-hand side of inequality (35) from below and its right-hand side from above. We start by finding an upper bound (proved in section D.2.1) for $\left|(\Phi^{(n)}-\Phi_{\Sigma})\,f_{u,\widehat{\mu}^{(k)}}\right|$ .

Lemma D.42.

The following bound holds:

$$\left|(\Phi^{(n)}-\Phi_{\Sigma})\,f_{u,\widehat{\mu}^{(k)}}\right|\leq 2g_{\mathcal{S}_{m}}(n),$$

where $g_{\mathcal{S}_{m}}(n)$ was defined in assumption 2.

The next lemma (proved in section D.2.2) provides an upper bound for $\left|(\Phi^{(n)}_{k}-\Phi^{(n)})f_{u,\widehat{\mu}^{(k)}}\right|$ .

Lemma D.43.

With probability $\geq 1-e^{-2s}$ ,

[TABLE]

Finally, it remains to estimate $\Phi_{\Sigma}\,f_{-u,\widehat{\mu}^{(k)}}$ from below. The following inequality (proved in section D.2.3) holds:

Lemma D.44.

Set $u=-\frac{\Sigma^{-1}\widehat{\mu}^{(k)}}{\|\Sigma^{-1}\widehat{\mu}^{(k)}\|_{2}}$ . Then

[TABLE]

where $\tanh(\cdot)$ is the hyperbolic tangent defined as $\tanh(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$ .

It therefore follows from Lemmas D.42, D.43 and D.44 that with probability exceeding $1-e^{-2s}$ ,

[TABLE]

which implies the bound of Theorem C.33.

D.2.1 Proof of Lemma D.42.

Recall that for any non-negative function $f:\mathbb{R}^{m}\to\mathbb{R}_{+}$ and a signed measure $Q$ ,

[TABLE]

Hence

[TABLE]

where we used the identity $-f_{u,\widehat{\mu}^{(k)}}=f_{-u,\widehat{\mu}^{(k)}}$ . Next, it follows from (33) that

[TABLE]

by assumption 2. It implies that $\left|(\Phi^{(n)}-\Phi_{\Sigma})\,f_{u,\widehat{\mu}^{(k)}}\right|\leq 2g_{\mathcal{S}_{m}}(n),$ as claimed.

D.2.2 Proof of Lemma D.43.

Using (36) and proceeding as in the proof of Lemma D.42, we obtain that

[TABLE]

It follows from the bounded difference inequality (fact 3) that for all $s>0$ ,

[TABLE]

hence it is enough to control $\mathbb{E}\sup_{A\in\mathcal{S}_{m}}\left|\Phi^{(n)}_{k}(A)-\Phi^{(n)}(A)\right|$ . To this end, we will estimate the covering numbers of the class of cones $\mathcal{S}$ and use Dudley’s integral bound (fact 6).

Given a vector $\mathbf{x}\in\mathbb{R}^{m}$ , let $x_{1},\ldots,x_{m}$ be its coordinates with respect to the standard Euclidean basis. Note that

[TABLE]

which is equivalent to $\sum_{i,j=1}^{m}\alpha_{i,j}x_{i}x_{j}+\sum_{j=1}^{m}\beta_{j}x_{j}+\gamma\geq 0$ and $\left\langle\mathbf{x}-b,u\right\rangle\geq 0$ , where $\alpha_{i,j},\ \beta_{j},\ i,j=1,\ldots,m$ , and $\gamma$ are functions of $t,\ b_{j}$ and $u_{j}$ , $j=1,\ldots,m$ . In particular, every element $A\in\mathcal{S}_{m}$ is the intersection of a half-space $\left\{\mathbf{x}:\ \left\langle\mathbf{x}-b,u\right\rangle\geq 0\right\}$ and a set $\left\{\mathbf{x}:\ f(\mathbf{x})\geq 0\right\}$ , where $f$ is a polynomial of degree $2$ in $m$ variables. The dimension of the space $V_{2,m}$ of polynomials of degree at most $2$ is $\dim(V_{2,m})={m+2\choose 2}$ , hence the Vapnik-Chervonenkis dimension of the collection of sets $\mathcal{S}_{V_{2,m}}=\Big\{\left\{x:f(x)\geq 0\right\},\ f\in V_{2,m}\Big\}$ is $\tilde{m}:={m+2\choose 2}$ by fact 7. It follows from fact 8 that for any probability measure $Q$ ,

[TABLE]

for all $0<\varepsilon\leq 1$ . It is also well known (and can be deduced by similar reasoning) that the VC-dimension of the collection $\mathcal{S}_{L}$ of half-spaces of $\mathbb{R}^{m}$ is $m+1$ , hence

[TABLE]
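The dimension count $\dim(V_{2,m})={m+2\choose 2}$ used above can be verified by listing the monomials of degree at most $2$ in $m$ variables, which form a basis of $V_{2,m}$ . The following Python sketch does this for small $m$ ; the function name is an assumption of the illustration.

```python
from itertools import combinations_with_replacement
from math import comb


def monomials_up_to_degree_2(m):
    """Exponent vectors of all monomials of degree <= 2 in m variables."""
    result = []
    for degree in range(3):
        # A monomial of the given degree is a multiset of variable indices.
        for combo in combinations_with_replacement(range(m), degree):
            exponents = [0] * m
            for index in combo:
                exponents[index] += 1
            result.append(tuple(exponents))
    return result


for m in range(1, 7):
    basis = monomials_up_to_degree_2(m)
    print(m, len(basis), comb(m + 2, 2))
```

The count splits as $1+m+\frac{m(m+1)}{2}$ (constant, linear and quadratic monomials), which equals ${m+2\choose 2}$ .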

Given two collections of sets $\mathcal{C}_{1},\ \mathcal{C}_{2}$ , let $A^{(1)}_{1},\ldots,A^{(1)}_{N(\mathcal{C}_{1},L_{2}(Q),\varepsilon)}$ and $A^{(2)}_{1},\ldots,A^{(2)}_{N(\mathcal{C}_{2},L_{2}(Q),\varepsilon)}$ be the $L_{2}(Q)$ $\varepsilon$-nets of smallest cardinality for the classes of functions $\left\{I_{A}:\ A\in\mathcal{C}_{1}\right\}$ and $\left\{I_{A}:\ A\in\mathcal{C}_{2}\right\}$ respectively. Let $A^{\prime}\in\mathcal{C}_{1},\ A^{\prime\prime}\in\mathcal{C}_{2}$ , and assume without loss of generality that $\|A^{\prime}-A^{(1)}_{1}\|_{L_{2}(Q)}\leq\varepsilon$ and $\|A^{\prime\prime}-A^{(2)}_{1}\|_{L_{2}(Q)}\leq\varepsilon$ . Then

[TABLE]

which implies that the covering number of the class $\mathcal{D}=\left\{I_{A_{1}}I_{A_{2}},\ A_{1}\in\mathcal{C}_{1},\ A_{2}\in\mathcal{C}_{2}\right\}$ corresponding to intersections of elements of $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ satisfies

[TABLE]

In particular, the metric entropy of the class of cones $\mathcal{S}_{m}$ can be bounded as

[TABLE]

uniformly over all probability measures $Q$ , hence fact 6 implies that

[TABLE]

D.2.3 Proof of Lemma D.44.

Making the change of variables $x=\Sigma^{1/2}z$ , we obtain

[TABLE]

where $\tilde{u}=\frac{\Sigma^{1/2}u}{\|\Sigma^{1/2}u\|_{2}}$ . Let $\kappa:=\left\|\Sigma^{-1/2}\widehat{\mu}^{(k)}\right\|_{2}$ , and note that $\kappa\geq\left\|\widehat{\mu}^{(k)}\right\|_{2}$ since $\|\Sigma\|\leq 1$ by assumption. Let $V$ be any orthogonal transformation that maps $\Sigma^{-1/2}\widehat{\mu}^{(k)}$ to $\kappa e_{1}$ (here, $e_{1},\ldots,e_{m}$ is the standard Euclidean basis of $\mathbb{R}^{m}$ ). Then, letting $y=V(z-\Sigma^{-1/2}\widehat{\mu}^{(k)})$ , we observe that

[TABLE]

Setting $u=-\frac{\Sigma^{-1}\widehat{\mu}^{(k)}}{\|\Sigma^{-1}\widehat{\mu}^{(k)}\|_{2}}$ , we obtain from the last inequality that

[TABLE]

Set $y=(-t,z)$ , where $t\in\mathbb{R}$ and $z\in\mathbb{R}^{m-1}$ . We will also let $\phi_{k}$ denote the density (with respect to Lebesgue measure) of the standard normal distribution on $\mathbb{R}^{k}$ . Then

[TABLE]

Setting $h(t,z)=t/\sqrt{t^{2}+\|z\|_{2}^{2}}$ , we have that

[TABLE]

Now, for any $t\geq 0$ ,

[TABLE]

hence

[TABLE]

where we have used the inequality $h(t,z)\geq(1+R^{2})^{-1/2}$ , valid whenever $\|z\|^{2}_{2}\leq R^{2}$ and $t\geq 1$ , and the bound $1-\Phi(1)>0.15$ . Finally, a well-known bound states that if $Y$ has $\chi_{m-1}^{2}$ distribution, then for all $t>0$

[TABLE]

For $R^{2}:=m-1+2\sqrt{(m-1)\ln 4}$ , it implies that

[TABLE]

which concludes the proof.
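The elementary numerical facts used in this proof can be checked with a short Python sketch: the bound $1-\Phi(1)>0.15$ , the inequality $h(t,z)\geq(1+R^{2})^{-1/2}$ for $t\geq 1$ and $\|z\|_{2}\leq R$ , and the identity $C_{2}(m)^{2}=1+R^{2}$ for the choice of $R^{2}$ made above. The function names, the dimension $m=5$ and the random test points are assumptions of this illustration.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def h(t, z):
    """h(t, z) = t / sqrt(t^2 + ||z||_2^2)."""
    return t / math.sqrt(t * t + sum(zi * zi for zi in z))


def R_squared(m):
    """R^2 = m - 1 + 2 * sqrt((m - 1) * ln 4)."""
    return (m - 1) + 2.0 * math.sqrt((m - 1) * math.log(4.0))


def C2(m):
    """C_2(m) = sqrt(m + 2 * sqrt((m - 1) * ln 4))."""
    return math.sqrt(m + 2.0 * math.sqrt((m - 1) * math.log(4.0)))


rng = random.Random(1)
m = 5
R = math.sqrt(R_squared(m))
worst = float("inf")
for _ in range(10000):
    # Random z in R^(m-1) with ||z||_2 <= R, and random t >= 1.
    direction = [rng.gauss(0.0, 1.0) for _ in range(m - 1)]
    norm = math.sqrt(sum(d * d for d in direction))
    radius = R * rng.random()
    z = [radius * d / norm for d in direction]
    t = 1.0 + rng.expovariate(1.0)
    worst = min(worst, h(t, z))

print(1.0 - std_normal_cdf(1.0), worst, 1.0 / C2(m))
```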

D.3 Proof of Theorem C.38.

To simplify notation in what follows, we let $z^{\ast}=\mathop{\mbox{argmin}}_{z\in\mathbb{R}^{d}}F(z)$ . We let $f_{i}(z)=\|z-x_{i}\|$ for all $i=1,\ldots,k$ and observe that a weak gradient of $f_{i}(z)$ is given by

[TABLE]

Hence, $\nabla F(z)=\sum_{i=1}^{k}\nabla f_{i}(z)$ is a weak gradient of $F$ .
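As a sanity check on the gradient of $F$ , the following Python sketch compares it with central finite differences at a point that is not a data point. It assumes the usual choice of weak gradient, $\nabla f_{i}(z)=\frac{z-x_{i}}{\|z-x_{i}\|}$ for $z\neq x_{i}$ and $\nabla f_{i}(x_{i})=0$ ; this choice, the function names and the test data are assumptions of the illustration.

```python
import math


def F(points, z):
    """F(z) = sum_i ||z - x_i||."""
    return sum(math.dist(z, x) for x in points)


def grad_F(points, z):
    """Weak gradient of F: sum of unit vectors (z - x_i) / ||z - x_i||,
    with the (assumed) convention that a term with z = x_i contributes 0."""
    g = [0.0] * len(z)
    for x in points:
        d = math.dist(z, x)
        if d > 0.0:
            for i in range(len(z)):
                g[i] += (z[i] - x[i]) / d
    return g


def numerical_grad(points, z, step=1e-6):
    """Central finite differences of F, coordinate by coordinate."""
    g = []
    for i in range(len(z)):
        plus = list(z)
        minus = list(z)
        plus[i] += step
        minus[i] -= step
        g.append((F(points, plus) - F(points, minus)) / (2.0 * step))
    return g


points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (10.0, 1.0)]
z = (1.0, 2.5)
print(grad_F(points, z), numerical_grad(points, z))

# For a symmetric configuration the minimizer is the center, where the gradient vanishes.
square = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
print(grad_F(square, (2.0, 2.0)))
```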

Now, fix $z\in\mathbb{R}^{d}$ with $z\not=z^{\ast}$ , let $r=\|z-z^{\ast}\|$ , and set $u=\frac{1}{r}(z-z^{\ast})$ . The second fundamental theorem of calculus yields

[TABLE]

In this last line, we have set $\gamma_{i}=\|z^{\ast}-x_{i}\|$ and $c_{i}=\frac{1}{\gamma_{i}}(z^{\ast}-x_{i})^{T}u$ . By Cauchy-Schwarz, we have that $c_{i}^{2}\leq 1$ . If $c_{i}^{2}=1$ , then

[TABLE]

for all $t\geq 0$ . If $c_{i}^{2}<1$ , then we have that

[TABLE]

Note that $\sum_{i=1}^{k}c_{i}=\nabla F(z^{\ast})^{T}u=0$ since $z^{\ast}$ is the minimizer. Consequently, we have

[TABLE]

Given that

[TABLE]

we obtain the lower bound

[TABLE]

Noting that the inverse cubic function is convex, Jensen’s inequality and straightforward integration yield

[TABLE]

We now observe that

[TABLE]

and also that

[TABLE]

where $\{u,u_{2},\ldots,u_{d}\}$ is an orthonormal basis for $\mathbb{R}^{d}$ . We further observe that

[TABLE]

The Courant-Fischer characterization of eigenvalues gives us

[TABLE]

where $\{\lambda_{j}(\widehat{\Sigma})\}_{j=1}^{d}$ are the eigenvalues of the empirical covariance matrix listed with multiplicity and in non-increasing order. We therefore have

[TABLE]

and the result follows.

Bibliography

Alon, N., Matias, Y. and Szegedy, M. (1996) The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, 20–29. ACM.
Arcones, M. A. (1996) The Bahadur-Kiefer representation for U-quantiles. The Annals of Statistics, 24, 1400–1422.
Battey, H., Fan, J., Liu, H., Lu, J. and Zhu, Z. (2015) Distributed estimation and inference with statistical guarantees. arXiv preprint arXiv:1509.05457.
Bentkus, V. (2003) On the dependence of the Berry–Esseen bound on dimension. Journal of Statistical Planning and Inference, 113, 385–402.
Bentkus, V., Bloznelis, M. and Götze, F. (1997) A Berry–Esséen bound for M-estimators. Scandinavian Journal of Statistics, 24, 485–502.
Berry, A. C. (1941) The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49, 122–136.
Bickel, P. J. et al. (1965) On some robust estimates of location. The Annals of Mathematical Statistics, 36, 847–858.
Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013) Bandits with heavy tail. IEEE Transactions on Information Theory, 59, 7711–7717.
