Distributed Statistical Estimation and Rates of Convergence in Normal Approximation
Stanislav Minsker, Nate Strawn

TL;DR
This paper introduces new distributed statistical estimation algorithms leveraging divide-and-conquer strategies, establishing their convergence rates and robustness, with applications to median-of-means and maximum likelihood estimators.
Contribution
It develops novel algorithms for distributed estimation, linking their performance to normal approximation rates, and provides non-asymptotic deviation bounds and limit theorems.
Findings
New bounds for median-of-means estimator in distributed settings
Performance guarantees for distributed maximum likelihood estimation
Robustness of divide-and-conquer algorithms in large systems
Abstract
This paper presents a class of new algorithms for distributed statistical estimation that exploit divide-and-conquer approach. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important characteristic for large distributed systems. We establish connections between performance of these distributed algorithms and the rates of convergence in normal approximation, and prove non-asymptotic deviations guarantees, as well as limit theorems, for the resulting estimators. Our techniques are illustrated through several examples: in particular, we obtain new results for the median-of-means estimator, as well as provide performance guarantees for distributed maximum likelihood estimation.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Distributed Statistical Estimation and Rates of Convergence in Normal Approximation
Stanislav Minsker 111S. Minsker was partially supported by the National Science Foundation grant DMS-1712956.
Nate Strawn
Abstract
This paper presents a class of new algorithms for distributed statistical estimation that exploit divide-and-conquer approach. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important characteristic for large distributed systems. We establish connections between performance of these distributed algorithms and the rates of convergence in normal approximation, and prove non-asymptotic deviations guarantees, as well as limit theorems, for the resulting estimators. Our techniques are illustrated through several examples: in particular, we obtain new results for the median-of-means estimator, as well as provide performance guarantees for distributed maximum likelihood estimation.
1 Introduction.
According to (IBM, 2015), “Every day, we create 2.5 quintillion bytes of data so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos.. to name a few. This data is big data”. Novel scalable and robust algorithms are required to successfully address the challenges posed by big data problems. This paper develops and analyzes techniques that exhibit scalability, a necessary characteristic of modern methods designed to perform statistical analysis of large datasets, as well as robustness that guarantees stable performance of distributed systems when some of the nodes exhibit abnormal behavior.
The computational power of a single computer is often insufficient to store and process modern data sets, and instead data is stored and analyzed in a distributed way by a cluster consisting of several machines. We consider a distributed estimation framework wherein data is assumed to be randomly assigned to computational nodes that produce intermediate results. We assume that no communication between the nodes is allowed at this first stage. On the second stage, these intermediate results are used to compute some statistic on the whole dataset; see figure 1 for a graphical illustration.
Often, such a distributed setting is unavoidable in applications, whence interactions between subsamples stored on different machines are inevitably lost. Most previous research focused on the following question: how significantly does this loss affect the quality of statistical estimation when compared to an “oracle” that has access to the whole sample? The question that we ask in this paper is different: what can be gained from randomly splitting the data across several subsamples? What are the statistical advantages of the divide-and-conquer framework? Our work indicates that one of the key benefits of an appropriate merging strategy is robustness. In particular, the quality of estimation attained by the distributed estimation algorithm is preserved even if a subset of machines stops working properly. At the same time, the resulting estimators admit tight probabilistic guarantees (expressed in the form of exponential concentration inequalities) even when the distribution of the data has heavy tails – a viable model of real-world samples contaminated by outliers.
We establish connections between a class of randomized divide-and-conquer strategies and the rates of convergence in normal approximation. Using these connections, we provide a new analysis of the “median-of-means” estimator which often yields significant improvements over the previously available results. We further illustrate the implications of our results by constructing novel algorithms for distributed Maximum Likelihood Estimation that admit strong performance guarantees under weak assumptions on the underlying distribution.
1.1 Background and related work.
We begin by introducing a simple model for distributed statistical estimation. Let be a sequence of independent random variables with values in a measurable space that represent the data available to a statistician. We will assume that is large, and that that the sample is partitioned into disjoint subsets of cardinalities respectively, where the partitioning scheme is independent of the data. Let be the distribution of , . The goal is to estimate an unknown parameter shared by and taking values in a separable Hilbert space ; for example, if , could be the common mean of . Distributed estimation protocol proceeds via performing “local” computations on each subset , and the local estimators are then pieced together to produce the final “global” estimator . We are interested in the statistical properties of such distributed estimation protocols, and our main focus is on the final step that combines the local estimators. Let us mention that the condition requiring the sets to be disjoint can be relaxed; we discuss the extensions related to U-quantiles in section 2.6 below.
The problem of distributed and communication - efficient statistical estimation has recently received significant attention from the research community. While our review provides only a subsample of the abundant literature in this field, it is important to acknowledge the works by Mcdonald et al. (2009); Zhang et al. (2012); Fan et al. (2014); Battey et al. (2015); Duchi et al. (2014); Shafieezadeh-Abadeh et al. (2015); Lee et al. (2015); Cheng and Shang (2015); Rosenblatt and Nadler (2016); Zinkevich et al. (2010). Li et al. (2016); Scott et al. (2016); Shang and Cheng (2015); Minsker et al. (2014) have investigated closely related problems for distributed Bayesian inference. Applications to important algorithms such as Principal Component Analysis were investigated in (Fan et al., 2017; Liang et al., 2014), among others. Jordan (2013), author provides an overview of recent trends in the intersection of the statistics and computer science communities, describes popular existing strategies such as the “bag of little bootstraps”, as wells as successful applications of the divide-and-conquer paradigm to problems such as matrix factorization.
The majority of the aforementioned works propose averaging of local estimators as a final merging step. Indeed, averaging reduces variance, hence, if the bias of each local estimator is sufficiently small, their average often attains optimal rates of convergence to the unknown parameter . For example, when is the mean of and is the sample mean evaluated over the subsample , then the average of local estimators is just a empirical mean evaluated over the whole sample. More generally, it has been shown by Battey et al. (2015); Zhang et al. (2013) that in many problems (for instance, linear regression), can be taken as large as without negatively affecting the estimation rates; similar guarantees hold for a variety of M-estimators (see Rosenblatt and Nadler, 2016). However, if the number of nodes itself is large (the case we are mainly interested in), then the averaging scheme has a drawback: if one or more among the local estimators ’s is anomalous (for example, due to data corruption or a computer system malfunctioning), then statistical properties of the average will be negatively affected as well. For large distributed systems, this drawback can be costly.
One way to address this issue is to replace averaging by a more robust procedure, such as the median or a robust M-estimator; this approach is investigated in the present work. In the univariate case (, the merging strategies we study can be described as solutions of the optimization problem
[TABLE]
for an appropriately defined convex function ; we investigate this class of estimators in detail. A natural extension to the case is to consider
[TABLE]
for some convex function and norm . For example, if , then becomes the spatial (also known as geometric or Haldane’s) median (Haldane, 1948; Small, 1990) of . Since the median remains stable as long as at least a half of the nodes in the system perform as expected, such model for distributed estimation is robust. The merging approach based on the various notions of the multivariate median has been previously considered by Minsker (2015) and Hsu and Sabato (2016); here, we analyze the setting when and is the -norm using the novel approach.
Existing results for the median-based merging strategies have several pitfalls related to the deviation rates, and in most cases known guarantees are suboptimal. In particular, these guarantees suggest that estimators obtained via the median-based approach are very sensitive to the choice of , the number of partitions. For instance, consider the problem of univariate mean estimation, where are i.i.d. copies of , and is the expectation of . Assume that for all , let be the empirical mean evaluated over the subsample , and define the “median-of-means” estimator via
[TABLE]
where is the usual univariate median. This estimator has been introduced by Nemirovski and Yudin (1983) in the context of stochastic optimization, and later appeared in (Jerrum et al., 1986) and (Alon et al., 1996). If it has been shown (for example, by Lerasle and Oliveira, 2011) that the median-of-means estimator satisfies
[TABLE]
with probability . However, this bound, while being the current state of the art, does not tell us what happens at the confidence levels other than . For example, if , the only conclusion we can make is that with high probability, which is far from the optimal rate . And if we want the bound to hold with confidence 99% instead of , then, according to (3), we should take , in which case the beneficial effect of parallel computation is very limited. The natural question to ask is the following: is the median-based merging step indeed suboptimal for large values of (e.g., ), or is the problem related to the suboptimality of existing bounds? We claim that in many situations the latter is the case, and that previously known results can be strengthened: for instance, the statement of Corollary 2.6 below implies that whenever , the median-of-means estimator satisfies
[TABLE]
with probability , for all . In particular, this inequality shows that the estimator (2) has “typical” deviations of order whenever , hence the “statistical cost” of employing a large number of computational nodes is minor. Moreover, we will prove that if and as . It will also be demonstrated that improved bounds hold in other important scenarios, such as maximum likelihood estimation, even when the subgroups have different sizes and the observations are not identically distributed.
1.2 Organization of the paper.
Section 1.3 describes notation used throughout the paper. Sections 2 and 3 present main results and examples for the cases of univariate and multivariate parameter respectively. Outcomes of numerical simulation are discussed in section 4, and proofs of the main results are contained in section 5.
1.3 Notation.
Everywhere below, and stand for the and norms of a vector, and - for the operator norm of a matrix (its largest singular value).
Given a probability measure , will stand for the expectation with respect to , and we will write when is clear from the context. Convergence in distribution will be denoted by .
For two sequences and for , the expression means that there exists a constant such that for all . Absolute constants will be denoted , etc., and may take different values in different parts of the paper. For a function , we define
[TABLE]
and . Finally, and will denote the right and left derivatives of respectively (whenever these limits exist). Additional notation and auxiliary results are introduced on demand for the proofs in section 5.
1.4 Main results.
As we have argued above, existing guarantees for the estimator (2) are sensitive to the choice of , the number of partitions. In the following sections, we demonstrate that these bounds are often suboptimal, and show that large values of often do not have a significant negative effect on the statistical performance of resulting algorithms.
The key observation underlying the subsequent exposition is the following: assume that the “local estimators” , are asymptotically normal with asymptotic mean equal to . In particular, distributions of ’s are approximately symmetric, with being the center of symmetry. The location parameters of symmetric distributions admits many robust estimators of the form (1), the sample median being a notable example.
This intuition allows us to establish a parallel between the non-asymptotic deviation guarantees for distributed estimation procedures of the form (1) and the degree of symmetry of “local” estimators quantified by the rates of convergence to normal approximation. Results for the univariate case are presented in section 2, and extensions to the multivariate case are presented in section 3.
2 The univariate case.
We assume that is a collection of independent (but not necessarily identically distributed) -valued random variables with distributions respectively. The data are partitioned into disjoint groups of cardinality each, and such that . Let be a sequence of independent estimators of the parameter shared by . Our main assumption will be that are asymptotically normal as quantified by the following condition.
Assumption 1
Let be the cumulative distribution function of the standard normal random variable . For each , there exist a sequence such that
[TABLE]
Clearly, functions , control the rate of convergence of estimators to the normal law. Furthermore, let
[TABLE]
be the harmonic mean of ’s, and set . Note that , and that if .
2.1 Merging procedure based on the median.
In this subsection, we establish guarantees for the merging procedure based on the sample median, namely,
[TABLE]
This case is treated separately due to its practical importance, the fact that we can obtain better numerical constants, and a conceptually simpler proof.
Theorem 1
Assume that and are such that
[TABLE]
Moreover, let assumption 1 be satisfied, and let solve the equation
[TABLE]
Then for all satisfying (4),
[TABLE]
with probability at least .
Proof 2.2**.**
See section 5.2.
The following lemma yields a more explicit form of the bound and numerical constants.
Lemma 2.3**.**
Assume that . Then
[TABLE]
Proof 2.4**.**
See section 5.7.
Remark 2.5**.**
Let be the non-decreasing rearrangement of . It is easy to see that the harmonic mean of satisfies
[TABLE]
for any integer , hence, informally speaking, the deviations of are controlled by the smallest ’s rather than the largest.
2.2 Example: new bounds for the median-of-means estimator.
The univariate mean estimation problem is pervasive in statistics, and serves as a building block of more advanced methods such as empirical risk minimization. Early works on robust mean estimation include Tukey’s “trimmed mean” (Tukey and Harris, 1946), as well as “winsorized mean” (Bickel et al., 1965); also see discussion in (Bubeck et al., 2013). These techniques often produce estimators with significant bias. A different approach based on M-estimation was suggested by O. Catoni (Catoni, 2012); Catoni’s estimator yields almost optimal constants, however, its construction requires additional information about the variance or the kurtosis of the underlying distribution; moreover, its computation is not easily parallelizable, therefore this technique cannot be easily employed in the distributed setting.
Here, we will focus on a fruitful idea that is commonly referred to as the “median-of-means” estimator that was formally defined in equation (2) above. Several refinements and extensions of this estimator to higher dimensions have been recently introduced by Minsker (2015); Hsu and Sabato (2013); Devroye et al. (2016); Joly et al. (2016); Lugosi and Mendelson (2017). Advantages of this method include the facts that that it can be implemented in parallel and does not require prior knowledge of any information about parameters of the distribution (e.g., its variance). The following result for the median-of-means estimator is the corollary of Theorem 1; for brevity, we treat only the i.i.d. case. Recall that and .
Corollary 2.6**.**
Let be a sequence of i.i.d. copies of a random variable such that , , , and set . Then for all such that , the estimator defined in (2) satisfies
[TABLE]
with probability at least .
Remark 2.7**.**
The term can be thought of as the “bias” due to asymmetry of the distribution of the sample mean. Note that whenever (so that ), the right-hand side of the inequality above is of order .
Proof 2.8**.**
It follows from the Berry-Esseen Theorem (fact 1 in section 5.1) that assumption 1 is satisfied with , and
[TABLE]
for all . Lemma 2.3 implies that , and the claim follows from Theorem 1.
For distributions with infinite third moment, the rate of convergence in the Berry-Esseen type bound is slower, and the following result holds instead.
Corollary 2.9**.**
Let be a sequence of i.i.d. copies of a random variable such that , , for some . Then there exist absolute constants such that for all and satisfying , the following inequality holds with probability at least :
[TABLE]
In this case, typical deviations of are still of order as long as . The proof of this result follows from fact 2 in section 5.1 in the same way as Corollary 2.6 was deduced from the Berry-Esseen bound.
2.3 Example: distributed maximum likelihood estimation.
Let be i.i.d. copies of a random vector with distribution , where . Assume that for each , is absolutely continuous with respect to a -finite measure , and let be the corresponding density. In this section, we state sufficient conditions for assumption 1 to be satisfied when are the maximum likelihood estimators (van der Vaart, 1998) of . Conditions stated below were obtained by Pinelis (2016). All derivatives below (denoted by ′) are taken with respect to , unless noted otherwise.
Assume that the the log-likelihood function satisfies the following:
- (1)
for some ; 2. (2)
“standard regularity conditions” that allow differentiation under the expectation: assume that , and that the Fisher information is finite; 3. (3)
; 4. (4)
for -almost all , is three times differentiable for , and ; 5. (5)
for some positive constants and .
In turn, condition (5) above is implied by the following two inequalities (see Pinelis, 2016, section 6.2, for detailed discussion and examples):
, where is the Hellinger distance, and are positive constants; 2. 2.
for some positive constants and and all .
Corollary 2.10**.**
Assume that conditions (1)-(5) are satisfied, and that . Then for all such that ,
[TABLE]
with probability at least , where is a positive constant that depends only on .
Proof 2.11**.**
It follows from results in (Pinelis, 2016), in particular equation (5.5), that whenever conditions (1)-(5) hold, assumption 1 is satisfied for all with , where is the Fisher information, and , where is a constant that depends only on . Lemma 2.3 implies that
[TABLE]
and the claim follows from Theorem 1.
Remark 2.12**.**
Results of this section can be extended to include other M-estimators besides MLEs, as Bentkus et al. (1997) have shown that M-estimators satisfy a variant of Berry-Esseen bound under rather general conditions.
2.4 Merging procedures based on robust M-estimators.
In this subsection, we study the family of merging procedures based on the M-estimators
[TABLE]
The sample median corresponds to the choice of (non-smooth) and was treated separately above; here, it will be assumed that is convex, even, differentiable function such that as and . A particular example of such a function is Huber’s loss
[TABLE]
where is a positive constant. The following result quantifies non-asymptotic performance of the estimator . As before, we set
[TABLE]
where ’s are defined in assumption 1. Moreover, given the loss as above, let be such that for .
Theorem 2.13**.**
Let assumption 1 be satisfied, and suppose that and are such that
[TABLE]
Then for all satisfying (8),
[TABLE]
with probability at least .
Proof 2.14**.**
See section 5.3.
Note that the bound depends on only through . Assume for concreteness that , and that is Huber’s loss defined in (6), so that . For to be bounded above by an absolute constant, one should choose to be of order . While the latter quantity is typically unknown, it can be estimated in some cases. For example, if the data are i.i.d. then for all . Since ’s are approximately normal, their standard deviation can be estimated by the median absolute deviation as
[TABLE]
where the factor is introduced to make the estimator consistent (Hampel et al., 2011); another possibility is to use bootstrap (Ghosh et al., 1984).
2.5 Asymptotic results.
In this section, we complement the previously discussed non-asymptotic deviation bounds for by the asymptotic results. For the benefits of clarity, we state the complete list of assumptions made below:
are i.i.d., and ; result for non-identically distributed data is presented in Appendix A. 2. 2.
Assumption 1 is satisfied for some function (note that there is no dependence on index due to the i.i.d. assumption); 3. 3.
and are such that and as ; 4. 4.
is a convex, even function, such that as and (here, is defined as the average of the right and left derivatives of at ). 5. 5.
is defined as
[TABLE]
where is a normalizing sequence from assumption 1 (our definition of the estimator is slightly different than in section 2.4 which allows to keep fixed as and are changing).
For , define
[TABLE]
where . Note that, since is differentiable almost everywhere, .
Theorem 2.15**.**
Under assumptions (a)-(e) above,
[TABLE]
where .
Proof 2.16**.**
See section 5.4.
For example, if , Theorem 2.15 implies that under appropriate assumptions, the median-of-means estimator defined in (2) satisfies
[TABLE]
Indeed, in this case , where , and
[TABLE]
hence a simple calculation yields .
If we consider the mean estimation problem with Huber’s loss (6) instead of , we similarly deduce that
[TABLE]
and we get the well-known (Huber, 1964) expression ; in particular, as , and the convergence is fast. For instance, for and for .
Remark 2.17**.**
The key assumptions in the list (a)-(e) governing the regime of growth of and are (b) and (c). For instance, if the random variables possess finite moments of order for some , then it follows from fact 2 in section 5.1 that as .
2.6 Connections to U-quantiles.
In this section, we discuss connections of proposed algorithms to U-quantiles and the assumption requiring the groups to be disjoint. We assume that the data are i.i.d. with common distribution , and let be a real-valued parameter of interest. It is clear that the estimators produced by distributed algorithms considered above depend on the random partition of the sample. A natural way to avoid such dependence is to consider the U-quantile (in this case, the median)
[TABLE]
where is a collection of all distinct subsets of of cardinality , and is an estimator of based on . For instance, when and , is the well-known Hodges-Lehmann estimator of the location parameter, see (Hodges and Lehmann, 1963; Lehmann and D’Abrera, 2006); for a comprehensive study of U-quantiles, see (Arcones, 1996). The main result of this section is an analogue of Theorem 1 for the estimator ; it implies that theoretical guarantees for the performance of are at least as good as for the estimator . Since the data are i.i.d., it is enough to impose the assumption 1 on only, hence we drop the index and denote the normalizing sequence and the corresponding error function .
Theorem 2.18**.**
Assume that and are such that
[TABLE]
Moreover, let assumption 1 be satisfied, and let solve the equation
[TABLE]
Then for any satisfying (10),
[TABLE]
with probability at least .
Proof 2.19**.**
See section 5.5. As before, a more explicit form of the bound immediately follows from Lemma 2.3.
A drawback of the estimator is the fact that its exact computation requires evaluation of estimators over subsamples . For large and , such task becomes intractable. However, an approximate result can be obtained by choosing subsets from uniformly at random, and setting . Typically, the error is of order with high probability over the random draw of .
We note that Theorem 2.13 admits a similar extension for the estimator defined as
[TABLE]
Namely, if the data are i.i.d., then under the assumptions of section 2.4,
[TABLE]
with probability at least , whenever and are such that
[TABLE]
We omit the proof of (11) since the required modifications in the argument of Theorem 2.13 are exactly the same as those explained in the proof of Theorem 2.18.
3 Estimation in higher dimensions.
In this section, it will be assumed that , is a vector-valued parameter of interest. Let be independent -valued random variables that are randomly partitioned into disjoint groups of cardinality each. Let be a sequence of estimators of , the common parameter of the distributions of ’s. Assume that are convex, even functions such that as and , with defined as the average of the right and left derivatives of , , and let
[TABLE]
where and for .
For the sake of clarity, we will assume below that are i.i.d. However, results can be easily extended to the case of non-identically distributed data in a manner described in section 2.4. Assumption 1 will be required to hold coordinatewise, namely, we will assume that there exist sequences , such that
[TABLE]
Note that the maximum over the second index disappears due to the i.i.d. assumption.
Theorem 3.20**.**
Let be such that and for . Let assumption 1 hold for each coordinate of , and suppose that and are such that
[TABLE]
Then for all satisfying (13) and all simultaneously,
[TABLE]
with probability at least .
Proof 3.21**.**
See section 5.6.
3.1 Example: multivariate median-of-means estimator.
Consider the special case of Theorem 3.20 when is the mean of , is the sample mean evaluated over the subsample , and for all . In this case, becomes the spatial median with respect to the -norm, namely,
[TABLE]
The problem of finding the mean estimator that admits sub-Gaussian concentration around under weak moment assumptions on the underlying distribution has recently been investigated in several works. For instance, Joly et al. (2016) construct an estimator that admits “almost optimal” behavior under the assumption that the entries of possess 4 moments. Recently, Lugosi and Mendelson (2017, 2018) proposed new estimators that attains optimal bounds and requires existence of only 2 moments. More specifically, the aforementioned papers show that, for any such that , there exists an estimator such that with probability at least ,
[TABLE]
where are numerical constants, is the covariance matrix of , is its trace and - its largest eigenvalue. However, construction of these estimators explicitly depends on the desired confidence level , and (more importantly) they are numerically difficult to compute.
On the other hand, Theorem 3.20 demonstrates that performance of the multivariate median-of-means estimator is robust with respect to the choice of the number of subgroups , and the resulting deviation bounds hold simultaneously over the range of confidence parameter whenever the coordinates of possess moments for some . The following corollary summarizes these claims.
Corollary 3.22**.**
Let be i.i.d. random vectors such that is the unknown mean, is the covariance matrix, , and for some . Then there exist absolute constants such that for all and satisfying
[TABLE]
with probability at least for all simultaneously,
[TABLE]
Proof 3.23**.**
It follows from fact 2 in section 5.1 that can be bounded as
[TABLE]
for an absolute constant . Moreover, it is easy to see that for all and that assumption 1 holds with . Now the claim immediately follows from Theorem 3.20.
Remark 3.24**.**
Estimator (15) admits a natural generalization of the form
[TABLE]
where is a norm in and is a convex, non-decreasing function. For example, if is the Euclidean norm, resulting estimator is invariant with respect to the orthogonal transformations. However, available performance guarantees for this estimator hold under stronger assumptions (such as joint asymptotic normality of the coordinates of ’s instead of coordinate-wise asymptotic normality), and exhibit suboptimal dependence on the dimension; these results, along with the discussion of relevant numerical methods, are presented in Appendix C. Complete characterization of the effect of the norm on the geometry of the problem and performance of the corresponding estimator (16) warrants further study.
4 Simulation results.
We illustrate results of the previous sections with numerical simulations that compare performance of the median-of-means estimator with the usual sample mean, see figure 2 below.
Moreover, we compared the theoretical guarantees for the median-of-means estimator (described in section 2.2) against the empirical outcomes for the Lomax distribution with shape parameter and scale parameter ; the corresponding probability density function is
[TABLE]
In particular, the Lomax distribution with and has mean and median . Since the mean and median do not coincide, the error of the median-of-means estimator has a significant bias component for large values of . Figure 3 depicts the impact of the bias beyond (equivalently, ), and also the fact that the median error is mostly flat for .
Finally, we assessed empirical coverage of the confidence intervals constructed using Theorem 2.15 and centered at the median-of-means estimator; results are presented in figure 4. The sample of size was generated from the half-t distribution with degrees of freedom; recall that a random variable has half-t distribution with degrees of freedom if where has usual t-distribution with degrees of freedom. It is clear that half-t distribution is both asymmetric and heavy-tailed. Each sample was further corrupted by outliers sampled from the normal distribution with mean [math] and standard deviation ; the number of outliers ranged from [math] to with increments of . The median-of-means estimator was constructed for . For comparison, we present empirical coverage levels attained by the sample mean in the same framework.
5 Proofs
In this section, we present the proofs of the main results.
5.1 Preliminaries.
We recall several facts that are used in the proofs below. The following bound has been established by A. Berry (Berry, 1941) and C.-G. Esseen (Esseen, 1942). A version with an explicit constant given below is due to Shevtsova (2011).
Fact 1** (Berry-Esseen bound).**
Assume that is a sequence of i.i.d. copies of a random variable with mean , variance and such that . Then
[TABLE]
where and is the cumulative distribution function of the standard normal random variable.
The following generalization of Berry-Esseen bound is due to Petrov (1995).
Fact 2** (Generalization of Berry-Esseen bound).**
Assume that is a sequence of i.i.d. copies of a random variable with mean , variance and such that for some . Then there exists an absolute constant such that
[TABLE]
Next, we recall a well-known concentration inequality.
Fact 3** (Bounded difference inequality).**
Let be i.i.d. random variables, and assume that , where is such that for all and all ,
[TABLE]
Then
[TABLE]
and
[TABLE]
Finally, we recall the definition of a U-statistic. Let be a measurable function of variables, and
[TABLE]
A U-statistic of order with kernel based on the i.i.d. sample is defined as (Hoeffding, 1948)
[TABLE]
Clearly, , moreover, has the smallest variance among all unbiased estimators. The following analogue of fact 3 holds for the U-statistics:
Fact 4** (Concentration inequality for U-statistics, (Hoeffding, 1963)).**
**
Assume that the kernel satisfies for all . Then for all ,
[TABLE]
5.2 Proof of Theorem 1.
Observe that
[TABLE]
Let be the distribution function of and - the empirical distribution function corresponding to the sample , that is,
[TABLE]
Suppose that is fixed, and note that is a function of the random variables , and . Moreover, the hypothesis of the bounded difference inequality (fact 3) is satisfied with for , and therefore it implies that
[TABLE]
on the draw of with probability .
Let be such that and . Applying (17) for and together with the union bound, we see that for ,
[TABLE]
on an event of probability . It follows that on , and simultaneously, hence
[TABLE]
by the definition of the median. It remains to estimate and . Assumption 1 implies that
[TABLE]
Hence, it suffices to find such that . Recall that , and let be the solution of the equation
[TABLE]
Note that always exists since by assumption. Finally, since , it is clear that any
[TABLE]
satisfies the requirements. Similarly,
[TABLE]
by assumption 1, hence it is sufficient to choose such that , where satisfies . Noting that and recalling (18), we conclude that
[TABLE]
with probability at least .
5.3 Proof of Theorem 2.13.
We will use notation as in the proof of Theorem 1. Clearly, satisfies the equation , where
[TABLE]
Suppose are such that and . Since is increasing, it is easy to see that . To find such and , we proceed in 3 steps.
(a) First, observe that the bounded difference inequality (fact 3) implies that for any fixed ,
[TABLE]
with probability .
(b) Next, we will find an upper bound for
[TABLE]
where are independent. Note that for any bounded non-negative function and a signed measure ,
[TABLE]
Since any bounded function can be written as , we deduce that
[TABLE]
Moreover, if is monotone, the sets and are half-intervals. Applying this to and , we deduce that
[TABLE]
by assumption 1.
(c) In remains to find satisfying
[TABLE]
Let and . Since (where ’s were defined in (7)), it suffices to find such that for all . For any bounded function such that and for , and any ,
[TABLE]
where . Recall that is such that for . It follows that
[TABLE]
where . Next, Lemma B.29 implies that
[TABLE]
Combining the previous two bounds, we deduce that it suffices to find such that
[TABLE]
By our assumptions, . Lemma 2.3 yields that it suffices to take
[TABLE]
The estimate for follows the same pattern, and yields that one can take as
[TABLE]
implying the claim.
5.4 Proof of Theorem 2.15.
Recall that for , and note that under our assumptions, equation has a unique solution (even if is not strictly convex). Next, observe that
[TABLE]
hence it suffices to show that both the left-hand side and the right-hand side of the inequality above converge to for all . We will outline the argument for the left-hand side, and the remaining part is proven in a similar fashion. Note that
[TABLE]
where .
Lemma 5.25**.**
Under the assumptions of Theorem 2.15, and
[TABLE]
where .
Proof 5.26** (of Lemma 5.25).**
Let . Since is convex, its derivative is monotone and continuous almost everywhere (with respect to Lebesgue measure). Together with the assumption that , Lebesgue dominated convergence Theorem implies that
[TABLE]
Next, we will prove the assertion that . It is easy to see that
[TABLE]
Reasoning as in the proof of Theorem 2.13 (see step (b) in section 5.3), we deduce that
[TABLE]
where is the function from assumption 1. Hence, recalling that as , we obtain that
[TABLE]
On the other hand, it follows from (20) that for
[TABLE]
For , it is also clear that . To establish the fact that , note that weak convergence of to the normal law (assumption 1) together with Lebesgue dominated convergence Theorem implies that
[TABLE]
Since , we deduce that
[TABLE]
and the claim follows.
Lemma 5.25 implies that . It remains to apply Lindeberg’s Central Limit Theorem (Serfling, 1981, Theorem 1.9.3) to ’s to deduce the result from equation (19). To this end, we only need to verify the Lindeberg condition requiring that for any ,
[TABLE]
However, since (and hence ) is bounded, (21) easily follows.
5.5 Proof of Theorem 2.18.
The argument is similar to the proof of Theorem 1. Let be the distribution function of and - the empirical distribution function corresponding to the sample of size .
Suppose that is fixed, and note that is a U-statistic with mean . We will apply the concentration inequality for U-statistics (fact 4) with to get that
[TABLE]
with probability ; here, we also used the fact that .
Let be such that and . Applying (22) for and together with the union bound, we see that for ,
[TABLE]
on an event of probability . It follows that on , . The rest of the proof repeats the argument of section 5.2.
5.6 Proof of Theorem 3.20.
Set . Then by the definition. Since is convex, the sufficient and necessary condition for to be its minimizer is that , the subdifferential of at point . It is easy to see that
[TABLE]
where and are the right and left derivative of at point respectively.
Since the subdifferential is convex, it suffices to find points such that for all ,
[TABLE]
This task has already been accomplished in the proof of Theorem 2.13: since are nondecreasing functions, repeating the argument of section 5.3 yields that, on an event of probability , inequalities (23) hold with
[TABLE]
We have thus shown that for each ,
[TABLE]
with probability . Applying the union bound over all , we obtain the result.
5.7 Proof of Lemma 2.3.
It is a simple numerical fact that whenever
[TABLE]
(indeed, this follows since ). Set for brevity. Since , we have
[TABLE]
where the last inequality follows since . Equation (25) implies that . Proceeding again as in (25), we see that
[TABLE]
hence The claim follows since for all by assumption, and .
Acknowledgements
Authors would like to thank Anatoli Juditsky for many insightful comments and suggestions.
Appendix A Central limit theorem for the non-i.i.d. data.
We present an extension of Theorem 2.15 to non-i.i.d. data for the estimator that holds under the following assumptions:
are independent, , and ; 2. 2.
Assumption 1 is satisfied with some and , ; 3. 3.
and as ; 4. 4.
, where is the harmonic mean of ’s.
Theorem A.27**.**
Under assumptions (a)-(e) above,
[TABLE]
Proof A.28**.**
Define , and . We will show that
* as ;* 2. 2.
* as .*
To prove the first claim, first assume that (for the argument follows the same line with simplifications), and observe that
[TABLE]
Moreover,
[TABLE]
while under assumption (d),
[TABLE]
It then follows from assumption (c) that
[TABLE]
Claim (b) follows since and under assumption (d).
The rest of the argument repeats the proof of Theorem 2.15 for .
Appendix B Supplementary results.
Lemma B.29**.**
Let be symmetric, meaning that , and let . Then for all ,
[TABLE]
Proof B.30**.**
Observe that
[TABLE]
and the claim follows.
Lemma B.31**.**
Inequality holds for all . Moreover, if and , then .
Proof B.32**.**
Since for all ,
[TABLE]
Note that is decreasing on . Whenever , , hence .
Appendix C Results for the spatial median with respect to the norm.
In this section, we discuss estimation of the multivariate parameter based on the -median. Let be i.i.d. copies of randomly partitioned into disjoint groups of cardinality each, and let be a sequence of i.i.d. estimators of . We define
[TABLE]
be the median of .
Let have multivariate normal distribution , and define for a Borel measurable set . Moreover, define to be the set of closed cones,
[TABLE]
We will assume that is “asymptotically normal on cones”:
Assumption 2
There exists a sequence and a positive-definite matrix such that and
[TABLE]
Theorem C.33**.**
Let assumption 2 be satisfied. Then with probability ,
[TABLE]
where
[TABLE]
and .
Remark C.34**.**
It follows from Lemma B.31 that whenever the right-hand side of the inequality (28) is bounded by , , which leads to a more explicit bound for .
As an example, we consider the problem of the multivariate mean estimation. Recall that the condition number of a non-singular matrix is defined as .
Corollary C.35**.**
Let be a sequence of i.i.d. copies of a random vector such that , , and . Define
[TABLE]
Assume that and are such that
[TABLE]
Then
[TABLE]
with probability , where and are the same as in Theorem C.33.
Proof C.36**.**
It follows from the multivariate Berry-Esseen bound (fact 5) that assumption 2 is satisfied with , and . Noting that , it is easy to deduce the bound from (28) and remark C.34.
Remark C.37**.**
Note that, similarly to the case , whenever (hence, ), the bound of Corollary C.35 is of order with respect to the sample size . However, dependence of the bound on the dimension factor is suboptimal.
C.1 Overview of numerical algorithms.
Letting , is convex and it achieves its minimum at a unique point (unless are on the same line (Kemperman, 1987)) that belongs to the convex hull of .
The classical algorithm that approximates is the famous Weiszfeld’s algorithm (Weiszfeld, 1936): starting from some in the affine hull of , iterate
[TABLE]
where . H. W. Kuhn (Kuhn, 1973) showed that Weiszfeld’s algorithm converges to the geometric median for all but countably many initial points. It is easy to check that (29) is a gradient descent scheme: indeed, it is equivalent to
[TABLE]
where and is the gradient of (we assume here that ).
Various improvements and accelerated versions of Weiszfeld’s algorithm have been proposed and analyzed. Ostresh (1978) provides a modified version of Weiszfeld’s algorithm that converges to the geometric median under reasonable initialization conditions, but the rate of convergence is not specified. Kärkkäinen and Ayrämö (2005) consider empirical behavior of several modifications of Weiszfeld’s algorithm, and obtains convergence for an SOR method. Vardi and Zhang (2000) demonstrate convergence of another modified Weiszfeld algorithm, but only provides empirical convergence rates. Overton (1983) provides an algorithm that exhibits quadratic convergence under some assumptions, but a quantitative rate is not expressed. Cardot et al. (2013) develops an online stochastic descent algorithms and provides an asymptotic convergence rate. Quantitative error bounds are not available for any of the algorithms discussed so far.
Literature from computer science considers the computational complexity of algorithms for computing such that is close to the minimum value . A thorough comparison of such results is provided by Cohen et al. (2016). The results from this work are fully quantitative, but they need to be adapted to our setting. In our statistical estimation setting, we are using to estimate the true parameter , so we want bounds on the proximity instead of bounds on . The following theorem (proven in Section D.3) provides a “local lower bound.”
Theorem C.38**.**
Suppose , let , set for , and assume that the empirical covariance matrix satisfies
[TABLE]
where are the eigenvalues of listed with multiplicity and in non-increasing order. Then, for all ,
[TABLE]
where
[TABLE]
Theorem C.38 allows us to infer proximity bounds from all the computer science literature that discusses value bounds. Moreover, this bound is asymptotically stable in the i.i.d. sampling setting assuming the existence of three moments. For small , the lower bound is approximately quadratic, and hence the proximity bound behaves like . On the other hand, this local lower bound fits in well with the theory of Restarted Gradient Descent (Yang and Lin, 2015).
Appendix D Proofs of results in Appendix C.
D.1 Technical background.
Everywhere below, stands for the distribution of the normal vector with mean 0 and covariance matrix . The following multivariate version of the Berry-Esseen Theorem for convex sets has been established by Bentkus (2003).
Fact 5** (Multivariate Berry-Esseen bound).**
Assume that is a sequence of i.i.d. copies of a random vector with mean , covariance matrix and such that . Let have normal distribution , and be the class of all convex subsets of . Then
[TABLE]
where .
Given a metric space , the covering number is defined as the smallest such that there exists a subset of cardinality with the property that for all , . When metric is clear from the context, we will simply write .
Let be a stochastic process indexed by . We will say that it has sub-Gaussian increments with respect to metric if for all and ,
[TABLE]
Fact 6** (Dudley’s entropy bound).**
Let be a centered stochastic process with sub-Gaussian increments. Then the following inequality holds:
[TABLE]
where is the diameter of the space with respect to .
Proof D.39**.**
See (Talagrand, 2005).
Finally, we recall two useful facts related to Vapnik-Chervonenkis (VC) combinatorics (see van der Vaart and Wellner, 1996, for the definition of VC dimension and related theory). Let be a finite-dimensional vector space of real functions on .
Fact 7**.**
Let and Then
[TABLE]
Proof D.40**.**
See Proposition 3.6.6 in (Giné and Nickl, 2015).
Fact 8**.**
Let be a class of sets of VC-dimension . Then, for any probability measure ,
[TABLE]
for all ;
Proof D.41**.**
This bound follows from results of R. Dudley (Dudley, 1978) and D. Haussler (Haussler, 1995). The bound with explicit constants as stated above is given in (Pollard, 2000).
D.2 Proof of Theorem C.33.
By the definition of the geometric median,
[TABLE]
hence
[TABLE]
Set . Then (32) is equivalent to
[TABLE]
Denote by the distribution of , and by - the empirical distribution corresponding to the sample
[TABLE]
Let be the directional derivative of at point in direction . Clearly, for any such that . On the other hand, it is easy to check that , where
[TABLE]
Let be the set of closed cones defined in (27), and note that for any unit vector and ,
[TABLE]
Next, observe that
[TABLE]
We will assume that is chosen such that (if not, simply replace by ). Then (34) implies that
[TABLE]
It remains to estimate the left-hand side of inequality (35) from below and its right-hand side from above. We start by finding an upper bound (proved in section D.2.1) for .
Lemma D.42**.**
The following bound holds:
[TABLE]
where was defined in assumption 2.
The next Lemma (proved in section D.2.2) provides an upper bound for .
Lemma D.43**.**
With probability ,
[TABLE]
Finally, it remains to estimate from below. The following inequality (proved in section D.2.3) holds:
Lemma D.44**.**
Set . Then
[TABLE]
where is the hyperbolic tangent defined as .
It therefore follows from Lemmas D.42, D.43 and D.44 that with probability exceeding ,
[TABLE]
which implies the bound of Theorem C.33.
D.2.1 Proof of Lemma D.42.
Recall that for any non-negative function and a signed measure ,
[TABLE]
Hence
[TABLE]
where we used the identity . Next, it follows from (33) that
[TABLE]
by assumption 2. It implies that as claimed.
D.2.2 Proof of Lemma D.43.
Using (36) and proceeding as in the proof of Lemma D.42, we obtain that
[TABLE]
It follows from the bounded difference inequality (fact 3) that for all ,
[TABLE]
hence it is enough to control . To this end, we will estimate the covering numbers of the class of cones and use Dudley’s integral bound (fact 6).
Given a vector , let be its coordinates with respect to the standard Euclidean basis. Note that
[TABLE]
which is equivalent to and , where , and are functions of and , . In particular, every element of is the intersection of a half-space and a set , where is a polynomial of degree in variables. The dimension of the space of polynomials of degree at most is , hence the Vapnik-Chernonenkis dimension of the collection of sets \mathcal{S}_{V_{2,m}}=\Big{\{}\left\{x:f(x)\geq 0\right\},\ f\in V_{2,m}\Big{\}} is by fact 7. It follows from fact 8 that for any probability measure ,
[TABLE]
for all . It is also well known that (and can be deduced from the similar reasoning) that the VC-dimension of a collection of halfspaces of is , hence
[TABLE]
Given two collections of sets , let and be the - nets of smallest cardinality for the classes of functions and respectively. Let , and assume without loss of generality that and . Then
[TABLE]
which implies that the covering number of the class corresponding to intersections of elements of and satisfies
[TABLE]
In particular, the metric entropy of the class of cones can be bounded as
[TABLE]
uniformly over all probability measures , hence fact 6 implies that
[TABLE]
D.2.3 Proof of Lemma D.44.
Making the change of variables , we obtain
[TABLE]
where . Let , and note that since by assumption. Let be any orthogonal transformation that maps to (here, is the standard Euclidean basis of ). Then, letting , we observe that
[TABLE]
Setting , we obtain from the last inequality that
[TABLE]
Set , where and . We will also let denote the density (with respect to Lebesgue measure) of the standard normal distribution on . Then
[TABLE]
Setting , we have that
[TABLE]
Now, for any ,
[TABLE]
hence
[TABLE]
where we have use the inequality whenever and , and . Finally, a well-known bound states that if has distribution, then for all
[TABLE]
For , it implies that
[TABLE]
which concludes the proof.
D.3 Proof of Theorem C.38.
To simplify notation in what follows, we let . We let for all and observe that a weak gradient of is given by
[TABLE]
Hence, is a weak gradient of .
Now, fix with , let , and set . The second fundamental theorem of calculus yields
[TABLE]
In this last line, we have set and . By Cauchy-Schwarz, we have that . If , then
[TABLE]
for all . If , then we have that
[TABLE]
Note that since is the minimizer. Consequently, we have
[TABLE]
Given that
[TABLE]
we obtain the lower bound
[TABLE]
Noting that the inverse cubic function is convex, Jensen’s inequality and straightforward integration yields
[TABLE]
We now observe that
[TABLE]
and also that
[TABLE]
where is an orthonormal basis for . We further observe that
[TABLE]
The Courant-Fischer characterization of eigenvalues gives us
[TABLE]
where are the eigenvalues of the empirical covariance matrix listed with multiplicity and in non-increasing order. We therefore have
[TABLE]
and the result follows.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alon et al. (1996) Alon, N., Matias, Y. and Szegedy, M. (1996) The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing , 20–29. ACM.
- 2Arcones (1996) Arcones, M. A. (1996) The Bahadur-Kiefer representation for U-quantiles. The Annals of Statistics , 24 , 1400–1422.
- 3Battey et al. (2015) Battey, H., Fan, J., Liu, H., Lu, J. and Zhu, Z. (2015) Distributed estimation and inference with statistical guarantees. ar Xiv preprint ar Xiv:1509.05457 .
- 4Bentkus (2003) Bentkus, V. (2003) On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference , 113 , 385–402.
- 5Bentkus et al. (1997) Bentkus, V., Bloznelis, M. and Götze, F. (1997) A Berry–Esséen bound for M-estimators. Scandinavian journal of statistics , 24 , 485–502.
- 6Berry (1941) Berry, A. C. (1941) The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the american mathematical society , 49 , 122–136.
- 7Bickel et al. (1965) Bickel, P. J. et al. (1965) On some robust estimates of location. The Annals of Mathematical Statistics , 36 , 847–858.
- 8Bubeck et al. (2013) Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013) Bandits with heavy tail. IEEE Transactions on Information Theory , 59 , 7711–7717.
