Stein's method for normal approximation in Wasserstein distances with application to the multivariate Central Limit Theorem
Thomas Bonis

TL;DR
This paper develops Stein's method to bound Wasserstein distances for normal approximation, providing optimal convergence rates for the multivariate CLT under minimal moment conditions.
Contribution
It introduces a novel approach using stochastic processes to bound Wasserstein distances of any order, extending Stein's method for multivariate normal approximation.
Findings
Bounds Wasserstein distance of order 2 using stochastic process
Extends bounds to Wasserstein distances of any order p ≥ 1
Provides optimal convergence rates for multivariate CLT
Abstract
We use Stein's method to bound the Wasserstein distance of order $2$ between a measure $\nu$ and the Gaussian measure using a stochastic process $(X_t)_{t \geq 0}$ such that $X_t$ is drawn from $\nu$ for any $t > 0$. If the stochastic process $(X_t)_{t \geq 0}$ satisfies an additional exchangeability assumption, we show it can also be used to obtain bounds on Wasserstein distances of any order $p \geq 1$. Using our results, we provide optimal convergence rates for the multi-dimensional Central Limit Theorem in terms of Wasserstein distances of any order $p \geq 2$ under simple moment assumptions.
DataShape team, Inria Saclay, Université Paris-Saclay, Paris, France
1 Introduction
Consider independent and, for simplicity, identically distributed random variables $(X_i)_{1 \leq i \leq n}$ taking values in $\mathbb{R}^d$ such that $\mathbb{E}[X_1] = 0$ and $\mathbb{E}[X_1 X_1^T] = I_d$. By the Central Limit Theorem, it is well-known that, as $n$ grows to infinity, the law $\nu_n$ of $S_n = n^{-1/2} \sum_{i=1}^n X_i$ converges to the $d$-dimensional Gaussian measure $\gamma$. In order to strengthen this result, one can quantify this convergence for a given distance on the space of measures on $\mathbb{R}^d$. Let us consider the family of Wasserstein distances of order $p \geq 1$, defined between any two measures $\mu$ and $\nu$ with finite moment of order $p$ by
$$W_p(\mu, \nu)^p = \inf_{\pi} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|y - x\|^p \, \pi(dx, dy),$$
where $\|\cdot\|$ denotes the Euclidean norm and $\pi$ is a measure on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$. In the univariate setting, rates of convergence for these distances have been obtained in [12] for and in [2] for . More precisely, for any , there exists a constant such that
[TABLE]
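On the real line, the Wasserstein distance of order $p$ has a closed form through quantile functions, $W_p(\mu, \nu)^p = \int_0^1 |F_\mu^{-1}(u) - F_\nu^{-1}(u)|^p \, du$, which makes the $n^{-1/2}$ rate easy to observe numerically. The sketch below is an illustration only and is not taken from the paper: it assumes Rademacher summands, approximates the integral with a midpoint rule, and the helper names `rademacher_sum_quantile` and `wasserstein_1d` are hypothetical.

```python
import math
from bisect import bisect_left
from statistics import NormalDist


def rademacher_sum_quantile(n):
    """Quantile function of S_n = n^{-1/2} * (sum of n independent Rademacher signs)."""
    atoms = [(2 * k - n) / math.sqrt(n) for k in range(n + 1)]
    cdf, total = [], 0.0
    for k in range(n + 1):
        total += math.comb(n, k) / 2 ** n
        cdf.append(total)

    def quantile(u):
        # Smallest atom whose cumulative probability reaches u.
        return atoms[min(bisect_left(cdf, u), n)]

    return quantile


def wasserstein_1d(quantile_mu, quantile_nu, p=2, grid=20000):
    """Midpoint-rule approximation of W_p on the real line (quantile coupling)."""
    acc = 0.0
    for j in range(grid):
        u = (j + 0.5) / grid
        acc += abs(quantile_mu(u) - quantile_nu(u)) ** p
    return (acc / grid) ** (1.0 / p)


gaussian_quantile = NormalDist().inv_cdf
distances = {
    n: wasserstein_1d(rademacher_sum_quantile(n), gaussian_quantile)
    for n in (4, 16, 64)
}
for n, w in distances.items():
    # Columns: n, approximate W_2, and sqrt(n) * W_2.
    print(n, round(w, 4), round(math.sqrt(n) * w, 4))
```

If the distance decays like $n^{-1/2}$, the last printed column should stay of the same order as $n$ grows. The midpoint rule truncates the Gaussian tails, so the values are approximations rather than exact distances.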
Furthermore, Theorem 5.1 [12] guarantees this bound to be tight in the general case. In the multivariate setting, convergence rates for the Wasserstein distance of order have been obtained under the assumption that with , see [17] and [4], in which case there exists such that
[TABLE]
As this result is short of optimality in the one-dimensional case, it is conjectured in [17] that
[TABLE]
and such a bound is known to be matched thanks to Proposition 2 [17]. Let us note that, since is greater than , this bound scales at least linearly with respect to the dimension which is probably suboptimal in many cases. Indeed, whenever the coordinates of the are i.i.d. random variables with fourth moment equal to , one can use (1) to obtain the following bound, scaling with ,
[TABLE]
This optimal scaling with respect to the dimension as well as the optimal dependency in can be obtained whenever the measure of the satisfies a Poincaré inequality with constant in which case Theorem 4.1 [3] guarantees that
[TABLE]
and similar bounds have also been obtained for Wasserstein distances of any order in [5]. However, for a measure to satisfy a Poincaré inequality is a strong assumption compared to the simple moment assumption required in the univariate case.
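Inequality (3) is derived through an approach introduced in [8] relying on an object called a Stein kernel. Given a probability measure $\nu$ supported on $\mathbb{R}^d$, a Stein kernel for $\nu$ is a matrix-valued function $\tau_\nu$ such that, for any smooth function $\phi$ with compact support,
$$\int_{\mathbb{R}^d} -x \cdot \nabla \phi + \langle \tau_\nu, \nabla^2 \phi \rangle_{HS} \, d\nu = 0,$$
where $\langle \cdot, \cdot \rangle_{HS}$ is the Hilbert-Schmidt scalar product and $\nabla^2 \phi$ denotes the Hessian matrix of $\phi$. Since $\nu$ is equal to the Gaussian measure $\gamma$ if and only if $\tau_\nu = I_d$, one can expect $\nu$ to be close to $\gamma$ whenever $\tau_\nu$ is close to $I_d$. This intuition is formalized by the following bound, obtained in Proposition 3.1 [8],
$$W_2(\nu, \gamma)^2 \leq \int_{\mathbb{R}^d} \|\tau_\nu - I_d\|_{HS}^2 \, d\nu,$$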
where is the Hilbert-Schmidt norm. Furthermore, if also verifies
[TABLE]
for any suitable function , then, by Proposition 3.4 [8], one also has
[TABLE]
where is the Schatten -norm and is a constant depending only on . However, as Stein kernels do not necessarily exist for general measures and can be difficult to compute whenever they do exist, they are not an adequate tool to generalize (1).
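To make the notion of Stein kernel concrete, here is a minimal one-dimensional sketch. It is not taken from the paper. It assumes the uniform law on $[-a, a]$ with $a^2 = 3$ (mean zero, unit variance), for which the standard formula $\tau(x) = p(x)^{-1} \int_x^{\infty} y \, p(y) \, dy$ gives $\tau(x) = (a^2 - x^2)/2$, and it checks the defining identity $\mathbb{E}[\tau(X) \phi''(X)] = \mathbb{E}[X \phi'(X)]$ exactly on monomials $\phi'(x) = x^k$. The function names are hypothetical.

```python
from fractions import Fraction

A_SQUARED = 3  # assumption: uniform law on [-sqrt(3), sqrt(3)], mean 0 and variance 1


def uniform_moment(m):
    """E[X^m] for X uniform on [-a, a] with a^2 = A_SQUARED (exact, m >= 0)."""
    if m % 2 == 1:
        return Fraction(0)
    return Fraction(A_SQUARED ** (m // 2), m + 1)


def stein_residual(k):
    """E[tau(X) phi''(X)] - E[X phi'(X)] for phi'(x) = x^k, tau(x) = (a^2 - x^2) / 2."""
    if k == 0:
        # phi'' vanishes and E[X] = 0.
        return -uniform_moment(1)
    kernel_term = Fraction(k, 2) * (
        A_SQUARED * uniform_moment(k - 1) - uniform_moment(k + 1)
    )
    return kernel_term - uniform_moment(k + 1)


# Mean of the kernel, which equals the variance of X.
kernel_mean = (A_SQUARED - uniform_moment(2)) / 2

print([stein_residual(k) for k in range(8)], kernel_mean)
```

The kernel here is explicit only because the density is simple. For a general measure no such closed form is available, which is the practical limitation discussed above.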
In this work, we wish to apply the approach developed in [8] by replacing Stein kernels with more practical operators satisfying the following property
[TABLE]
where denotes the space of smooth functions with compact support. When an operator verifies this property, in which case we say is invariant under , one can expect to be close to as soon as is similar to the operator defined by
[TABLE]
There are many ways to obtain operators under which is invariant; in fact, such operators have been extensively used in Stein’s method. For instance, the original approach of Stein [14] and its extension to the multidimensional setting [11] use pairs of random variables both drawn from and such that and follow the same law. Given such a pair of random variables , which is called an exchangeable pair, is invariant under the operator defined by
[TABLE]
where is a rescaling factor. This operator can then be compared to using a Taylor expansion. In fact, one does not even need an exchangeable pair to apply Stein’s method in dimension one. Indeed, as shown by [13], one can use two random variables , both drawn from but not necessarily forming an exchangeable pair, to construct operators of the form
[TABLE]
Similarly, many other constructs used to apply Stein’s method such as zero-bias coupling [6] and size-bias coupling [7] correspond to operators under which is invariant.
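The exchangeable-pair construction can be checked exactly on a small example. The sketch below is an illustration under stated assumptions, not the paper's construction: it takes $n = 3$ Rademacher summands, builds $T' $ from $T = \sum_i X_i$ by replacing a uniformly chosen summand with an independent copy, and enumerates all equally likely outcomes. The unnormalized sum $T$ is used instead of $S = n^{-1/2} T$ so that the arithmetic stays exact.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

n = 3
signs = (-1, 1)
joint = Counter()  # counts of (T, T') over all equally likely outcomes
drift = {}         # x -> E[T' - T | X = x]

for x in product(signs, repeat=n):
    t = sum(x)
    increments = []
    for i in range(n):          # index of the resampled summand
        for fresh in signs:     # value of its independent copy
            t_prime = t - x[i] + fresh
            joint[(t, t_prime)] += 1
            increments.append(t_prime - t)
    drift[x] = Fraction(sum(increments), len(increments))

# Exchangeability: (T, T') and (T', T) have the same joint law.
exchangeable = all(joint[(a, b)] == joint[(b, a)] for (a, b) in list(joint))
# Linear regression property: E[T' - T | X] = -T / n.
linear_drift = all(drift[x] == Fraction(-sum(x), n) for x in drift)

print(exchangeable, linear_drift)
```

Both checks hold for this toy pair. The second one is the first-order condition that makes the associated operator resemble the Ornstein-Uhlenbeck generator after rescaling.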
Among these various operators, those defined in (5) are perhaps the easiest to obtain as they can be constructed from any two random variables both drawn from the measure . However, since there is no notion of primitive functions in higher dimension, such operators are restricted to the univariate setting. Still, in the multidimensional setting, one can use any two random variables and drawn from to define an operator under which is invariant by taking
[TABLE]
Then, given any , one can use a Taylor expansion to obtain
[TABLE]
Thus, one can expect that if
- •
;
- •
and
- •
then would be similar to and thus be close to . However, one cannot prove such a result by applying the approach of [8] to such operators. Instead, we use stochastic processes such that is drawn from for any and such that does not grow too fast with respect to to define a family of operators under which is invariant by taking
[TABLE]
In Theorem 2, we derive bounds for the Wasserstein distance of order between and the Gaussian measure from such a family of operators. We also provide bounds on Wasserstein distances of any order for one-dimensional normal approximation in Theorem 7 and for multidimensional normal approximation in Theorem 9. This latter result uses a family of operators of the form (4) and thus requires the pairs and to follow the same law for any . Let us note that, while we mostly focus on operators defined in (7), proofs of our results can easily be adapted to other operators under which is invariant such as size-bias or zero-bias couplings.
Our results can be readily applied to obtain rates in the Central Limit Theorem. Indeed, letting be independent copies of and be a uniform random variable on , the stochastic process defined by
[TABLE]
is such that and follow the same law for any . Applying our results to this stochastic process, we obtain the following bounds.
Theorem 1.
Under the above setting, if , then there exists such that
[TABLE]
Furthermore, if for , then there exists depending only on and such that
[TABLE]
This result both proves (2) and generalizes (1). However, our bound still scales at least linearly with respect to the dimension and thus fails to generalize (3) which can scale with . Our approach can also be used to obtain more general results, presented in Theorems 11 and 12, which only require the random variables to be independent and provide intermediary rates of convergence under weaker moment assumptions.
The paper is organized as follows. In Section 2, we introduce the notations used in the paper. In Section 3, we present the main arguments we use to apply Stein’s method and obtain bounds on the Wasserstein distance of order in normal approximation. The approach followed to obtain bounds on Wasserstein distances of any order is then detailed in Section 4. The computations required to apply our general Wasserstein bounds to obtain rates of convergence in the Central Limit Theorem are presented in Sections 5 and 6. Finally, Sections 7 and 8 contain technical results and approximation arguments used in the course of this paper.
2 Notations and definitions
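Let $d$ be a positive integer. A $d$-dimensional multi-index is a $d$-tuple of non-negative integers
$$\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d.$$
The absolute value of a multi-index $\alpha$ is given by
$$|\alpha| = \sum_{i=1}^d \alpha_i$$
and its factorial by
$$\alpha! = \prod_{i=1}^d \alpha_i!.$$
For any $x \in \mathbb{R}^d$ and any multi-index $\alpha$, let
$$x^\alpha = \prod_{i=1}^d x_i^{\alpha_i}.$$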
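The multi-index notation above translates directly into code. The following sketch is illustrative and uses hypothetical helper names. As a sanity check it evaluates the multinomial theorem, $(x_1 + \dots + x_d)^k = \sum_{|\alpha| = k} \frac{k!}{\alpha!} x^\alpha$, which is a standard identity and not a statement from the paper.

```python
from itertools import product
from math import factorial, prod


def multi_indices(d, k):
    """All d-dimensional multi-indices with absolute value k."""
    return [a for a in product(range(k + 1), repeat=d) if sum(a) == k]


def abs_value(alpha):
    """|alpha| = alpha_1 + ... + alpha_d."""
    return sum(alpha)


def multi_factorial(alpha):
    """alpha! = alpha_1! * ... * alpha_d!."""
    return prod(factorial(a) for a in alpha)


def power(x, alpha):
    """x^alpha = x_1^alpha_1 * ... * x_d^alpha_d."""
    return prod(xi ** ai for xi, ai in zip(x, alpha))


def multinomial_expansion(x, k):
    """Sum over |alpha| = k of (k! / alpha!) * x^alpha."""
    return sum(
        factorial(k) // multi_factorial(a) * power(x, a)
        for a in multi_indices(len(x), k)
    )


print(multi_indices(2, 2), multinomial_expansion((1, 2, 3), 4))
```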
For any , we denote by the family indexed by multi-indices with absolute value and such that
[TABLE]
In this work, we identify any symmetric matrix to the family indexed by multi-indices with absolute value by taking when and when . Let be the Hilbert Schmidt scalar product defined between any two families by
[TABLE]
and, by extension,
[TABLE]
Let us remark that, for any , we have
[TABLE]
Let be the set of functions from to with partial derivatives of order and by the set of such functions with compact support. For any multi-index and any , let
[TABLE]
Let be the -th gradient of at defined by
[TABLE]
Let $\gamma$ denote the $d$-dimensional Gaussian measure and let $\mathcal{L}_\gamma$ be the operator defined by
$$\mathcal{L}_\gamma \phi = -x \cdot \nabla \phi + \Delta \phi.$$
This operator is the infinitesimal generator of the Ornstein-Uhlenbeck semigroup whose reversible measure is $\gamma$; see e.g. [1] for a thorough presentation of this semigroup and its properties.
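The invariance of the Gaussian measure under the Ornstein-Uhlenbeck generator can be verified exactly on polynomials. The sketch below works in dimension one, where the generator reads $\phi \mapsto \phi'' - x \phi'$, and uses the Gaussian moments $\mathbb{E}[Z^{2m}] = (2m - 1)!!$. It is an illustration with hypothetical helper names. The Rademacher comparison is an added assumption that shows a non-Gaussian law failing the same identity.

```python
def gaussian_moment(k):
    """E[Z^k] for Z ~ N(0, 1): zero for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result


def derivative(coeffs):
    """Derivative of the polynomial sum_k coeffs[k] * x^k."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def ou_generator(coeffs):
    """Coefficients of phi'' - x * phi' for the polynomial phi."""
    first = derivative(coeffs)
    second = derivative(first)
    out = [0] * max(len(coeffs), 1)
    for k, c in enumerate(second):
        out[k] += c
    for k, c in enumerate(first):
        out[k + 1] -= c
    return out


def gaussian_mean(coeffs):
    """Integral of a polynomial against the standard Gaussian measure."""
    return sum(c * gaussian_moment(k) for k, c in enumerate(coeffs))


def rademacher_mean(coeffs):
    """Integral of a polynomial against the uniform law on {-1, 1}."""
    return sum(c for k, c in enumerate(coeffs) if k % 2 == 0)


phi = [5, -2, 0, 7, 1, 3]  # 5 - 2x + 7x^3 + x^4 + 3x^5
print(gaussian_mean(ou_generator(phi)), rademacher_mean(ou_generator(phi)))
```

The first printed value vanishes for every polynomial `phi`, whereas the second one generally does not, which is the sense in which the operator characterizes the Gaussian measure.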
3 Bounds for the Wasserstein distance of order 2
In this Section, we prove the following result.
Theorem 2.
Let be a probability measure on with finite second moment and let be a stochastic process such that is drawn from for any . Suppose that
[TABLE]
Then, for any ,
[TABLE]
where
[TABLE]
Let be a measure on and let be a stochastic process such that is drawn from for any . Let us assume the measure admits a density with respect to such that for some constant and and suppose the stochastic process is bounded for any . Let us note that, while such assumptions imply a Stein kernel exists, approximation arguments developed in Section 8 allow us to lift them in favor of the weaker (10).
For , let be the measure with density . Since is the reversible measure of , converges to when grows to infinity. One can thus bound by controlling for any and letting grow. To this end, we use the following inequality, obtained in Lemma 2 [9],
[TABLE]
which yields
[TABLE]
The quantity is the Fisher information of the measure with respect to . In Proposition 2.4 [8], this quantity is bounded using Stein kernels. In this work, we bound using the stochastic process .
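The convergence of the interpolated measures toward the Gaussian measure can be observed on moments. The sketch below is not taken from the paper. It assumes the standard Mehler representation, under which the measure at time $t$ is the law of $F_t = e^{-t} X + \sqrt{1 - e^{-2t}} Z$ with $X$ drawn from the initial measure and $Z$ an independent standard Gaussian, and it takes $X$ Rademacher in dimension one as a toy case. For that choice a direct expansion gives $\mathbb{E}[F_t^2] = 1$ and $\mathbb{E}[F_t^4] = 3 - 2 e^{-4t}$.

```python
import math


def gaussian_moment(k):
    """E[Z^k] for Z ~ N(0, 1)."""
    if k % 2 == 1:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result


def rademacher_moment(k):
    """E[X^k] for X uniform on {-1, 1}."""
    return 1 if k % 2 == 0 else 0


def interpolated_moment(k, t):
    """E[F_t^k] with F_t = exp(-t) X + sqrt(1 - exp(-2t)) Z, X and Z independent."""
    a = math.exp(-t)
    b = math.sqrt(1.0 - math.exp(-2.0 * t))
    return sum(
        math.comb(k, j) * a ** j * b ** (k - j)
        * rademacher_moment(j) * gaussian_moment(k - j)
        for j in range(k + 1)
    )


for t in (0.0, 0.5, 1.0, 2.0):
    # Columns: t, second moment, fourth moment.
    print(t, interpolated_moment(2, t), interpolated_moment(4, t))
```

The second moment stays equal to one along the flow while the fourth moment relaxes exponentially fast to the Gaussian value three.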
Proposition 3.
Under the above setting, we have
[TABLE]
where is defined in Theorem 2.
As injecting this bound in (11) and using the approximation arguments of Section 8 concludes the proof of Theorem 2, the remainder of this Section is dedicated to the proof of this Proposition.
Let and let . By Equation (2.12) [8], we have
[TABLE]
Hence, if an operator verifies
[TABLE]
then
[TABLE]
Now, let and let be the operator such that, for any and any ,
[TABLE]
Since and are drawn from the same law, integrating this operator with respect to gives
[TABLE]
Let us rewrite using a Taylor expansion.
Lemma 4.
Let be a bounded and measurable function and let and be a multi-index. Under the above setting, we have that
[TABLE]
exists and that
[TABLE]
We delay the proof of this result to Section 7.1. Let be an integer. After rearranging terms, we have
[TABLE]
Thus,
[TABLE]
Then, by (12),
[TABLE]
Let be a bounded and measurable function. By Equation (2.7.3) [1],
[TABLE]
In particular if is a function such that is bounded, we have . For any multi-index , let be the multivariate Hermite polynomial of index , defined for any by
[TABLE]
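In dimension one, the probabilists' Hermite polynomials satisfy the recurrence $He_{k+1}(x) = x \, He_k(x) - k \, He_{k-1}(x)$ and the orthogonality relation $\mathbb{E}[He_j(Z) He_k(Z)] = k! \, \delta_{jk}$ for a standard Gaussian $Z$. Under the usual convention, multivariate Hermite polynomials are products of one-dimensional ones over the coordinates. The sketch below is an illustration with hypothetical helper names that checks the one-dimensional relations exactly with integer arithmetic.

```python
from math import factorial


def hermite(k):
    """Coefficients of the probabilists' Hermite polynomial He_k, lowest degree first."""
    prev, cur = [1], [0, 1]
    if k == 0:
        return prev
    for j in range(1, k):
        nxt = [0] + cur              # x * He_j
        for i, c in enumerate(prev):
            nxt[i] -= j * c          # minus j * He_{j-1}
        prev, cur = cur, nxt
    return cur


def multiply(p, q):
    """Product of two polynomials given by coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def gaussian_moment(k):
    """E[Z^k] for Z ~ N(0, 1)."""
    if k % 2 == 1:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result


def inner(j, k):
    """E[He_j(Z) He_k(Z)] for Z ~ N(0, 1)."""
    coeffs = multiply(hermite(j), hermite(k))
    return sum(c * gaussian_moment(m) for m, c in enumerate(coeffs))


print(hermite(3), [inner(k, k) for k in range(5)], inner(2, 4))
```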
Let be a bounded function. For any multi-index , starting with (15) and integrating times with respect to the Gaussian measure, we obtain
[TABLE]
Since Hermite polynomials form an orthogonal basis of with norms
[TABLE]
applying (16) to the vector field yields, for any and any multi-index ,
[TABLE]
Therefore,
[TABLE]
Now, let
[TABLE]
Applying the Cauchy-Schwarz inequality to (14) and using (17), we obtain
[TABLE]
Then, since ,
[TABLE]
Finally, since is finite,
[TABLE]
and rearranging terms in using (13) concludes the proof of Proposition 3.
4 Gaussian measure and Wasserstein distances of any order
Let and let be a measure on . Let us assume the measure admits a density with respect to such that with and .
In order to bound the distance between and the -dimensional Gaussian measure , it is possible to use Stein kernels to obtain a version of the score function [8]. Indeed, by Section 3 [16], this score function can be used to bound the Wasserstein distances between and as
[TABLE]
leading to
[TABLE]
Let us provide a version of . Let be a Gaussian random variable, be a random variable drawn from and let .
Lemma 5.
Let . Then, under the above notations,
[TABLE]
is a version of .
Proof.
Let . Integrating by parts with respect to , we have, for any ,
[TABLE]
Thus,
[TABLE]
In fact, this property completely characterizes : if another vector field satisfies
[TABLE]
then
[TABLE]
implying that almost everywhere with respect to the measure .
Now, let . Integrating by parts with respect to the Gaussian measure, we have
[TABLE]
implying that it is a version of . ∎
Bounding can thus be achieved by estimating , where is defined in Lemma 5. To this end, suppose there exists a quantity such that almost surely. Then,
[TABLE]
and, by Jensen’s inequality,
[TABLE]
Therefore, if such a quantity is close to then is small and, by (18), so is . Before showing how to compute such quantities in the following Sections, let us state the following result, proved in Section 7.2.
Lemma 6.
Let be a normal random variable and let . Then,
[TABLE]
4.1 One-dimensional case
In this Section, we bound the distance between and in the case and obtain the following result.
Theorem 7.
Let and let be a probability measure on with finite moment of order . Let be a stochastic process such that is drawn from for any . Suppose that
[TABLE]
Then, for any ,
[TABLE]
where
[TABLE]
Let and let be a stochastic process such that for any , is drawn from and is bounded. Again, thanks to approximation arguments developed in Section 8, this assumption as well as the assumptions made on the smoothness of the measure can be lifted in favor of the more general (21). For now, let us start by using to obtain a quantity such that .
Lemma 8.
Let . Letting
[TABLE]
where is the one-dimensional -th Hermite polynomial, we have
[TABLE]
Proof.
Let . For any , we denote by the -th derivative of . Let . Since and are independent, applying (16) yields
[TABLE]
Thus,
[TABLE]
Now, let be a primitive function of . By Lemma 4, the function satisfies
[TABLE]
Then, since and are both drawn from ,
[TABLE]
implying that almost surely. ∎
Returning to the proof of Theorem 7, letting and using Lemma 8 along with Lemma 5 and Jensen’s inequality, we obtain
[TABLE]
Then, by Lemma 6,
[TABLE]
where
[TABLE]
Finally, by (18),
[TABLE]
and using approximation arguments concludes the proof of Theorem 7.
4.2 Multi-dimensional case
Unfortunately, it is not possible to use a multi-dimensional generalization of the random vector defined in Lemma 8 as we would only be able to show that
[TABLE]
which is not sufficient to assert that . Instead, one can add an exchangeability assumption on the stochastic process to obtain the following result.
Theorem 9.
Let and let be a probability measure on with finite moment of order . Let be a stochastic process such that is drawn from and such that the pairs and follow the same law for any . Suppose that, for any ,
[TABLE]
Then, for any ,
[TABLE]
where
[TABLE]
Let and let be a stochastic process such that, for any , and follow the same law and is bounded. Again, this last assumption as well as our previous smoothness assumptions on the measure can be replaced by (22) thanks to approximation arguments derived in Section 8. Let us start by using the stochastic process to define a quantity such that .
Lemma 10.
Let . The quantity
[TABLE]
satisfies
[TABLE]
Proof.
Let . We have
[TABLE]
Hence, by (16),
[TABLE]
Let . By Lemma 4, we have
[TABLE]
Then, since the pairs and follow the same law,
[TABLE]
and thus . ∎
Returning to the proof of Theorem 9 and using Lemma 10 along with Lemma 5 and Jensen’s inequality, we obtain
[TABLE]
Thus, by Lemma 6,
[TABLE]
where
[TABLE]
Then, injecting this bound in (18) yields
[TABLE]
Finally, rearranging terms in using (13) and using approximation arguments concludes the proof of Theorem 9.
5 Central Limit Theorem for the $W_2$ distance
Let and be independent random variables taking values in and such that
- •
;
- •
and
- •
.
It is known that the measure of the random variable converges to the Gaussian measure . The remainder of this Section is dedicated to quantifying this convergence for the Wasserstein distance of order in order to obtain the following result.
Theorem 11.
Under the above setting, taking
[TABLE]
we have, for any ,
[TABLE]
Let be independent copies of the variables . For any , let and
[TABLE]
where is a uniform random variable taking values in and denotes the maximum between and .
For any , is drawn from the same measure as and . Thus, we can apply Theorem 2 to the measure of using the stochastic process with to obtain
[TABLE]
where
[TABLE]
Let us bound for . First, since and are independent,
[TABLE]
Then, since and since and are independent,
[TABLE]
and, since ,
[TABLE]
Now, taking
[TABLE]
and applying Jensen’s inequality yields
[TABLE]
Therefore,
[TABLE]
From here, developing the squared terms and using the independence of the and , we obtain
[TABLE]
where, for any ,
[TABLE]
and
[TABLE]
5.1 Bounding
Let and let be an odd integer. Since and are i.i.d.,
[TABLE]
Let us now deal with . Since , we have
[TABLE]
First, since is positive,
[TABLE]
Now, taking , we have
[TABLE]
and, since this bound is valid for any ,
[TABLE]
Similarly, for any even integer and any ,
[TABLE]
leading to
[TABLE]
Let us introduce the quantity defined for by
[TABLE]
and, for any , by
[TABLE]
By combining our bounds on the , we obtain
[TABLE]
5.2 Bounding
Again, taking , we have
[TABLE]
Then,
[TABLE]
Finally, for any integer ,
[TABLE]
Overall, letting
[TABLE]
we obtain
[TABLE]
5.3 Integration with respect to $t$
Thanks to the previous computations, we have
[TABLE]
and thus
[TABLE]
The next step of the proof consists in integrating with respect to . First,
[TABLE]
And, since , we have, by Jensen’s inequality,
[TABLE]
Let us now deal with the remaining term. Let us first assume that . Taking , we have
[TABLE]
Since , taking , we have and
[TABLE]
If , performing the same computations with for yields
[TABLE]
Finally, if ,
[TABLE]
Then, taking ,
[TABLE]
which concludes the proof of Theorem 11.
5.4 Simplifications whenever
Let us now assume that for any and let . We have
[TABLE]
Furthermore,
[TABLE]
leading to
[TABLE]
Similarly,
[TABLE]
Therefore, taking
[TABLE]
we have
[TABLE]
Finally, remarking that and that for all whenever are identically distributed concludes the proof of (8).
6 Rates of the multi-dimensional CLT for $W_p$ distances
Let and . Let be independent random variables taking values in and such that
- •
;
- •
and
- •
.
The aim of this Section is to prove the following result.
Theorem 12.
Under the above setting, taking
[TABLE]
we have that there exists such that
[TABLE]
Taking as in the previous Section, we have that and follow the same law for any . Therefore, we can apply Theorem 9 and perform computations similar to those of the previous Section in order to obtain
[TABLE]
with
[TABLE]
Then, using a multi-dimensional version of Rosenthal’s inequality such as Theorem 5.2 [10], we obtain that there exists such that
[TABLE]
where the and are the same as in the previous Section and
[TABLE]
Then, using arguments similar to the ones used to bound the ,
[TABLE]
Therefore, there exists such that
[TABLE]
and integrating with respect to following the arguments of the previous Section concludes the proof of Theorem 12 while (9) is obtained following the same computations as in Section 5.4.
7 Technical results
In this Section, we provide the proofs of the intermediary results used to derive Theorems 2, 7 and 9.
7.1 Proof of Lemma 4
Let be a bounded and measurable function on , let and let be a multi-index. By (16), we have
[TABLE]
and, since is bounded, there exists such that
[TABLE]
Then, since is bounded as well, we have that there exists such that
[TABLE]
almost surely. Therefore
[TABLE]
and
[TABLE]
exists.
Now, using a Taylor expansion with remainder, we obtain that there exists on the segment such that
[TABLE]
From here, we have
[TABLE]
Then, by (25),
[TABLE]
and
[TABLE]
7.2 Proof of Lemma 6
Let such that for any multi-index and let be a Gaussian random variable. Let us start with the case . By Jensen’s inequality,
[TABLE]
Then, since for any two different multi-indices ,
[TABLE]
Now, let and . Since the Ornstein-Uhlenbeck semigroup is hypercontractive (see e.g. Theorem 5.2.3 [1]), we have
[TABLE]
This inequality can be readily extended to vector-valued functions , in which case we have
[TABLE]
For any multi-index , the Hermite polynomial is an eigenvector of with eigenvalue . Therefore,
[TABLE]
concluding the proof.
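The eigenvector property used in this proof can be checked exactly in dimension one, where the Ornstein-Uhlenbeck generator reads $\phi \mapsto \phi'' - x \phi'$ and the $k$-th Hermite polynomial has eigenvalue $-k$. The sketch below is an illustration with hypothetical helper names. It builds the polynomials from the standard three-term recurrence and compares coefficients.

```python
def hermite(k):
    """Coefficients of the probabilists' Hermite polynomial He_k, lowest degree first."""
    prev, cur = [1], [0, 1]
    if k == 0:
        return prev
    for j in range(1, k):
        nxt = [0] + cur              # x * He_j
        for i, c in enumerate(prev):
            nxt[i] -= j * c          # minus j * He_{j-1}
        prev, cur = cur, nxt
    return cur


def derivative(coeffs):
    """Derivative of the polynomial sum_k coeffs[k] * x^k."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def ou_generator(coeffs):
    """Coefficients of phi'' - x * phi' for the polynomial phi."""
    first = derivative(coeffs)
    second = derivative(first)
    out = [0] * max(len(coeffs), 1)
    for k, c in enumerate(second):
        out[k] += c
    for k, c in enumerate(first):
        out[k + 1] -= c
    return out


def is_eigenvector(k):
    """Check that the generator maps He_k to -k * He_k."""
    h = hermite(k)
    return ou_generator(h) == [-k * c for c in h]


print([is_eigenvector(k) for k in range(8)])
```

Since the semigroup is the exponential of its generator, this relation is what turns the action of the semigroup on a Hermite expansion into a multiplication of the $k$-th coefficient by $e^{-kt}$.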
8 Approximation arguments
In this Section, we present the approximation arguments necessary to conclude the proof of Theorem 9. Similar arguments can be used to obtain Theorems 2 and 7.
Suppose the measure and the stochastic process satisfy the assumptions of Theorem 9. Let and
[TABLE]
Let and . For any , let be the orthogonal projection of on , the ball of radius centered at the origin. Let be a standard normal random variable, be a random variable with smooth density and taking values in the ball of radius and let be a Bernoulli random variable with parameter such that and are independent. Finally, let . For any , let
[TABLE]
Let be the law of . This measure admits a density with respect to the measure such that with . Furthermore, for any , and follow the same law. Therefore, we can follow the computations of Section 4.2 and use the triangle inequality to obtain
[TABLE]
where
[TABLE]
and
[TABLE]
First, since admits a finite moment of order , there exists such that
[TABLE]
Then, since is the orthogonal projection of on ,
[TABLE]
and, since admits a finite moment of order , there exists such that
[TABLE]
Therefore, there exists such that
[TABLE]
Now, let
[TABLE]
By the triangle inequality, we have that
[TABLE]
and, since ,
[TABLE]
Finally, let
[TABLE]
Since and are independent and since is -measurable, we have
[TABLE]
From here,
[TABLE]
Thus, applying the triangle inequality and Jensen’s inequality yields
[TABLE]
Since is the orthogonal projection of on the convex set , we have and . Hence,
[TABLE]
By (22), there exists , depending on , such that, for any ,
[TABLE]
Hence, using Hölder’s inequality, we obtain that there exists such that
[TABLE]
Combining this bound with (26), (27), (28), (29) and (30), we obtain that there exists and such that
[TABLE]
Since has a finite moment of order and since , letting go to infinity and go to zero yields
[TABLE]
On the other hand, when goes to infinity, we have that converges weakly to and the -moment of converges to the -moment of . Thus, by Theorem 6.9 [15], converges to zero as goes to infinity. Therefore,
[TABLE]
concluding the proof of Theorem 9.
Acknowledgements
The author would like to thank Michel Ledoux for his many comments and advice regarding the writing of this paper, as well as Jérôme Dedecker, Yvik Swan, Frédéric Chazal and anonymous reviewers for their many remarks.
References
- [1] Bakry, D., Gentil, I., Ledoux, M.: Analysis and Geometry of Markov Diffusion Operators. Grundlehren der mathematischen Wissenschaften, vol. 348. Springer (2014)
- [2] Bobkov, S.G.: Entropic approach to E. Rio’s central limit theorem for $W_2$ transport distance. Statistics and Probability Letters 83(7), 1644–1648 (2013)
- [3] Courtade, T.A., Fathi, M., Pananjady, A.: Existence of Stein kernels under a spectral gap, and discrepancy bound. arXiv e-prints (2017)
- [4] Eldan, R., Mikulincer, D., Zhai, A.: The CLT in high dimensions: quantitative bounds via martingale embedding. arXiv e-prints (2018)
- [5] Fathi, M.: Stein kernels and moment maps. arXiv e-prints (2018)
- [6] Goldstein, L., Reinert, G.: Stein’s method and the zero bias transformation with application to simple random sampling. Ann. Appl. Probab. 7(4), 935–952 (1997)
- [7] Goldstein, L., Rinott, Y.: Multivariate normal approximations by Stein’s method and size bias couplings. Journal of Applied Probability 33, 1–17 (1996)
- [8] Ledoux, M., Nourdin, I., Peccati, G.: Stein’s method, logarithmic Sobolev and transport inequalities. Geometric and Functional Analysis 25(1), 256–306 (2015)
