Higher-order Stein kernels for Gaussian approximation
Max Fathi

TL;DR
This paper introduces higher-order Stein kernels for Gaussian approximation, extending classical kernels with higher derivatives, leading to improved convergence rates in the multidimensional CLT under certain conditions.
Contribution
The paper develops a new class of higher-order Stein kernels, establishing their properties and applications to enhance convergence rate bounds in the CLT.
Findings
New explicit rates of convergence in the multidimensional CLT.
Relations between higher-order Stein discrepancies and probability metrics.
Functional inequalities involving higher-order Stein kernels.
Abstract
We introduce higher-order Stein kernels relative to the standard Gaussian measure, which generalize the usual Stein kernels by involving higher-order derivatives of test functions. We relate the associated discrepancies to various metrics on the space of probability measures and prove new functional inequalities involving them. As an application, we obtain new explicit improved rates of convergence in the classical multidimensional CLT under higher moment and regularity assumptions.
1 Introduction
Stein’s method is a set of techniques, originating in works of Stein [31, 32], to bound distances between probability measures. We refer to [13, 29] for a recent overview of the field. The purpose of this work is a generalization of one particular way of implementing Stein’s method when the target measure is Gaussian, which is known as the Stein kernel approach.
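Let $\nu$ be a probability measure on $\mathbb{R}^d$. A matrix-valued function $\tau_\nu : \mathbb{R}^d \to \mathcal{M}_d(\mathbb{R})$ is said to be a Stein kernel for $\nu$ (with respect to the standard Gaussian measure $\gamma$ on $\mathbb{R}^d$) if for any smooth test function $\varphi$ taking values in $\mathbb{R}^d$, we have
$$\int x \cdot \varphi \, d\nu = \int \langle \tau_\nu, \nabla \varphi \rangle_{HS} \, d\nu. \tag{1}$$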
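For applications, it is generally enough to consider the restricted class of test functions $\varphi$ satisfying $\int (|\varphi|^2 + \|\nabla \varphi\|_{HS}^2) \, d\nu < \infty$, in which case both integrals in (1) are well-defined as soon as $\tau_\nu \in L^2(\nu)$, provided $\nu$ has finite second moments.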
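The motivation behind the definition is that, since the standard centered Gaussian measure $\gamma$ is the only probability distribution on $\mathbb{R}^d$ satisfying the integration by parts formula
$$\int x \cdot \varphi \, d\gamma = \int \operatorname{div}(\varphi) \, d\gamma, \tag{2}$$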
the Stein kernel $\tau_\nu$ coincides with the identity matrix, denoted by $\mathrm{Id}$, if and only if the measure $\nu$ is equal to $\gamma$. Hence, a Stein kernel can be used to control how far $\nu$ is from being a standard Gaussian measure in terms of how much it violates the integration by parts formula (2). This notion appears implicitly in many works on Stein’s method, and has recently been the topic of more direct investigations [3, 12, 27, 22, 16].
However, (2) is not the only integration by parts formula that characterizes the Gaussian measure. For example, in dimension one, the standard Gaussian measure is characterized by the relation
$$\int H_k(x) f(x) \, d\gamma(x) = \int f^{(k)}(x) \, d\gamma(x)$$
for all smooth test functions $f$, where the $H_k$ are the Hermite polynomials $H_k(x) = (-1)^k e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2/2}$. The case $k = 1$ corresponds to the standard formula (2). While these are not the only integration by parts formulas one could state, they are in some sense the most natural ones, due to the role Hermite polynomials play as eigenfunctions of the Ornstein-Uhlenbeck generator.
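The one-dimensional Hermite integration by parts relation can be checked numerically. The following is a minimal sketch, assuming the probabilists' Hermite polynomials as implemented in NumPy's `hermite_e` module and an arbitrary polynomial test function $f(x) = x^6$; the helper names `gaussian_mean` and `hermite_ibp_sides` are hypothetical and not taken from the paper.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Gauss-Hermite nodes/weights for the weight exp(-x^2/2); dividing by
# sqrt(2*pi) turns the rule into an expectation under N(0, 1).
NODES, WEIGHTS = hermegauss(40)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def gaussian_mean(values):
    """Approximate E[g(G)] from the values of g at the quadrature nodes."""
    return float(np.dot(WEIGHTS, values))


def hermite_ibp_sides(f, k):
    """Return (E[H_k(G) f(G)], E[f^(k)(G)]) for a polynomial f."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0  # selects the probabilists' Hermite polynomial He_k
    lhs = gaussian_mean(hermeval(NODES, coeffs) * f(NODES))
    rhs = gaussian_mean(f.deriv(k)(NODES))
    return lhs, rhs


f = Polynomial([0, 0, 0, 0, 0, 0, 1])  # f(x) = x^6, an arbitrary test function
for k in range(1, 5):
    lhs, rhs = hermite_ibp_sides(f, k)
    print(k, lhs, rhs)  # the two sides should agree for every k
```

Since the quadrature rule is exact for polynomials of degree below 80, the two sides agree up to rounding error for this choice of $f$.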
Before defining higher-order Stein kernels, we must define a few notations. shall denote the space of -tensors on , that is -dimensional arrays of size , and the subspace of symmetric tensors, that is arrays such that for any permutation and we have . In particular, differentials of order of smooth functions belong to , and . We equip these spaces with their natural Euclidean structure and norm, which we shall respectively denote by and .
In dimension , Hermite polynomials are defined as follows:
Definition 1.1 (Multi-dimensional Hermite polynomials).
For and indices such that , we define the Hermite polynomial as
[TABLE]
We also define , so that the coefficients of the -tensor are the .
The natural generalization of the notion of Stein kernels with respect to the integration by parts formulas defined via multidimensional Hermite polynomials would be to say that a -tensor is a -th order Stein kernel for if for any smooth we have
[TABLE]
However, it turns out that for the applications we shall describe below, it is more convenient to define higher-order Stein kernels in a different way:
Definition 1.2 (Higher-order Stein kernels).
We define Stein kernels of order (as long as they exist) as any symmetric -tensor satisfying
[TABLE]
for all smooth vector-valued such that and are integrable with respect to .
Note that where is a classical Stein kernel. The choice of restricting the definition to symmetric tensors is non-standard when . It is motivated by the fact that we only test the relation on tensors of the form , which are symmetric, and it will allow us to easily relate the expectation of such kernels to moments of the underlying measure.
In some sense, the point of view we develop here is very close to the one developed in [20], where approximate Stein identities with higher-order derivatives are used, in the framework of the zero-bias transform. The main advantage of the functional-analytic framework presented here is to allow more explicit estimates in the multivariate setting, albeit under strong regularity conditions. A particular upside of our estimates is that the dependence on the dimension will be very explicit.
A first remark is that we have the iterative relation
[TABLE]
As we shall later see in Lemma 2.1, for to exist, we must have for any . Of course, this is not a sufficient condition. These kernels are in some sense centered, so that is Gaussian iff . For , this does not exactly match with the usual definition, which is not centered, but this shift will make notations much lighter.
These Stein kernels can be related to kernels associated with Hermite polynomials via linear combinations. For example, if (resp. ) is a kernel associated with Hermite polynomials of degree 2 (resp. 1), then is a second-order Stein kernel in the sense of (3).
As for classical Stein kernels, we can then define the associated discrepancy, which measures how far a given probability measure is from satisfying the associated Gaussian integration by parts formula.
Definition 1.3.
The -th order Stein discrepancy is defined by
[TABLE]
where the infimum is over all possible Stein kernels of order for , since they may not be unique.
Remark 1.1.
The abstract setting we use here is not restricted to Hermite polynomials or higher-order derivatives. For example, it would be possible to define a kernel by considering any tensor-valued function and looking for a function such that for any smooth function we would have
[TABLE]
This more general point of view is related to the one developed in [25]. Existence would be treated in the same way as we shall implement in this work, but we do not have any other example leading to meaningful applications at this point.
The main application of these higher-order Stein kernels to the rate of convergence in the classical CLT is the following decay estimate, made precise in Corollary 4.4: if the random variables are iid, isotropic, centered and have mixed moments of order three equal to zero, then if is the law of the renormalized sum in the CLT we have an estimate of the form
[TABLE]
where is a constant we shall make precise, that depends on a regularity condition on the law of the . This seems to be the first improved rate of convergence in the multidimensional CLT in distance.
The plan of the sequel is as follows: in Section 2, we shall establish basic properties of higher-order Stein kernels, including existence and some first results on what distances the associated discrepancies control. In Section 3, we shall establish some functional inequalities relating Wasserstein distances, entropy and Fisher information. Finally, in Section 4, we shall derive various improved bounds on the rate of convergence in the central limit theorem under moment constraints.
2 Properties
2.1 Existence
Before studying these higher-order Stein kernels and their applications, the first question to ask is when they actually exist. As for classical Stein kernels, there must be some condition beyond normalizing the moments, since they may not exist for measures with purely atomic support.
The first condition we can point out is that existence of Stein kernels constrains the values of certain moments:
Lemma 2.1.
Assume that admits Stein kernels up to order . Then for any polynomial in variables of degree we have , and moreover if this is also true for polynomials of degree then
[TABLE]
for any indices .
Proof.
We prove this statement by induction on . The case can be readily checked by testing the Stein identity on coordinates . Assume the statement holds for . To prove the statement for , it is enough to check it for monomials of degree , by the induction assumption. Up to relabeling, we can restrict to the case where the degree in is positive. Let such that and . Define . We have
[TABLE]
where we have used the symmetry of , and the moment assumption to match the second term. The indices in the last line correspond to having times the index , and the order does not matter by symmetry of . Since for a Gaussian measure the two integrals of moments match, the integral of the kernel must be zero as soon as the moment assumption is satisfied. ∎
In dimension one, when has a nice density with respect to the Lebesgue measure, we can give explicit formulas in terms of :
Proposition 2.2.
Let be a probability measure on with connected support, such that for all . Then the iterative formula
[TABLE]
defines Stein kernels, with the usual explicit formula for classical Stein kernels in dimension one.
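As a concrete instance of the dimension-one setting, here is a minimal numerical sketch. It assumes, as a hypothetical example not taken from the paper, the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ (centered, unit variance). The usual one-dimensional formula $\tau(x) = \frac{1}{p(x)} \int_x^\infty y\, p(y)\, dy$ for a centered density $p$ then gives $\tau(x) = (3 - x^2)/2$. The code checks the classical Stein identity $\mathbb{E}[X \varphi(X)] = \mathbb{E}[\tau(X) \varphi'(X)]$ for $\varphi = \sin$ and computes the squared $L^2$ distance of $\tau$ to the constant $1$.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

A = np.sqrt(3.0)        # Uniform[-A, A] is centered with unit variance
X, W = leggauss(60)     # Gauss-Legendre rule on [-1, 1]
X, W = A * X, W / 2.0   # rescaled into an expectation under Uniform[-A, A]


def stein_kernel(x):
    """Classical 1D Stein kernel of Uniform[-A, A]: (1/p(x)) * int_x^A y p(y) dy."""
    return (A**2 - x**2) / 2.0


def stein_identity_sides(phi, dphi):
    """Return (E[X phi(X)], E[tau(X) phi'(X)]) under Uniform[-A, A]."""
    lhs = float(np.dot(W, X * phi(X)))
    rhs = float(np.dot(W, stein_kernel(X) * dphi(X)))
    return lhs, rhs


lhs, rhs = stein_identity_sides(np.sin, np.cos)
# Squared L^2 distance between the kernel and the constant 1 (the Gaussian kernel)
discrepancy_sq = float(np.dot(W, (stein_kernel(X) - 1.0) ** 2))
print(lhs, rhs, discrepancy_sq)
```

For this example the squared distance can also be computed by hand: $\mathbb{E}[((1 - X^2)/2)^2] = (1 - 2 + 9/5)/4 = 1/5$.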
We refer to [30] for a detailed study of 1st order kernels in dimension one. We shall not develop this point of view further, and focus on the situation in higher dimension, where this formula is no longer available. It turns out that, up to extra moment conditions, the arguments used in [15] for standard Stein kernels also apply. Before stating the conditions, we must first define Poincaré inequalities:
Definition 2.3.
A probability measure $\nu$ on $\mathbb{R}^d$ satisfies a Poincaré inequality with constant $C_P$ if for all locally Lipschitz functions $f$ with $\int f \, d\nu = 0$ we have
$$\int f^2 \, d\nu \le C_P \int |\nabla f|^2 \, d\nu.$$
Poincaré inequalities are a standard family of inequalities in stochastic analysis, with many applications, such as concentration inequalities and rates of convergence to equilibrium for stochastic processes. See [4, 5] and references therein for background information and conditions ensuring such an inequality holds.
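To make the definition concrete, the sketch below checks the Poincaré inequality numerically for the one-dimensional standard Gaussian measure, for which the inequality is known to hold with constant 1. The test functions ($\sin$ and the identity) and the helper name `poincare_sides` are illustrative assumptions, not part of the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule normalized into an expectation under N(0, 1)
NODES, WEIGHTS = hermegauss(80)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def poincare_sides(f, df):
    """Return (Var f(G), E[f'(G)^2]) for G ~ N(0, 1)."""
    values = f(NODES)
    mean = np.dot(WEIGHTS, values)
    variance = np.dot(WEIGHTS, (values - mean) ** 2)
    energy = np.dot(WEIGHTS, df(NODES) ** 2)
    return float(variance), float(energy)


# A nonlinear test function: the inequality is strict
variance, energy = poincare_sides(np.sin, np.cos)
# A linear test function: equality holds for the Gaussian measure
lin_variance, lin_energy = poincare_sides(lambda x: x, lambda x: np.ones_like(x))
print(variance, energy, lin_variance, lin_energy)
```

For $f = \sin$ the two sides are $(1 - e^{-2})/2$ and $(1 + e^{-2})/2$, while linear functions saturate the inequality.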
Our basic existence result is the following:
Theorem 2.4.
Assume that satisfies a Poincaré inequality with constant , and that its moments of order less than match with those of the standard Gaussian. Then a Stein kernel of order exists, and moreover .
This theorem yields a sufficient condition for existence, but it is not necessary. Even in the case , we do not know of a useful full characterization of the situations where Stein kernels exist. Actually, [15] uses a more general type of functional inequality to ensure existence of a 1st order Stein kernel, but its extension to higher order kernels is a bit cumbersome, since the condition would iteratively require previous kernels to have a finite 2nd moment after multiplication with an extra weight.
Proof.
We proceed by induction. The case was proven in [15]. Assume that the statement is true for some , and that has moments of order less than matching with those of the Gaussian. Let be a Stein kernel of order for , which exists by the induction assumption. We wish to prove existence of . Consider the functional
[TABLE]
defined for . It is easy to check that, from the Euler-Lagrange equation for , if is a minimizer of , then satisfies (3).
From the Poincaré inequality and the fact that is centered due to the moment assumption, we have
[TABLE]
so that is a continuous linear form w.r.t. the norm . Hence from the Lax-Milgram theorem (or Riesz representation theorem) we deduce existence (and uniqueness) of a centered global minimizer , and is a suitable Stein kernel, and satisfies the symmetry assumption. Moreover,
[TABLE]
The induction assumption then yields ∎
2.2 Topology
In this section, we are interested in studying what distances between a probability measure $\nu$ and a Gaussian are controlled by our discrepancies. As is classical in Stein’s method, we seek to control a distance of the form
$$d_{\mathcal{F}}(\nu, \gamma) = \sup_{f \in \mathcal{F}} \left( \int f \, d\nu - \int f \, d\gamma \right),$$
where the class of test functions $\mathcal{F}$ should be symmetric, and large enough to indeed separate probability measures. The total variation distance corresponds to the set of functions bounded by one, while the Kantorovitch-Wasserstein distance is obtained when considering the set of 1-Lipschitz functions, thanks to the Kantorovitch-Rubinstein duality formula [33].
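In dimension one, the Kantorovitch-Wasserstein distance over 1-Lipschitz test functions equals the $L^1$ distance between cumulative distribution functions, which makes it easy to evaluate. The sketch below uses that standard fact to compute the distance between the standard Gaussian and a hypothetical example measure, the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$; the integration window and grid size are arbitrary choices.

```python
import math
import numpy as np

A = math.sqrt(3.0)  # Uniform[-A, A] has mean 0 and variance 1


def gaussian_cdf(x):
    """CDF of N(0, 1), evaluated elementwise through math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def uniform_cdf(x):
    """CDF of Uniform[-A, A]."""
    return np.clip((x + A) / (2.0 * A), 0.0, 1.0)


def w1_to_gaussian(cdf, lo=-10.0, hi=10.0, n=200001):
    """1D Kantorovitch-Wasserstein distance to N(0, 1): int |F - Phi| dx (trapezoid rule)."""
    x = np.linspace(lo, hi, n)
    gap = np.abs(cdf(x) - gaussian_cdf(x))
    return float(np.sum((gap[1:] + gap[:-1]) * np.diff(x)) / 2.0)


w1_uniform = w1_to_gaussian(uniform_cdf)
w1_gaussian = w1_to_gaussian(gaussian_cdf)  # sanity check: distance to itself
print(w1_uniform, w1_gaussian)
```

The CDF formula is specific to dimension one; in higher dimension no such closed form is available, which is one reason duality with a class of test functions is the more convenient viewpoint.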
To relate such distances to Stein’s method, we introduce the Poisson equation
[TABLE]
The classical implementation is that if the solution satisfies a suitable regularity bound, then we can control
[TABLE]
by a type of Stein discrepancy. Due to the elliptic nature of the Ornstein-Uhlenbeck generator, the solution gains some regularity compared to . For example, if is -Lipschitz, is [17]. Here, to control it by a Stein discrepancy of order , we shall have to differentiate the solution several times, and require it to satisfy a bound of the form . In particular, solutions to the Poisson equation should be smooth enough, which typically requires to be (this will be explained in more detail in the proof of Theorem 2.5 below). Hence we introduce
We can now state a first result on the topology controlled by higher-order Stein discrepancies.
Theorem 2.5.
Let be a probability measure on whose first mixed moments match with those of a -dimensional standard centered Gaussian. Then
[TABLE]
The controlled distance can be thought of as a generalization of the Kantorovitch-Wasserstein distance, which corresponds to . It is known as the Zolotarev distance of order , and it controls the same topology as the Kantorovitch-Wasserstein distance [8], that is weak convergence and convergence of moments up to order .
Proof.
We first derive a regularity estimate for solutions of the Poisson equation. The scheme of proof below is a straightforward extension of the regularity bound of [14] in the case . Similar regularity bounds, in operator norm, for arbitrary were derived in [18]. In the case where is Lipschitz, better regularity bounds (namely, bounds) were obtained in [17], and it should be possible to get better regularity bounds for general . However, it is not clear that improved bounds would help us further for our purpose.
As pointed out by Barbour [7], a solution of the Poisson equation (5) is given by
[TABLE]
and after integrating by parts with respect to the Gaussian measure, its gradient can be represented as
[TABLE]
and hence higher-order derivatives are given by
[TABLE]
We then have for any
[TABLE]
Therefore
[TABLE]
We then have, for any function satisfying ,
[TABLE]
This concludes the proof. ∎
Remark 2.1.
In dimension one, the Ornstein-Uhlenbeck semigroup enjoys strictly better regularization properties, which would allow one to control stronger distances.
3 Functional inequalities
Our first functional inequality is a generalization of the HSI inequality of [22].
Theorem 3.1 (HSI inequalities).
Let . We have
[TABLE]
This inequality improves on the classical Gaussian logarithmic Sobolev inequality of Gross [21].
We introduce the Ornstein-Uhlenbeck semigroup
$$P_t f(x) = \mathbb{E}\left[ f\left( e^{-t} x + \sqrt{1 - e^{-2t}} \, G \right) \right],$$
where $G$ is a standard Gaussian random variable. The properties of this semigroup have been well-studied. In particular, as time goes to infinity, converges to , and the entropy and Fisher information are related by de Bruijn’s formula:
[TABLE]
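The behavior of the Ornstein-Uhlenbeck semigroup can be explored numerically in dimension one. The following minimal sketch assumes the standard Mehler representation $P_t f(x) = \mathbb{E}[f(e^{-t} x + \sqrt{1 - e^{-2t}}\, G)]$ with $G$ a standard Gaussian, evaluated by Gauss-Hermite quadrature. It checks two standard facts: Hermite polynomials are eigenfunctions ($P_t H_k = e^{-kt} H_k$), and $P_t f$ approaches the Gaussian average of $f$ for large $t$. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Gauss-Hermite rule normalized into an expectation under N(0, 1)
NODES, WEIGHTS = hermegauss(40)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def ou_semigroup(f, t, x):
    """Evaluate P_t f at the points x via Mehler's formula and quadrature."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    scale = np.sqrt(1.0 - np.exp(-2.0 * t))
    # Row i holds the quadrature points e^{-t} x_i + sqrt(1 - e^{-2t}) * node_j
    points = np.exp(-t) * x[:, None] + scale * NODES[None, :]
    return f(points) @ WEIGHTS


def he3(x):
    """Probabilists' Hermite polynomial He_3(x) = x^3 - 3x."""
    return hermeval(x, [0, 0, 0, 1])


xs = np.array([-1.5, 0.0, 0.7, 2.0])
t = 0.3
lhs = ou_semigroup(he3, t, xs)           # P_t He_3
rhs = np.exp(-3.0 * t) * he3(xs)         # e^{-3t} He_3
long_time = ou_semigroup(np.cos, 20.0, xs)  # close to E[cos(G)] = e^{-1/2}
print(lhs, rhs, long_time)
```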
The key lemma at the core of our results is the following estimate on Fisher information along the flow:
Lemma 3.2.
For any , we have
[TABLE]
When , this estimate corresponds to the main result of [27], and played a core role in the proofs of the functional inequalities of [22]. This extension to higher orders will allow us to get more precise estimates when higher-order Stein kernels exist, i.e. under moment constraints.
Proof.
We have the commutation relation
[TABLE]
Following [22], we have a representation formula for the Fisher information along the Ornstein-Uhlenbeck flow:
[TABLE]
Applying the Cauchy-Schwarz inequality and integrating out in , we get the result. ∎
Proof of Theorem 3.1.
From (7) and the decay property of the Fisher information , we deduce that for any we have
[TABLE]
Using Lemma 3.2 on the second term, we get
[TABLE]
We optimize by taking such that if possible, and otherwise (which boils down to the usual logarithmic Sobolev inequality), and we get the result. We used the easy bound to simplify the expression. ∎
We can also obtain functional inequalities controlling the distance. Recall that in the case , [22] established the inequality
[TABLE]
which itself reinforced classical bounds on the distance via Stein’s method, and allows one to get simple proofs of CLTs in distance, since Stein discrepancies turn out to be easier to estimate in some situations. Our result is the following variant involving higher-order discrepancies:
Theorem 3.3 ( transport inequalities).
For , we have
[TABLE]
For , we have .
The first inequality will allow us to improve the rate of convergence in the CLT in distance for measures whose moments of order 3 are equal to zero. As we will later see, when , these inequalities are not satisfactory for applications to CLTs.
Proof.
As pointed out in [28], we have
[TABLE]
For , we have for any
[TABLE]
Optimizing in then leads to choosing such that . If , we end up with the bound , and this upper bound is larger than . Otherwise, we bound it by , and the desired bound holds either way.
For , we similarly have
[TABLE]
and taking yields the result. The inequality could be improved, at the cost of clarity, but as far as we can see the sharper inequality obtained by this method does not significantly improve the outcomes in the applications. ∎
4 Improved rates of convergence in the classical CLT
We are interested in the rate of convergence of the law of (normalized) sums of iid random variables to their Gaussian limit. It is known that the rate of convergence in Wasserstein distance is of order in general, as soon as the fourth moment is finite [11]. However, it is possible to do a Taylor expansion of the distance as goes to infinity, and see that under moment constraints, the asymptotic rate of decay may improve. More precisely, [9, 10] show that in dimension one, if the first moments of the random variables match with those of the standard Gaussian, then the Wasserstein distance (and the stronger relative entropy and Fisher information) asymptotically decays like . Non-asymptotic rates in dimension one were obtained in [20] using a variant of Stein’s method, and strong entropic rates under a Poincaré inequality and after regularization by convolution with a Gaussian measure were obtained in [24], still in dimension one. [2] gives a sharp non-improved rate of convergence in the entropic CLT in dimension one in the classical case (i.e. without the extra moment constraints satisfied), without any regularization. See also [6] for a multi-dimensional extension when the measure is additionally assumed to be log-concave.
It is possible to use Stein’s method to give simple proofs of this decay rate [29, 22]. In particular, [15] proves a monotone decay of the Stein discrepancy, which immediately implies the quantitative CLT as soon as the Stein discrepancy of a single variable is finite.
We consider the usual setting for the classical CLT: a sequence $(X_i)_{i \ge 1}$ of iid random variables with distribution $\nu$, and the normalized sum
$$U_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i,$$
whose law we shall denote by $\nu_n$.
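The effect of moment constraints on the rate can be illustrated in dimension one without any simulation. The sketch below is only a heuristic illustration under stated assumptions: it uses the single smooth test function $x \mapsto e^{itx}$ with $t = 1$ (so it measures neither a Wasserstein nor a Zolotarev distance), and two hypothetical example laws with mean 0 and variance 1: a Rademacher variable (third moment zero) and a centered exponential variable (third moment 2). The characteristic function of the normalized sum of $n$ iid copies is $\phi(t/\sqrt{n})^n$.

```python
import numpy as np


def cf_rademacher(t):
    """Characteristic function of a symmetric +/-1 variable (third moment 0)."""
    return np.cos(t) + 0j


def cf_centered_exponential(t):
    """Characteristic function of E - 1 with E ~ Exp(1) (third moment 2)."""
    return np.exp(-1j * t) / (1.0 - 1j * t)


def cf_gap(cf, n, t=1.0):
    """|E exp(i t U_n) - exp(-t^2/2)| for the normalized sum U_n of n iid copies."""
    return float(abs(cf(t / np.sqrt(n)) ** n - np.exp(-t**2 / 2.0)))


for n in (100, 400, 1600):
    print(n, cf_gap(cf_rademacher, n), cf_gap(cf_centered_exponential, n))

# Multiplying n by 4 should divide the gap by about 4 when the third moment
# vanishes (rate 1/n), and by about 2 otherwise (rate 1/sqrt(n)).
ratio_symmetric = cf_gap(cf_rademacher, 100) / cf_gap(cf_rademacher, 400)
ratio_skewed = cf_gap(cf_centered_exponential, 100) / cf_gap(cf_centered_exponential, 400)
print(ratio_symmetric, ratio_skewed)
```

The Rademacher law has purely atomic support, so it falls outside the regularity assumptions used in this paper; it is chosen here only because its characteristic function is explicit.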
The aim of this section is to show similar results for higher-order discrepancies. The starting point is the following construction of Stein kernels of the second type for sums of independent random variables, which is an immediate generalization of the same result for .
Lemma 4.1.
Let be a -th order Stein kernel for . Then
[TABLE]
is a -th order Stein kernel for .
Proof.
This can easily be checked by induction on via (4). The case is well-known [22]. ∎
As a consequence, we obtain bounds on the rate of convergence of the Stein discrepancies:
Corollary 4.2.
Assume that all the mixed moments of order less than of are the same as those of the standard Gaussian. Then
[TABLE]
We then obtain a rate of convergence in the multivariate CLT for the Zolotarev distances as an immediate consequence of the comparison from Theorem 2.5:
Corollary 4.3.
Assume that satisfies a Poincaré inequality with constant , and that all its mixed moments of order less than match with those of the standard Gaussian measure. Let be the law of . Then
[TABLE]
Such results have been obtained in dimension one (and for random vectors with independent coordinates) in [18, 19]. See also [20] for related results.
Combined with the logarithmic Sobolev inequality and Lemma 3.2, this also yields a multi-dimensional extension of a result of [24] on improved entropic CLTs for regularized measures, with more explicit quantitative prefactors.
In the case , due to Theorem 3.3, we can upgrade the distance to , losing however a logarithmic factor:
Corollary 4.4.
Assume that all the mixed moments of order less than three of are the same as those of the standard Gaussian, and that its law satisfies a Poincaré inequality with constant . Then
[TABLE]
as soon as . If additionally the mixed fourth moments match with those of the Gaussian, we get
[TABLE]
Proof.
The first inequality is obtained by plugging the upper bounds on discrepancies into the bounds of Theorem 3.3, while using the fact that is increasing on . The second inequality is obtained by using the 2nd order kernels; with our estimates, using even higher-order kernels does not improve the bounds. ∎
When , we only miss the sharp asymptotic rate of [9] by a logarithmic factor. However, under higher moment constraints we know that the asymptotic rate is much better than (at least in dimension one), so this result is not satisfactory.
For the entropy without regularization, we obtain the following rates under the assumption that mixed third moments are equal to zero:
Proposition 4.5.
Assume that the law of the satisfies a Poincaré inequality and that the moments of order less than three agree with those of the standard Gaussian measure. Then
[TABLE]
This eliminates a logarithmic factor from previous results of [22] in this particular case, but once again does not give the expected sharp decay rate under the moment assumptions.
Proof.
This estimate is obtained by applying the HSI inequality with and the fact that Fisher information is monotone along the CLT [1]. ∎
Acknowledgments: This work was supported by the Projects MESA (ANR-18-CE40-006) and EFI (ANR-17-CE40-0030) of the French National Research Agency (ANR), ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02 and the France-Berkeley Fund. I would also like to thank Guillaume Cébron, Thomas Courtade, Michel Ledoux and Gésine Reinert for discussions on this topic.
References
[1] S. Artstein, K. Ball, F. Barthe and A. Naor, Solution of Shannon’s Problem on the Monotonicity of Entropy. J. Amer. Math. Soc. 17, 975–982 (2004).
[2] S. Artstein, K. Ball, F. Barthe and A. Naor, On the Rate of Convergence in the Entropic Central Limit Theorem. Probab. Theory Relat. Fields 129, 381–390 (2004).
[3] H. Airault, P. Malliavin and F. Viens, Stokes formula on the Wiener space and n-dimensional Nourdin-Peccati analysis. J. Funct. Anal. 258(5), 1763–1783 (2010).
[4] D. Bakry, I. Gentil and M. Ledoux, Analysis and geometry of Markov diffusion operators. Springer, Grundlehren der mathematischen Wissenschaften, Vol. 348, xx+552 (2014).
[5] D. Bakry, F. Barthe, P. Cattiaux and A. Guillin, A simple proof of the Poincaré inequality in a large class of probability measures including log-concave cases. Elec. Comm. Prob. 13, 60–66 (2008).
[6] K. Ball and V. H. Nguyen, Entropy jumps for isotropic log-concave random vectors and spectral gap. Studia Math. 213, 1 (2012).
[7] A. D. Barbour, Stein’s method for diffusion approximations. Probab. Theory Rel. Fields 84(3), 297–322 (1990).
[8] N. Belili and H. Heinich, Distances de Wasserstein et de Zolotarev. C. R. Acad. Sci. Paris, t. 330, Série I, 811–814 (2000).
