Entropic CLT for smoothed convolutions and associated entropy bounds
Sergey G. Bobkov, Arnaud Marsiglietti

Abstract
We explore an asymptotic behavior of entropies for sums of independent random variables that are convolved with a small continuous noise.
Key words and phrases:
Central limit theorem, Entropic CLT
2010 Mathematics Subject Classification:
Primary 60E, 60F
Sergey G. Bobkov: School of Mathematics, University of Minnesota, Minneapolis, MN 55455, USA. Research was partially supported by NSF grant DMS-1855575.
Arnaud Marsiglietti: Department of Mathematics, University of Florida, Gainesville, FL 32611, USA.
1. Introduction
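Let $(X_n)_{n \geq 1}$ be independent, identically distributed (i.i.d.) random vectors in $\mathbb{R}^d$ with an isotropic distribution, that is, with mean zero and an identity covariance matrix. By the central limit theorem (CLT), given a random vector $X$ in $\mathbb{R}^d$, independent of all $X_n$'s, the normalized sums
$$Z_n = \frac{1}{\sqrt{n}}\,(X + X_1 + \dots + X_n) \tag{1}$$
converge weakly in distribution as $n \to \infty$ to the standard normal random vector $Z$ with density
$$\varphi(x) = \frac{1}{(2\pi)^{d/2}}\, e^{-|x|^2/2}, \quad x \in \mathbb{R}^d. \tag{2}$$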
Suppose that $X$ has a finite second moment and an absolutely continuous distribution, so that the $Z_n$ have some densities $p_n$. A natural question is whether this property (that is, the weak CLT) may be strengthened to convergence of the entropies
$$h(Z_n) = -\int_{\mathbb{R}^d} p_n(x)\,\log p_n(x)\,dx$$
to the entropy $h(Z)$ of the Gaussian limit. The usual entropic CLT corresponds to the i.i.d. case with $X = 0$. Then this CLT is known to hold if and only if the $Z_n$ have densities $p_n$ with finite $h(Z_n)$ for some, or equivalently all large enough, $n$ [2] (see also [27], [1], [17], [18], [19], [5], [14]). Remarkably, the presence of a small non-zero noise $X$ in (1) may enlarge the range of applicability of the entropic CLT. Here is one observation in this direction, stated in terms of the characteristic function
$$f(t) = \mathbb{E}\, e^{i \langle t, X \rangle}, \quad t \in \mathbb{R}^d.$$
Theorem 1.1.
If $f$ is compactly supported, and $X_1$ has a non-lattice distribution, then
$$h(Z_n) \to h(Z) \quad \text{as } n \to \infty. \tag{3}$$
This convergence also holds for lattice distributions, if $f$ is supported on the ball $|t| \leq T$ for some $T > 0$ depending on the distribution of $X_1$. One may take , assuming that the 3-rd absolute moment
[TABLE]
is finite.
The assumption of compactness on the support of the characteristic function $f$ of $X$ requires its density to be the restriction to $\mathbb{R}^d$ of an entire function on $\mathbb{C}^d$ of exponential type, by Paley-Wiener theorems (cf. e.g. [29]).
The entropic CLT (3) may equivalently be stated as the convergence
$$D(Z_n \| Z) = \int_{\mathbb{R}^d} p_n(x)\,\log \frac{p_n(x)}{\varphi(x)}\,dx \to 0 \quad \text{as } n \to \infty$$
for the Kullback-Leibler distance (also called relative entropy or informational divergence). It belongs to the family of so-called strong (informational) distances, which dominate many other metrics used in the usual CLTs about weak convergence of probability distributions. As was mentioned to us by one of the referees, one immediate consequence of (3) is the CLT for non-smoothed normalized sums with respect to the Kantorovich transport distance (cf. Remark 4.4 for details).
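The equivalence between entropy convergence and convergence in relative entropy can be checked by hand in the simplest Gaussian case. The following is a minimal sketch, assuming dimension one and a centered normal variable with variance `var` (a hypothetical parameter, not part of the paper). It uses the standard closed forms for Gaussian entropy and Gaussian relative entropy to confirm the identity $D(X\|Z) = h(Z) - h(X) + \frac{1}{2}(\mathbb{E}X^2 - 1)$, which shows that the two modes of convergence coincide once second moments converge.

```python
import math

def gaussian_entropy(var):
    """Differential entropy (in nats) of N(0, var) in dimension one."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def kl_closed_form(var):
    """Closed form of D(N(0, var) || N(0, 1))."""
    return 0.5 * (var - 1.0 - math.log(var))

def kl_via_entropy(var):
    """h(Z) - h(X) + (E X^2 - 1)/2 for X ~ N(0, var), Z ~ N(0, 1)."""
    return gaussian_entropy(1.0) - gaussian_entropy(var) + 0.5 * (var - 1.0)

for var in (0.5, 1.0, 1.1, 2.0):
    print(f"var = {var}: closed form {kl_closed_form(var):.6f}, "
          f"via entropies {kl_via_entropy(var):.6f}")
```

The two columns agree for every variance, and both vanish only at `var = 1`.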
In general, the hypothesis on the support of $f$ in Theorem 1.1 cannot be removed, but it may be weakened by involving more delicate properties related to the location of zeros of the characteristic function. This may be seen from the following characterization in one important example under mild regularity assumptions on $f$.
Theorem 1.2.
Suppose that $X_1$ has a uniform distribution on the discrete cube $\{-1,1\}^d$, that is, with independent Bernoulli coordinates. Let the characteristic function $f$ of $X$ satisfy
[TABLE]
where $\|t\|$ denotes the distance from the point $t$ to the lattice $\pi \mathbb{Z}^d$. Then the entropic CLT (3) holds true if and only if
$$f(\pi k) = 0 \quad \text{for all } k \in \mathbb{Z}^d, \ k \neq 0. \tag{5}$$
The second moment assumption on guarantees that has a bounded continuous derivative with its Euclidean length . The assumption of integrability of in (4) requires the density of to be continuous on . In dimension , the condition (4) is fulfilled, as long as both and are in . If , (4) is more complicated, but is fulfilled, for example, under decay assumptions such as
[TABLE]
holding for all with some constants and .
Although an information-theoretic meaning of the property (5) is not clear, it is indeed connected with the entropy functional $h$. Namely, under the conditions (4)-(5), it turns out that the entropy $h(X)$ has to be non-negative. This is emphasized in the next statement, where we drop the isotropy condition and extend the Bernoulli case to arbitrary integer valued random vectors. As before, we assume that $X$ is a continuous random vector in $\mathbb{R}^d$ with finite second moment, which is independent of all $X_n$'s.
Theorem 1.3.
Let $(X_n)_{n \geq 1}$ be a sequence of independent, integer valued random vectors, whose components have variance one. Then
[TABLE]
In particular, if $h(Z_n) \to h(Z)$ as $n \to \infty$, then necessarily $h(X) \geq 0$.
Actually, the independence assumption may be weakened further to uncorrelatedness (as explained in Theorem 5.3 at the end of these notes).
We do not discuss here possible applications of the last conclusion in Theorem 1.3. Let us however stress that obtaining lower and upper bounds for the differential entropy, under various hypotheses or for different classes of probability distributions on the Euclidean space $\mathbb{R}^d$, is in itself an important and self-sufficient direction in information theory, which is motivated by many problems and is connected with other areas. For example, applications of lower bounds to rate-distortion theory and channel capacity were put forward in [23] (see also [12], [16], [22]). Let us also mention Bourgain's slicing problem in asymptotic geometric analysis, cf. [9]. As a main conjecture, it states that, for any convex body $K$ in $\mathbb{R}^d$ of volume one, there is a hyperplane $H$ such that the $(d-1)$-dimensional volume of the slice $K \cap H$ is bounded away from zero by a universal positive constant. It was shown in [6] that the latter may equivalently be formulated as the property that, if $X$ is a random vector in $\mathbb{R}^d$ with an isotropic log-concave distribution, then
[TABLE]
with some universal constant $c$. Besides this conjecture, the past few years have seen a growing interest in the study of entropic inequalities, as they shed new light on fundamental problems in convex geometry (cf. e.g. [7], [11], [10]). We refer to the survey paper [21] for further details on the connections between entropic inequalities and geometric and functional inequalities.
The paper is organized as follows. We start in Section 2 with general upper and lower bounds on the Kullback-Leibler distance
$$D(X \| Z) = \int_{\mathbb{R}^d} p(x)\,\log \frac{p(x)}{\varphi(x)}\,dx$$
from the distribution of a random vector $X$ with density $p$ to the standard normal law in terms of the $L^2$-distance
$$\Delta = \|p - \varphi\|_2 = \Big(\int_{\mathbb{R}^d} (p(x) - \varphi(x))^2\,dx\Big)^{1/2}. \tag{8}$$
Throughout, $Z$ denotes a standard normal random vector in $\mathbb{R}^d$, thus with density $\varphi$ as in (2) and with characteristic function
$$\mathbb{E}\, e^{i \langle t, Z \rangle} = e^{-|t|^2/2}, \quad t \in \mathbb{R}^d.$$
As usual, the Euclidean space $\mathbb{R}^d$ is endowed with the canonical inner product $\langle \cdot, \cdot \rangle$ and the norm $|\cdot|$. These bounds are applied in Section 3 to express the entropic CLT as convergence of densities in $L^2$. Theorem 1.1 and Theorem 1.2 (in a somewhat refined form) are proved in Section 4. Using Proposition 3.1, the proofs employ recent results obtained in [8] on local limit theorems with respect to the $L^2$ and $L^\infty$-norms. Theorem 1.3 is proved in Section 5, where we also discuss the connection between entropy bounds and the entropic CLT.
2. General bounds on relative entropy
Throughout this section, let $X$ be a random vector in $\mathbb{R}^d$ with density $p$, and let $\Delta$ be defined according to (8).
Proposition 2.1.
Suppose that . If , then
[TABLE]
with some constant depending on only. Moreover, if for some constant , then
[TABLE]
First we collect a few elementary large deviation bounds.
Lemma 2.2.
For any ,
- (a)
;
- (b)
.
Proof.
Clearly, follows from . To derive the second bound, write
[TABLE]
where denotes the volume of the unit ball in . Given , consider the function
[TABLE]
We have and
[TABLE]
Thus, is decreasing in some interval and is increasing in . Therefore, for all , if , that is, for
[TABLE]
Using (11), we obtain
[TABLE]
so
[TABLE]
∎
To get the upper bound (9), we also need to control the weighted quadratic tails in terms of the $L^2$-distance $\Delta$.
Lemma 2.3.
If , then for all ,
[TABLE]
Proof.
We have
[TABLE]
The last integral is bounded by . Also, by the Cauchy inequality,
[TABLE]
where is the volume of the unit ball in . Here, . ∎
Lemma 2.4.
For all ,
[TABLE]
Proof.
In definition (8), we split the integration into the two regions. Using the inequality , , and applying the first bound of Lemma 2.2, we have
[TABLE]
For the second region , just write
[TABLE]
Combining these relations and noting that , we thus get
[TABLE]
∎
As a consequence, we obtain:
Lemma 2.5.
For all ,
[TABLE]
Proof.
We use the notation . Subtracting from and then adding, one can write
[TABLE]
Next, let us apply Cauchy’s inequality together with the bound so that to estimate the last integral from above by
[TABLE]
Here, using the first bound of Lemma 2.2, we have
[TABLE]
Therefore,
[TABLE]
To simplify, the last integrand may be bounded by
[TABLE]
so,
[TABLE]
Using this estimate in (12) together with for , we get
[TABLE]
∎
Proof of Proposition 2.1.
Combining Lemma 2.5 with Lemma 2.3, we immediately get
[TABLE]
To get (9), it remains to take here
[TABLE]
For the lower bound (10), let us recall that . By Taylor’s expansion, for all and , there is a point between and such that
[TABLE]
Inserting , , we obtain a measurable function with values between and , satisfying
[TABLE]
Let us integrate this equality over and use to get
[TABLE]
Hence
[TABLE]
It remains to use the assumptions and , so that as well. ∎
3. Topological properties of relative entropy
Applying Proposition 2.1 to a sequence of random vectors, we arrive at necessary and sufficient conditions for the convergence in the Kullback-Leibler distance in terms of the $L^2$-distances
[TABLE]
More precisely, we have:
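Proposition 3.1.
Let $(X_n)_{n \geq 1}$ be a sequence of random vectors in $\mathbb{R}^d$ with densities $p_n$. Suppose that, as $n \to \infty$,
- (a) $\mathbb{E}\,|X_n|^2 \to d$;
- (b) $\|p_n - \varphi\|_2 \to 0$.
Then $D(X_n \| Z) \to 0$, or equivalently $h(X_n) \to h(Z)$, as $n \to \infty$. Conversely, if the $p_n$ are uniformly bounded, then the conditions (a)-(b) are also necessary for the convergence in $D$.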
Before turning to the proof, let us recall a basic abstract definition of the Kullback-Leibler distance (i.e., relative entropy). Let and be random elements in a measurable space with distributions and , respectively. If is absolutely continuous with respect to and has density , the relative entropy of with respect to is defined as
[TABLE]
where in the last equality we assume that and have densities and with respect to the dominating measure on , so that (which is well-defined -almost everywhere). This definition does not depend on the choice of , and one may always take , for example. If is not absolutely continuous with respect to , one puts . For basic properties of this functional, we refer an interested reader to [15], and here only mention one well-known relation
[TABLE]
It holds for any measurable function on for which the right-hand side is finite (this relation easily follows from the elementary inequality , , ).
In the case where with Lebesgue measure , and choosing , , we have in particular
[TABLE]
If has a normal distribution, the last expectation is finite for some . Therefore, finiteness of forces the random vector in to have a finite second moment. One can now introduce an affine invariant functional
[TABLE]
where the infimum is running over all absolutely continuous normal distributions on . Thus, represents the Kullback-Leibler distance from the distribution of to the class of all non-degenerate Gaussian measures on . It is finite, only if the distribution of is absolutely continuous and has a finite second moment, and then this infimum is attained on the normal distribution with the same mean and covariance matrix as for (cf. e.g. [3], Section 10.7).
Our next step is to quantify the properties from Proposition 3.1 in terms of , where is a standard normal random vector in . Denote by the density of the normal law with these parameters, that is, let have density
[TABLE]
so that . By the definition, if has density , we have
[TABLE]
Simplifying, we obtain an explicit formula
[TABLE]
where are eigenvalues of the matrix (). Note that all the terms on the right-hand side are non-negative. This allows us to control the first two moments of in terms of . In particular, , so that the closeness of to in relative entropy implies the closeness of the means. To come to a similar conclusion about the covariance matrices, consider the non-negative convex function
[TABLE]
We have and . If , by Taylor’s formula about the point with some point between and ,
[TABLE]
For the values , we have a linear bound with some constant . Namely, write the latter inequality as , i.e., for . As easy to check, the function is decreasing on the whole positive axis, so in . Hence, one may take , and thus . The two bounds yield
[TABLE]
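A two-regime bound of this kind is easy to sanity-check numerically. The sketch below assumes that the convex function in question is $\psi(t) = t - 1 - \log t$ (the per-eigenvalue term in the relative entropy between centered Gaussian laws, since $D(N(0,\sigma^2)\|N(0,1)) = \frac{1}{2}\psi(\sigma^2)$), and it assumes the constants $1/8$ and $1 - \log 2$ suggested by the Taylor and monotonicity arguments. The grids and the tolerance are arbitrary illustrative choices.

```python
import math

def psi(t):
    """Assumed form of the convex function: psi(t) = t - 1 - log(t), t > 0."""
    return t - 1.0 - math.log(t)

C_LINEAR = 1.0 - math.log(2.0)   # assumed constant in the linear regime
TOL = 1e-12                      # slack for floating-point rounding

# Quadratic regime: 0 < t <= 2, where psi''(t1) = 1/t1^2 >= 1/4.
near_grid = [0.001 * k for k in range(1, 2001)]
quadratic_ok = all(psi(t) >= (t - 1.0) ** 2 / 8.0 - TOL for t in near_grid)

# Linear regime: t >= 2, where log(t)/(t - 1) <= log(2).
far_grid = [2.0 + 0.01 * k for k in range(100000)]
linear_ok = all(psi(t) >= C_LINEAR * (t - 1.0) - TOL for t in far_grid)

print("quadratic lower bound holds on (0, 2]:", quadratic_ok)
print("linear lower bound holds on [2, 1002):", linear_ok)
```

The linear bound is attained with equality at $t = 2$, which is why a small tolerance is used in the comparison.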
Let us summarize.
Lemma 3.2.
Given a random vector with mean and covariance matrix with eigenvalues , we have
[TABLE]
In particular, putting , we have
- (a)
;
- (b)
for all ;
- (c)
.
Here, the closeness of all to 1 may also be stated as closeness of to the identity matrix in the (squared) Hilbert-Schmidt norm . These bounds have an application in the problem where one needs to determine whether or not there is convergence in relative entropy for a sequence of random vectors.
Corollary 3.3.
Given a sequence of random vectors in with means and covariance matrices , the property as is equivalent to the next three conditions:
[TABLE]
Proof of Proposition 3.1.
First recall that
[TABLE]
Hence, if like in , then . To show that the conditions are sufficient for the convergence in , denote by the characteristic functions of . By the assumption and applying the Plancherel theorem,
[TABLE]
as . Define the random vectors , where (), so that . They have densities with characteristic functions
[TABLE]
Using and applying the Plancherel theorem once more together with the triangle inequality in , we then get
[TABLE]
Here, the last norm tends to zero, so, . We are in position to apply the upper bound (9) of Proposition 2.1 to which yields and thus
[TABLE]
Conversely, assuming that and applying Corollary 3.3, we get the property . Hence, , and according to the formula (14). By the assumption, are uniformly bounded, that is, with some constant . We are in position to apply the lower bound (10) which yields and therefore
[TABLE]
∎
4. Proof of Theorems 1.1-1.2
From now on, let the random vectors $Z_n$ be defined as the normalized sums according to (1). The proof of Theorem 1.1 is based on the following statement obtained in [8].
Lemma 4.1.
([8, Theorem 1.3]) There exists $T > 0$ depending on the distribution of $X_1$ with the following property. If $f$ is supported on the ball $|t| \leq T$, then the random vectors $Z_n$ have continuous densities $p_n$ such that
$$\sup_{x \in \mathbb{R}^d} |p_n(x) - \varphi(x)| \to 0 \quad \text{as } n \to \infty. \tag{15}$$
If is finite, one may take . If has a non-lattice distribution, may be arbitrary.
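Recall that in Theorems 1.1-1.2 we assume that $\mathbb{E}\,|X|^2 < \infty$, which implies $\mathbb{E}\,|Z_n|^2 \to d$ as $n \to \infty$. In addition, the uniform convergence (15) is stronger than
$$\|p_n - \varphi\|_2 \to 0 \quad \text{as } n \to \infty, \tag{16}$$
since
$$\int_{\mathbb{R}^d} (p_n(x) - \varphi(x))^2\,dx \leq \sup_x |p_n(x) - \varphi(x)| \int_{\mathbb{R}^d} |p_n(x) - \varphi(x)|\,dx \leq 2\, \sup_x |p_n(x) - \varphi(x)|.$$
By Proposition 3.1, both properties ensure that $D(Z_n \| Z) \to 0$, and we obtain Theorem 1.1.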
Now, let us turn to the Bernoulli case, that is, when $X_1$ has a uniform distribution on the discrete cube $\{-1,1\}^d$. Theorem 1.2 may be slightly refined in one direction by weakening the condition (4). As before, $\|t\|$ denotes the distance from the point $t$ to the lattice $\pi \mathbb{Z}^d$.
Theorem 4.2.
Suppose that the characteristic function of satisfies
[TABLE]
together with
[TABLE]
Then we have the entropic CLT, that is, as . Conversely, if the entropic CLT holds together with
[TABLE]
then satisfies (17). In this case the uniform local limit theorem (15) takes place.
The point of the refinement is that (18) is weaker than (19), which is exactly the condition (4) in Theorem 1.2. In dimension , (18) is fulfilled whenever and are in (by Cauchy’s inequality), that is, when the density of the random variable satisfies
[TABLE]
(which holds automatically, if is bounded). If , (18) is fulfilled under the decay assumptions (6) with a weaker parameter constraint . This is the case, for example, where is uniformly distributed in the (solid) cube , while (19) does not hold. In [8], it was shown that the properties (17)-(18) imply the -convergence of densities (16), while (17) together with a stronger assumption (19) leads to the uniform convergence (15). Hence, we can apply Proposition 3.1 to conclude that . It was also shown there that the property (17) is fulfilled under the -convergence (16). In order to arrive at a similar conclusion under an apriori weaker entropic CLT, we involve the assumption (19) and prove here:
Lemma 4.3.
Suppose that $X_1$ has a uniform distribution on the discrete cube $\{-1,1\}^d$. If the condition (19) is fulfilled, then the $Z_n$ have uniformly bounded densities $p_n$.
Having this assertion, we therefore complete the proof of Theorem 4.2 and of Theorem 1.2 by appealing to Proposition 3.1 once more.
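The role of the zeros of the characteristic function can be seen in a one-dimensional toy computation. The sketch below is an illustration under our own assumptions, not the paper's argument: $d = 1$, symmetric $\pm 1$ summands, and noise $X$ uniform on $(-a, a)$ with half-width `a` at most one (a hypothetical parameter). In that case the entropy of $Z_n$ is available in closed form, because the intervals around the atoms of $X_1 + \dots + X_n$ do not overlap. For $a = 1$ the characteristic function $\sin(t)/t$ vanishes at every $\pi k$, $k \neq 0$, while for $a = 1/2$ it does not vanish at $\pi$.

```python
import math

def binomial_entropy(n):
    """Shannon entropy (nats) of Binomial(n, 1/2), which equals H(S_n)
    for S_n = X_1 + ... + X_n with symmetric +-1 summands."""
    total = 0.0
    for k in range(n + 1):
        log_p = (math.lgamma(n + 1) - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1) - n * math.log(2.0))
        total -= math.exp(log_p) * log_p
    return total

def char_fn_uniform(t, a):
    """Characteristic function sin(a t)/(a t) of the uniform law on (-a, a)."""
    return 1.0 if t == 0 else math.sin(a * t) / (a * t)

def entropy_of_smoothed_sum(n, a):
    """h(Z_n) for Z_n = (X + S_n)/sqrt(n), X uniform on (-a, a), 0 < a <= 1.

    The atoms of S_n are 2 apart, so the intervals (s - a, s + a) do not
    overlap and X + S_n has density P(S_n = s)/(2a) on each of them. Hence
    h(X + S_n) = H(S_n) + log(2a); rescaling by sqrt(n) subtracts log(n)/2.
    """
    return binomial_entropy(n) + math.log(2.0 * a) - 0.5 * math.log(n)

H_Z = 0.5 * math.log(2.0 * math.pi * math.e)  # entropy of N(0, 1)

for a in (1.0, 0.5):
    print(f"a = {a}: f(pi) = {char_fn_uniform(math.pi, a):+.3f}")
    for n in (10, 100, 1000):
        gap = entropy_of_smoothed_sum(n, a) - H_Z
        print(f"  n = {n:5d}: h(Z_n) - h(Z) = {gap:+.5f}")
```

With $a = 1$ the gap tends to zero, and the noise entropy $h(X) = \log 2$ is non-negative. With $a = 1/2$ the entropies settle near $h(Z) - \log 2$, so the entropic CLT fails in this toy case.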
Proof of Lemma 4.3.
Put for . By the assumption (19), the characteristic functions
[TABLE]
are integrable. Hence, have continuous densities given by the Fourier inversion formula
[TABLE]
Let us partition into the cubes , , , so that for . Splitting the integration in (20), we can write
[TABLE]
Putting and using the periodicity of the cosine function together with the bound for , we have
[TABLE]
By Taylor’s formula,
[TABLE]
Hence, changing the variable , we get
[TABLE]
with some constant depending on only, where for . Hence
[TABLE]
The next summation over all leads to
[TABLE]
where we applied the assumption (19). Put
[TABLE]
By (21),
[TABLE]
Hence, again changing the variable , and then , we get
[TABLE]
with some constant depending on the dimension, only. Performing summation over all , we get
[TABLE]
Due to (22), with some other -dependent constants
[TABLE]
and thus is bounded by a constant which does not depend on . ∎
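Remark 4.4. To better appreciate the meaning of Theorem 1.1, let us also comment on the relationship between the entropic and transport CLTs. Given two random vectors $X$ and $Y$ in $\mathbb{R}^d$ with distributions $\mu$ and $\nu$ respectively, the (quadratic) Kantorovich distance is defined as
$$W_2(X,Y) = W_2(\mu,\nu) = \inf_{\pi} \Big(\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |x - y|^2\,d\pi(x,y)\Big)^{1/2},$$
where the infimum runs over all (Borel) probability measures $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$. It represents a metric in the space of all probability measures on $\mathbb{R}^d$ with finite second moment, which is closely related to the weak topology. More precisely, given a sequence $\mu_n$ and a "point" $\mu$ in this space, the convergence $W_2(\mu_n,\mu) \to 0$ holds true as $n \to \infty$ if and only if the $\mu_n$ are weakly convergent to $\mu$, that is,
$$\int u\,d\mu_n \to \int u\,d\mu$$
for any bounded continuous function $u$ on $\mathbb{R}^d$, and $\int |x|^2\,d\mu_n(x) \to \int |x|^2\,d\mu(x)$ (cf. e.g. [31], p. 212).
When $\nu$ is the standard Gaussian measure on $\mathbb{R}^d$, the relationship of $W_2$ with relative entropy was emphasized by Talagrand [30], who showed that
$$W_2^2(X,Z) \leq 2\, D(X \| Z),$$
holding for any random vector $X$ in $\mathbb{R}^d$, with $Z$ a standard normal random vector. Returning to the setting of Theorem 1.1, define the normalized sums
$$\widetilde{Z}_n = \frac{1}{\sqrt{n}}\,(X_1 + \dots + X_n).$$
By the classical CLT, the distributions of $\widetilde{Z}_n$ are weakly convergent to the Gaussian limit $Z$. Since also $\mathbb{E}\,|\widetilde{Z}_n|^2 = \mathbb{E}\,|Z|^2 = d$, the above characterization of the convergence in $W_2$ ensures that $W_2(\widetilde{Z}_n, Z) \to 0$, which is a transport CLT. A similar conclusion can also be made on the basis of Theorem 1.1. Indeed, choose for $f$ a characteristic function supported on a suitable small ball $|t| \leq T$, so that $D(Z_n \| Z) \to 0$, by (3). Applying the Talagrand transport-entropy inequality, we get
$$W_2(\widetilde{Z}_n, Z) \leq W_2(\widetilde{Z}_n, Z_n) + W_2(Z_n, Z) \leq \frac{1}{\sqrt{n}}\,\big(\mathbb{E}\,|X|^2\big)^{1/2} + \sqrt{2\, D(Z_n \| Z)} \to 0.$$
A similar approach was used in [4] to study the rate of convergence in the one-dimensional transport CLT under the 4-th moment assumption.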
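For Gaussian laws both sides of the Talagrand transport-entropy inequality have closed forms, which gives a quick way to see the inequality at work. The sketch below assumes dimension one and compares $N(m, s^2)$ with $N(0,1)$; the parameter names `mean` and `std` and the list of test cases are illustrative choices. It relies on the standard formulas $W_2^2 = m^2 + (s-1)^2$ and $D = \frac{1}{2}(m^2 + s^2 - 1 - \log s^2)$.

```python
import math

def w2_squared(mean, std):
    """Squared quadratic Kantorovich distance between N(mean, std^2) and N(0, 1)."""
    return mean ** 2 + (std - 1.0) ** 2

def relative_entropy(mean, std):
    """D(N(mean, std^2) || N(0, 1)) in nats."""
    var = std ** 2
    return 0.5 * (mean ** 2 + var - 1.0 - math.log(var))

cases = [(0.0, 1.0), (0.3, 1.0), (0.0, 0.5), (1.0, 2.0), (-2.0, 0.1)]
for mean, std in cases:
    lhs = w2_squared(mean, std)
    rhs = 2.0 * relative_entropy(mean, std)
    print(f"mean = {mean:+.1f}, std = {std:.1f}: W2^2 = {lhs:.4f} <= 2 D = {rhs:.4f}")
```

In this Gaussian family the inequality reduces to $\log s \leq s - 1$, so equality holds exactly when $s = 1$, that is, for pure shifts of the standard normal law.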
5. Entropy bounds
Let $(X_n)_{n \geq 1}$ be a sequence of integer valued random vectors in $\mathbb{R}^d$, and let $X$ be a continuous random vector in $\mathbb{R}^d$ with finite second moment, independent of this sequence. As before, we define the normalized sums
$$Z_n = \frac{1}{\sqrt{n}}\,(X + X_1 + \dots + X_n).$$
As is well-known, when the second moment of a continuous random vector in $\mathbb{R}^d$ is fixed, its entropy is maximized on the normal distribution with the same second moment (cf. e.g. [13]). In the case of independent and isotropic $X_n$'s, we have $\mathbb{E}\,|Z_n|^2 \to d$ as $n \to \infty$. Hence $\limsup_{n \to \infty} h(Z_n) \leq h(Z)$, where $Z$ is a standard normal random vector in $\mathbb{R}^d$. The argument to derive a similar bound is based on two elementary lemmas, which involve the discrete Shannon entropy
$$H(Y) = -\sum_k p_k \log p_k.$$
Here, $Y$ is a discrete random vector taking at most countably many values, say $y_k$, with probabilities $p_k$ respectively.
Lemma 5.1.
Let $X$ be a continuous random vector, and let $Y$ be a discrete random vector independent of $X$, both with values in the Euclidean space $\mathbb{R}^d$. Then
$$h(X + Y) \leq h(X) + H(Y).$$
Lemma 5.1 can be derived implicitly from the ideas of [28] about the entropy of mixtures of discrete and continuous random variables. An explicit statement appears in [32, Lemma 11.2] (see also [26]). We include a proof for completeness:
Proof.
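Denote by $p$ the density of $X$ and let $p_k = \mathbb{P}\{Y = y_k\}$ for some finite or infinite sequence $y_k \in \mathbb{R}^d$. Since $X$ and $Y$ are independent, $X + Y$ has density
$$q(x) = \sum_k p_k\, p(x - y_k), \quad x \in \mathbb{R}^d.$$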
We use the convention if . Note that, if , then
[TABLE]
while in the case , we have
[TABLE]
Hence, for all ,
[TABLE]
We may therefore conclude that
[TABLE]
∎
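The bound of Lemma 5.1, read as $h(X+Y) \leq h(X) + H(Y)$, can be probed numerically on a simple mixture. The following sketch assumes $d = 1$, a standard normal $X$, and a variable $Y$ taking the values $0$ and `shift` with probability $1/2$ each (the parameter `shift`, the integration step and the truncation margin are illustrative assumptions). The entropy of the two-component Gaussian mixture is approximated by a midpoint rule.

```python
import math

def phi(x):
    """Standard normal density in dimension one."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mixture_entropy(shift, step=1e-3, margin=12.0):
    """Midpoint-rule approximation of h(X + Y) for X ~ N(0, 1) and
    Y uniform on the two points {0, shift}, independent of X."""
    lo, hi = -margin, shift + margin
    n = int(round((hi - lo) / step))
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * width
        q = 0.5 * phi(x) + 0.5 * phi(x - shift)
        if q > 0.0:
            total -= q * math.log(q) * width
    return total

H_X = 0.5 * math.log(2.0 * math.pi * math.e)  # h(X) for X ~ N(0, 1)
H_Y = math.log(2.0)                           # H(Y) for a fair two-point law

for shift in (0.5, 2.0, 8.0):
    h_sum = mixture_entropy(shift)
    print(f"shift = {shift}: h(X) = {H_X:.4f} <= h(X+Y) = {h_sum:.4f} "
          f"<= h(X) + H(Y) = {H_X + H_Y:.4f}")
```

As the two components separate, the mixture entropy approaches the upper bound, which indicates that the inequality cannot be improved without extra information on the overlap of the components.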
Let us note that a recent sharpening of Lemma 5.1 appears in [25, Theorem III.1], where it is shown that
[TABLE]
where is the conditional entropy, reducing to on independence, and is the supremum of the total variation of the conditional densities from their “mixture complements”, necessarily .
The following lemma is standard and has been used in several applications (see [24]):
Lemma 5.2.
For any integer valued random variable $Y$ with finite second moment,
$$H(Y) \leq \frac{1}{2} \log\Big(2\pi e \Big(\mathrm{Var}(Y) + \frac{1}{12}\Big)\Big). \tag{23}$$
The proof of Lemma 5.2, which we include for completeness, also combines both discrete and differential entropy:
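Proof.
Put $p_k = \mathbb{P}\{Y = k\}$, $k \in \mathbb{Z}$. Consider a continuous random variable $\widetilde{Y}$ with density $q$ defined to be
$$q(x) = p_k \quad \text{for } x \in \Big(k - \frac{1}{2},\, k + \frac{1}{2}\Big), \ k \in \mathbb{Z}.$$
In other words,
$$q(x) = \sum_{k \in \mathbb{Z}} p_k\, 1_{(k - \frac{1}{2},\, k + \frac{1}{2})}(x).$$
Note that
$$\mathbb{E}\, \widetilde{Y} = \sum_{k} p_k \int_{k - \frac{1}{2}}^{k + \frac{1}{2}} x\,dx = \sum_k k\, p_k = \mathbb{E}\, Y,$$
and similarly
$$\mathbb{E}\, \widetilde{Y}^2 = \sum_{k} p_k \int_{k - \frac{1}{2}}^{k + \frac{1}{2}} x^2\,dx = \sum_k \Big(k^2 + \frac{1}{12}\Big) p_k = \mathbb{E}\, Y^2 + \frac{1}{12}.$$
Hence $\mathrm{Var}(\widetilde{Y}) = \mathrm{Var}(Y) + \frac{1}{12}$. Also,
$$h(\widetilde{Y}) = -\sum_k p_k \log p_k = H(Y).$$
Now, since Gaussian distributions maximize the differential entropy for a fixed variance, we conclude that
$$H(Y) = h(\widetilde{Y}) \leq \frac{1}{2} \log\big(2\pi e\, \mathrm{Var}(\widetilde{Y})\big) = \frac{1}{2} \log\Big(2\pi e \Big(\mathrm{Var}(Y) + \frac{1}{12}\Big)\Big).$$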
∎
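The classical bound $H(Y) \leq \frac{1}{2}\log\big(2\pi e(\mathrm{Var}(Y) + \frac{1}{12})\big)$ for integer valued variables can be checked directly on a few probability mass functions. The sketch below is only a sanity check; the four example distributions (including a geometric law truncated at 200 terms and renormalized) are our own illustrative choices.

```python
import math

def shannon_entropy(pmf):
    """Discrete Shannon entropy (nats) of a pmf given as {integer value: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0.0)

def variance(pmf):
    """Variance of the integer valued variable described by pmf."""
    mean = sum(k * p for k, p in pmf.items())
    return sum((k - mean) ** 2 * p for k, p in pmf.items())

def entropy_bound(var):
    """Right-hand side (1/2) log(2 pi e (var + 1/12))."""
    return 0.5 * math.log(2.0 * math.pi * math.e * (var + 1.0 / 12.0))

# Geometric law with success probability 0.2, truncated and renormalized.
geometric = {k: 0.2 * 0.8 ** k for k in range(200)}
mass = sum(geometric.values())
geometric = {k: p / mass for k, p in geometric.items()}

examples = {
    "Bernoulli(0.3)": {0: 0.7, 1: 0.3},
    "uniform on {0,...,9}": {k: 0.1 for k in range(10)},
    "symmetric +-1": {-1: 0.5, 1: 0.5},
    "truncated geometric(0.2)": geometric,
}

for name, pmf in examples.items():
    print(f"{name}: H = {shannon_entropy(pmf):.4f} "
          f"<= bound = {entropy_bound(variance(pmf)):.4f}")
```

The term $1/12$ is the variance of the uniform law on $(-\frac{1}{2}, \frac{1}{2})$, matching the smoothing step used in the proof.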
We are now prepared to establish Theorem 1.3, in fact under somewhat weaker assumptions.
Theorem 5.3.
Given a sequence $(X_n)_{n \geq 1}$ of random vectors with values in $\mathbb{Z}^d$, independent of $X$, assume that for each $k \leq d$, the $k$-th components $X_n^{(k)}$, $n \geq 1$, are uncorrelated and have variance one. Then,
[TABLE]
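Proof.
Putting $S_n = X_1 + \dots + X_n$ and applying Lemma 5.1, we get
$$h(X + S_n) \leq h(X) + H(S_n).$$
Note that
$$h(Z_n) = h(X + S_n) - \frac{d}{2} \log n.$$
By the well-known subadditivity of entropy along components of a random vector (an abstract property on product spaces which is irrelevant to the independence assumption, cf. e.g. [20]), we have
$$H(S_n) \leq \sum_{k=1}^{d} H\big(S_n^{(k)}\big).$$
Here, the entropy functional on the left is applied to the $d$-dimensional random vector, while on the right-hand side of this inequality we deal with one-dimensional entropies. For each $k \leq d$, the $k$-th component $S_n^{(k)}$ of the random vector $S_n$ represents the sum of $n$ uncorrelated integer valued random variables with variance one, so that $\mathrm{Var}(S_n^{(k)}) = n$. Hence, by (23) applied to $S_n^{(k)}$, we have
$$H\big(S_n^{(k)}\big) \leq \frac{1}{2} \log\Big(2\pi e \Big(n + \frac{1}{12}\Big)\Big),$$
and therefore
$$H(S_n) \leq \frac{d}{2} \log\Big(2\pi e \Big(n + \frac{1}{12}\Big)\Big).$$
We conclude that
$$h(Z_n) \leq h(X) + \frac{d}{2} \log\Big(2\pi e \Big(1 + \frac{1}{12 n}\Big)\Big) = h(X) + h(Z) + \frac{d}{2} \log\Big(1 + \frac{1}{12 n}\Big).$$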
∎
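The chain of estimates in this proof can be followed numerically in a case where every quantity is explicit. The sketch below rests on our own assumptions: $d = 1$, i.i.d. Poisson(1) summands (integer valued with variance one; they are not centered, which does not affect entropies), and noise $X$ uniform on $(-\frac{1}{2}, \frac{1}{2})$, so that $h(X) = 0$ and $h(X + S_n) = H(S_n)$ exactly. The explicit bound $h(Z_n) \leq h(X) + \frac{1}{2}\log\big(2\pi e(1 + \frac{1}{12n})\big)$ used for comparison is obtained by combining Lemma 5.1 with the classical variance bound on discrete entropy, and should be read as our bookkeeping rather than a quoted statement.

```python
import math

def poisson_entropy(lam):
    """Shannon entropy (nats) of Poisson(lam), summed over a range that
    carries all but a negligible part of the mass."""
    upper = int(lam + 20.0 * math.sqrt(lam) + 50.0)
    total = 0.0
    for k in range(upper + 1):
        log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
        total -= math.exp(log_p) * log_p
    return total

def entropy_of_smoothed_sum(n):
    """h(Z_n) for Z_n = (X + S_n)/sqrt(n), S_n ~ Poisson(n), X uniform on
    (-1/2, 1/2): h(X + S_n) = H(S_n), and rescaling subtracts log(n)/2."""
    return poisson_entropy(float(n)) - 0.5 * math.log(n)

def upper_bound(n, h_noise=0.0):
    """h(X) + (1/2) log(2 pi e (1 + 1/(12 n))), with h(X) = 0 here."""
    return h_noise + 0.5 * math.log(2.0 * math.pi * math.e * (1.0 + 1.0 / (12.0 * n)))

for n in (1, 10, 100, 1000):
    print(f"n = {n:5d}: h(Z_n) = {entropy_of_smoothed_sum(n):.5f} "
          f"<= bound = {upper_bound(n):.5f}")
```

In this example the gap between the two columns shrinks as $n$ grows, which suggests that the bound is asymptotically tight when $h(X) = 0$.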
Acknowledgements. Research of S.B. was partially supported by the Simons Foundation and NSF grant DMS-1855575. The authors are grateful to both referees for useful comments.
References
[1] Artstein, S.; Ball, K.; Barthe, F.; Naor, A. On the rate of convergence in the entropic central limit theorem. Probab. Theory Related Fields 129 (2004), no. 3, 381–390.
[2] Barron, A. R. Entropy and the central limit theorem. Ann. Probab. 14 (1986), no. 1, 336–342.
[3] Bishop, C. M. Pattern recognition and machine learning. Information Science and Statistics. Springer, New York, 2006. xx+738 pp.
[4] Bobkov, S. G. Entropic approach to E. Rio's central limit theorem for $W_2$ transport distance. Statist. Probab. Lett. 83 (2013), no. 7, 1644–1648.
[5] Bobkov, S. G.; Chistyakov, G. P.; Götze, F. Rate of convergence and Edgeworth-type expansion in the entropic central limit theorem. Ann. Probab. 41 (2013), no. 4, 2479–2512.
[6] Bobkov, S. G.; Madiman, M. The entropy per coordinate of a random vector is highly constrained under convexity conditions. IEEE Trans. Inform. Theory 57 (2011), no. 8, 4940–4954.
[7] Bobkov, S. G.; Madiman, M. Reverse Brunn-Minkowski and reverse entropy power inequalities for convex measures. J. Funct. Anal. 262 (2012), no. 7, 3309–3339.
[8] Bobkov, S. G.; Marsiglietti, A. Local limit theorems for smoothed Bernoulli and other convolutions. arXiv:1901.02984 [math.PR], Preprint. To appear in: Theory Probab. Appl.
