Maximal correlation and the rate of Fisher information convergence in the Central Limit Theorem
Oliver Johnson

TL;DR
This paper investigates how the Fisher information of scaled sums of i.i.d. variables converges in the CLT, linking it to maximal correlation eigenvalues and establishing convergence rates under certain spectral conditions.
Contribution
It introduces a novel connection between Fisher information convergence in the CLT and the spectral properties of maximal correlation eigenvalues, providing new convergence rate results.
Findings
Fisher information of scaled sums converges at an O(1/n) rate under spectral conditions.
A relationship between Fisher information behavior and the second-largest eigenvalue of maximal correlation.
Monotonicity of Fisher information is strengthened assuming eigenvalue inequalities.
Abstract
We consider the behaviour of the Fisher information of scaled sums of independent and identically distributed random variables in the Central Limit Theorem regime. We show how this behaviour can be related to the second-largest non-trivial eigenvalue of the operator associated with the Hirschfeld–Gebelein–Rényi maximal correlation. We prove that, assuming this eigenvalue satisfies a strict inequality, an O(1/n) rate of convergence and a strengthened form of monotonicity hold.
1 Introduction
Consider independent and identically distributed (i.i.d.) random variables taking values in , with mean 0 and variance , and write for their sum. We assume that the have smooth densities, and consider the behaviour of the Fisher information in the Central Limit Theorem regime.
Definition 1.1.
For any random variable with absolutely continuous density we define the Fisher score function (with respect to location parameter) and Fisher information . Further, as in [23], we write the standardized Fisher information (standardized Fisher divergence)
[TABLE]
The quantity is scale–invariant, and is times the quantity sometimes referred to as Fisher divergence or as Fisher information distance. The non-negativity of is equivalent to the standard Cramér-Rao lower bound (see for example [33, Eq. (2.1)]), with equality holding if and only if is Gaussian. Hence, if is ‘small’, then intuitively should be ‘close to Gaussian’. In fact, controlling the standardized Fisher information gives a strong sense of convergence to Gaussian, with control of implying control of total variation distance, Hellinger distance and the supremum distance between densities (see [23, Lemma 1.5]) and relative entropy (see [23, p409]). The fact that absolute continuity is a sufficient condition for the existence of Fisher information is discussed for example in [21, Section 4.4].
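As a rough numerical illustration of Definition 1.1, the sketch below computes the location Fisher information by quadrature and forms the standardized quantity as variance times Fisher information minus one. That form is the usual one from [23] and is an assumption here, since the display is not reproduced above; the helper names `fisher_information` and `standardized_fisher` are hypothetical. A Gaussian should give a value close to zero, while the standard logistic density (Fisher information 1/3, variance π²/3) should give a strictly positive value.

```python
import math

def integrate(func, lo, hi, steps=20000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for k in range(1, steps):
        total += func(lo + k * h)
    return total * h

def fisher_information(logpdf, lo, hi, eps=1e-4):
    """E[score^2], with the score d/dx log f approximated by central differences."""
    def integrand(x):
        score = (logpdf(x + eps) - logpdf(x - eps)) / (2 * eps)
        return score * score * math.exp(logpdf(x))
    return integrate(integrand, lo, hi)

def standardized_fisher(logpdf, lo, hi):
    """variance * I(X) - 1 (assumed form of the standardized Fisher information)."""
    mean = integrate(lambda x: x * math.exp(logpdf(x)), lo, hi)
    var = integrate(lambda x: (x - mean) ** 2 * math.exp(logpdf(x)), lo, hi)
    return var * fisher_information(logpdf, lo, hi) - 1.0

def gaussian_logpdf(x, var=4.0):
    return -0.5 * x * x / var - 0.5 * math.log(2 * math.pi * var)

def logistic_logpdf(x):
    # Standard logistic density e^{-x} / (1 + e^{-x})^2, written in a stable symmetric form.
    return -abs(x) - 2.0 * math.log1p(math.exp(-abs(x)))

j_gauss = standardized_fisher(gaussian_logpdf, -40.0, 40.0)
j_logistic = standardized_fisher(logistic_logpdf, -40.0, 40.0)
print(j_gauss, j_logistic)
```

The Gaussian value vanishes up to quadrature error, matching the equality case of the Cramér-Rao bound, and the logistic value should agree with π²/9 − 1.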
We follow Courtade [13] in analysing Fisher information using quantities related to the (Hirschfeld–Gebelein–Rényi) maximal correlation [19, 20, 30]. It is well-known that the standard (Pearson) correlation coefficient only captures linear relationships between random variables, and hence can be zero even when and are dependent. In contrast, the maximal correlation between random variables , is the largest correlation between non-constant well-behaved functions of them
[TABLE]
Like the mutual information, is zero if and only if and are independent, see [19, 20, 30]. The maximal correlation has found application in information theory partly because of its relation to hypercontractivity and the strong data processing constant [2, 24].
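A small worked example may help fix the contrast between Pearson and maximal correlation. The sketch below uses a hypothetical toy pair (X uniform on {-1, 0, 1} and Y = X²) that is not taken from the paper: the two variables are dependent, their Pearson correlation is zero, and yet the functions f(x) = x² and g(y) = y are perfectly correlated, so the maximal correlation equals 1.

```python
import math

def correlation(pairs, f, g):
    """Pearson correlation of f(X) and g(Y) for equally likely (x, y) pairs."""
    n = len(pairs)
    fx = [f(x) for x, _ in pairs]
    gy = [g(y) for _, y in pairs]
    mf, mg = sum(fx) / n, sum(gy) / n
    cov = sum((a - mf) * (b - mg) for a, b in zip(fx, gy)) / n
    vf = sum((a - mf) ** 2 for a in fx) / n
    vg = sum((b - mg) ** 2 for b in gy) / n
    return cov / math.sqrt(vf * vg)

# X uniform on {-1, 0, 1} and Y = X^2: dependent, yet linearly uncorrelated.
pairs = [(x, x * x) for x in (-1, 0, 1)]

pearson = correlation(pairs, lambda x: x, lambda y: y)
transformed = correlation(pairs, lambda x: x * x, lambda y: y)
print(pearson, transformed)
```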
Courtade [13] gave a direct and simple proof of the monotonicity of Fisher information in the Central Limit Theorem regime, using the fact that for i.i.d. the maximal correlation between sums of different sizes satisfies
[TABLE]
This fact, which we call the Dembo–Kagan–Shepp (DKS) identity [17], can be understood through an equivalent formulation of as the largest non-trivial singular value of conditional expectation operators (see Section 2). This identity was originally proved in [17] under the assumption that have finite variance, a condition subsequently relaxed in [12]. We note that Courtade’s proof [13] of monotonicity via the DKS identity only recovers the result along i.i.d. sequences, which is less general than the ‘leave-one-out’ inequality proved by Artstein, Ball, Barthe and Naor [4]. However, Courtade has subsequently shown that many monotonicity results, including the DKS identity and the general subset inequalities of Madiman and Barron [26] can be seen as immediate consequences of Shearer’s lemma [14].
In this paper we work with a quantity defined in terms of the second-largest non-trivial singular value of the same conditional expectation operators, defined in Definition 2.4 below, and satisfying by the Dembo–Kagan–Shepp identity [17]. Under a technical diagonalizability condition (Assumption 1 below, which is assumed to hold throughout) a more detailed analysis of using the Efron–Stein (ANOVA) decomposition [18] allows us to deduce the following result:
Theorem 1.2.
Consider i.i.d. with mean 0 and variance and smooth densities on . For any , writing for the quantity from Definition 2.4 below, we have
[TABLE]
In other words, if then we achieve an O(1/n) convergence rate of standardized Fisher information.
Theorem 1.2 follows directly by combining Propositions 4.1 and 5.2 below. Note that Artstein, Ball, Barthe and Naor [3] and Johnson and Barron [23] both proved an O(1/n) rate of convergence of standardized Fisher information (and hence of relative entropy) for one-dimensional random variables assuming finiteness of the Poincaré constant (this was extended to the case by [5] under a stronger assumption of log-concavity). However, since (see Lemma 3.6 below), finiteness of the Poincaré constant implies , we can regard our condition as weaker. As with Poincaré constants, positivity condition implies finiteness of moments of all orders (see Proposition 3.5 below). However, unlike finiteness of Poincaré constants, positivity of does not directly require that the support of is connected.
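For a concrete instance of a 1/n rate, consider i.i.d. gamma variables with shape α and unit scale, so that the sum of n of them is gamma with shape nα. Assuming the standardized Fisher information takes the usual form variance times Fisher information minus one, a direct calculation gives the value 2/(nα − 2) whenever nα > 2. The sketch below tabulates this; the helper names and the choice α = 3 are hypothetical, and the example is an illustration rather than a check of the theorem's constant.

```python
def gamma_location_fisher(shape):
    """Location Fisher information of a gamma(shape, scale=1) density, shape > 2.

    The score is (shape - 1)/x - 1, and E[1/X] = 1/(shape - 1),
    E[1/X^2] = 1/((shape - 1)(shape - 2)), which gives 1/(shape - 2).
    """
    if shape <= 2:
        raise ValueError("Fisher information is infinite for shape <= 2")
    return 1.0 / (shape - 2.0)

def standardized_fisher_of_sum(alpha, n):
    """variance * Fisher information - 1 for a sum of n i.i.d. gamma(alpha) variables."""
    shape = n * alpha  # sums of independent gammas with a common scale are gamma
    return shape * gamma_location_fisher(shape) - 1.0

alpha = 3.0
values = [standardized_fisher_of_sum(alpha, n) for n in range(1, 9)]
scaled = [n * j for n, j in enumerate(values, start=1)]
print(values)
print(scaled)
```

The first list decreases in n, and the second stays bounded, which is what a 1/n rate looks like in this family.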
To illustrate the relationship between moments and , we further prove a lower bound on the Fisher information which tightens the lower bound of [23, Lemma 1.4], and which complements the upper bound in Theorem 1.2:
Lemma 1.3.
For i.i.d. the standardized Fisher information satisfies
[TABLE]
where is the skewness of and .
The upper and lower bounds on given by (4) and (5) are compatible in the sense that (since by (20) the ) we know
[TABLE]
where the final inequality simply follows from the case of (5).
The need for finiteness of the Poincaré constant to ensure convergence of Fisher information and of relative entropy was removed in subsequent work of Bobkov, Chistyakov and Götze (see for example [9] for Fisher information and [7, 8] for relative entropy). These papers proved this rate of convergence under the assumption of finite fourth moment, as well as a variety of related results under a moment-matching assumption. Note that (by Lemma 3.3 below) if the fourth moment is infinite, our methods do not give convergence, so our results should be regarded as weaker. However, papers [7, 8] used a detailed argument involving Edgeworth expansions, truncation of densities and analysis of the characteristic function to derive their results. We believe our results are obtained in a more straightforward way, and the connection to maximal correlation in this context may be of independent interest. Further, we prove a novel strengthened form of monotonicity, Theorem 6.3, which places monotonicity and convergence results in the same framework, whereas they have often historically been treated separately.
An alternative perspective was provided by Courtade, Fathi and Pananjady [15], who weakened the Poincaré constant assumption to require only the existence of a Stein kernel (which holds for any centered random variable with connected support). Using this, they proved an rate of convergence in Wasserstein distance and an rate of convergence in relative entropy, with the speed of convergence being dictated by the Stein discrepancy (squared distance from the Stein kernel to the identity). This work has the considerable advantage of holding in more general spaces for . It would be of interest to understand the relationship between our condition and the Stein condition of [15].
The problem of proving information–theoretic versions of the Central Limit Theorem is a long-standing one, the early history of which is reviewed in [22]. In particular, we mention work of Linnik [25] and Shimizu [31]. However, our work follows the idea of studying projections of score functions, and follows a path first set out by Stam [33], Brown [11], Barron [6], as well as exploiting subsequent developments. In particular, the analysis of [23] exploited the fact that in the limit the score function of the limit must simultaneously be both a ridge function (a function ) and close to being the sum , and hence must be close to being linear.
This analysis generalized a key step in the work of Brown (and later in Barron [6]), which was an inequality [11, Lemma 3.1] concerning properties of Hermite polynomials, which are orthogonal in the Gaussian case. Our work can be seen as giving an alternative generalization of this, using an orthogonal function expansion based on the Singular Value Decomposition. The link between these two ideas is the fact that the Hermite polynomials provide the Singular Value Decomposition of conditional expectations in the Gaussian case (see [28, Theorem 3] and Example 3.1).
The structure of the remainder of the paper is as follows: in Section 2 we formally define the conditional expectation operators and the eigenvalue–related quantity . In Section 3 we give examples where we can calculate explicitly, discuss properties of and show how it relates to other quantities. In Section 4 we discuss how standard results allow us to control the value of the standardized Fisher information on convolution, in terms of . In Section 5 we discuss how to control higher order terms in the Dembo–Kagan–Shepp argument, and hence bound in terms of . In Section 6 we show how these arguments imply a stronger form of monotonicity of Fisher information. We conclude with some suggestions for future work in Section 7.
2 Conditional expectation operator definitions
We introduce notation based on [28]. For any probability measure we write for the Hilbert space endowed with inner product . Write and for the law of the relevant random variables, where as before .
Definition 2.1.
Define conditional expectation operator and its adjoint by:
[TABLE]
These maps are adjoint in the sense that (by direct calculation, or the tower law) for all and :
[TABLE]
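The adjointness relation can be checked exactly on a finite example. The sketch below is an illustration under stated assumptions: it takes the two random variables to be S_1 = X_1 and S_2 = X_1 + X_2 for a hypothetical three-point law, and the function names `cond_given_sum` and `cond_given_first` are placeholders for the two conditional expectation maps, not the paper's notation. Both sides of the adjointness identity reduce to the same joint expectation, as the tower law predicts.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy law for X_1, X_2 (i.i.d.); S_1 = X_1 and S_2 = X_1 + X_2.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 3: Fraction(1, 6)}

# Joint law of (S_1, S_2) and its two marginals.
joint = {}
for x1, x2 in product(pmf, repeat=2):
    key = (x1, x1 + x2)
    joint[key] = joint.get(key, 0) + pmf[x1] * pmf[x2]
law_s1, law_s2 = {}, {}
for (s1, s2), p in joint.items():
    law_s1[s1] = law_s1.get(s1, 0) + p
    law_s2[s2] = law_s2.get(s2, 0) + p

def cond_given_sum(f):
    """Return the map s2 -> E[f(S_1) | S_2 = s2]."""
    return lambda s2: sum(p * f(s1) for (s1, t), p in joint.items() if t == s2) / law_s2[s2]

def cond_given_first(g):
    """Return the map s1 -> E[g(S_2) | S_1 = s1]."""
    return lambda s1: sum(p * g(s2) for (t, s2), p in joint.items() if t == s1) / law_s1[s1]

f = lambda s1: s1 ** 2 - 1
g = lambda s2: 2 * s2 + 5

lhs = sum(p * g(s2) * cond_given_sum(f)(s2) for s2, p in law_s2.items())
rhs = sum(p * f(s1) * cond_given_first(g)(s1) for s1, p in law_s1.items())
direct = sum(p * f(s1) * g(s2) for (s1, s2), p in joint.items())
print(lhs, rhs, direct)
```

Exact rational arithmetic is used so that the three quantities agree identically rather than up to rounding.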
Assumption 1.
We assume throughout this paper that the self-adjoint map is diagonalizable.
Definition 2.2.
Under Assumption 1 write for the basis of orthonormal eigenfunctions of , with corresponding eigenvalues and singular values . Here, without loss of generality, we assume that
[TABLE]
We write for the scaled images of these eigenfunctions.
Remark 2.3.
1. Note that by (8) the functions are orthonormal in . Further, note that
[TABLE]
2. Note that and the pair achieves the maximum correlation since by (8) we know
[TABLE]
3. In this i.i.d. case, we can take , with . This choice of functions has the relevant properties since by symmetry (or the fact that averages of i.i.d. random variables form a reverse martingale)
[TABLE]
and
[TABLE]
The DKS identity (3) tells us that no larger value of is possible.
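The role of linear functions here can be illustrated numerically. The sketch below assumes the standard statement of the Dembo–Kagan–Shepp identity, namely that the maximal correlation between partial sums S_m and S_n (m ≤ n) of i.i.d. variables is the square root of m/n, and checks that the plain Pearson correlation of S_m and S_n already attains that value. The three-point law is a hypothetical choice, and the code only confirms that linear functions reach the bound, not that nothing exceeds it.

```python
import math
from itertools import product

# Hypothetical three-point law; any non-degenerate law with finite variance would do.
pmf = {0: 0.5, 1: 0.3, 3: 0.2}

def correlation_of_partial_sums(m, n):
    """Exact Pearson correlation of S_m and S_n by enumerating all outcomes."""
    e_m = e_n = e_mm = e_nn = e_mn = 0.0
    for xs in product(pmf, repeat=n):
        p = math.prod(pmf[x] for x in xs)
        s_m, s_n = sum(xs[:m]), sum(xs)
        e_m += p * s_m
        e_n += p * s_n
        e_mm += p * s_m * s_m
        e_nn += p * s_n * s_n
        e_mn += p * s_m * s_n
    cov = e_mn - e_m * e_n
    return cov / math.sqrt((e_mm - e_m ** 2) * (e_nn - e_n ** 2))

cases = [(1, 2), (1, 4), (2, 5)]
results = [(m, n, correlation_of_partial_sums(m, n), math.sqrt(m / n)) for m, n in cases]
for row in results:
    print(row)
```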
The focus of this paper will be the quantity defined in terms of the second-highest non-trivial eigenvalue of the self-adjoint map as:
Definition 2.4.
Using the notation above, write
[TABLE]
The Dembo–Kagan–Shepp identity [17] means that for , eigenvalues are , which ensures that . While we are not aware of existing results in the literature that bound for , we remark that the higher order eigenfunctions and (for , for some fixed ) have been used in a manner similar to Principal Components Analysis to capture significant high-order features of datasets [27].
One possible strategy to show that is to show that and are compact operators (recall from e.g. [32, Section 3.1] that a compact linear operator is one for which the image of any bounded subset has compact closure). The Riesz–Schauder Theorem [32, Theorem 3.3.1] states that the only possible accumulation point of eigenvalues of a compact operator is at 0, so if the eigenspace corresponding to has dimension 1 then we can deduce . We consider the second point in Remark 5.5 below, and discuss the question of compactness now.
This compactness is stated as [10, Assumption 5.2], which states that it ‘is satisfied in most cases of interest’ and in particular if a sufficient condition [10, Eq. (5.4)] holds – we derive this condition here for completeness. As in [28, Eq. (40)] we can expand the Radon-Nikodym derivative between joint and marginal densities using the Singular Value Decomposition as:
[TABLE]
Note that and , so
[TABLE]
where is symmetric, as expected. Then, is compact if this is a trace-class operator (see [32, Section 3.6]), or in other words that we can use Mercer’s Theorem ([32, Theorem 3.11.9]) to verify that
[TABLE]
(this is [10, Eq. (5.4)]).
Note that this quantity has the property that , where is an independent copy of and we write for the -divergence. Using this, an anonymous referee provided a proof of the following theorem, which shows that arbitrarily small Gaussian regularizations of sub-Gaussian random variables have the trace-class property:
Theorem 2.5.
For any , taking Gaussian with mean 0 and variance and , then writing for an independent copy of :
[TABLE]
Hence if is sub-Gaussian then for sufficiently large and hence is compact.
Proof.
See Appendix A. ∎
We briefly mention that by linearizing the logarithm, we can bound the mutual information . Further, , where the lower bound on follows by Shannon’s Entropy Power Inequality. In other words, finiteness of ensures a reverse Entropy Power Inequality of the form , where .
Example 2.6.
In the case where , we can explicitly write down , and direct calculation gives that
[TABLE]
This confirms the values in Example 3.1 below, which gives that the eigenvalues are , and so , confirming the value of the trace by Lidskii’s Theorem [32, Corollary 3.12.3].
Remark 2.7.
This formulation gives an alternative proof of the Dembo–Kagan–Shepp identity for , using the fact that . Fix and for function , write . Then
[TABLE]
Hence, for any with , Cauchy-Schwarz gives
[TABLE]
since . The result follows on multiplying by and integrating, to deduce that , or .
Note that (see also Remark 5.5 below) equality holds in (2.7) if and only if is constant in . Taking a derivative with respect to , we deduce that must be constant, or that linear is the unique eigenfunction achieving .
3 Conditional expectation operator properties
We now review two examples where we can explicitly calculate the eigenfunctions and eigenvalues of , using properties of orthogonal polynomials [1], and hence deduce the value of . Instead of orthogonal polynomials, these calculations can alternatively be performed using properties of the associated semigroups (Ornstein–Uhlenbeck and Laguerre semigroups, respectively). First, the Gaussian case (see also [28]):
Example 3.1.
If are Gaussian with variance , then and are orthonormalized Hermite polynomials. For any we define (where are the Hermite polynomials, which are orthogonal with respect to standard Gaussian weights). By adapting the addition formula [1, Eq. (22.12.8)] or by direct calculation using the generating function we know that for any , and :
[TABLE]
Taking in (15), and since for Gaussian with mean 0 and variance we know for , we can deduce that
[TABLE]
Taking with and we have as required.
For completeness, the property that follows since for fixed the , where is Gaussian with mean 0 and variance . Hence taking in the addition formula (15) we obtain
[TABLE]
where the final identity follows by definition of .
We deduce that and so , with in particular.
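The Hermite eigenfunction property can be checked by direct quadrature. The sketch below works with standardized variables, which is an assumption about normalization rather than the paper's exact scaling: if U and V are standard Gaussians with correlation r, then conditioning the k-th probabilists' Hermite polynomial of U on V = v returns r^k times the same polynomial at v. Taking r to be the square root of 1/n corresponds to the pair S_1 and S_n, so the singular values would be n^{-k/2} under that reading. The function names are hypothetical.

```python
import math

def hermite(k, x):
    """Probabilists' Hermite polynomial He_k(x) via He_{j+1} = x He_j - j He_{j-1}."""
    prev, cur = 1.0, x
    if k == 0:
        return prev
    for j in range(1, k):
        prev, cur = cur, x * cur - j * prev
    return cur

def conditional_hermite(k, v, r, steps=4000, width=10.0):
    """E[He_k(U) | V = v] for standard Gaussians with correlation r (trapezoid rule).

    Given V = v, U is Gaussian with mean r*v and variance 1 - r^2.
    """
    sd = math.sqrt(1.0 - r * r)
    lo, hi = r * v - width * sd, r * v + width * sd
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-0.5 * ((u - r * v) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        total += weight * hermite(k, u) * density
    return total * h

n = 4
r = 1.0 / math.sqrt(n)  # correlation between S_1 and S_n in the i.i.d. case
checks = [(k, v, conditional_hermite(k, v, r), r ** k * hermite(k, v))
          for k in (1, 2, 3) for v in (-1.5, 0.3, 2.0)]
for row in checks:
    print(row)
```

In each row the last two entries should agree to quadrature accuracy.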
Next, we give a similar argument in the gamma distributed case. Note that although the do not have mean 0, the argument carries through essentially unchanged on centering.
Example 3.2.
If are distributed then, writing for the generalized Laguerre polynomials (orthogonal with respect to ), a similar addition formula [1, Eq. (22.12.6)] holds:
[TABLE]
For with and we deduce as required.
The property that follows by expressing the conditional density of in terms of a beta function and using [1, Eq. (22.13.13)]:
[TABLE]
to deduce that , and rescaling.
Hence and so , with in particular.
Note that (as we might expect) the larger the value of , the closer the value of obtained in Example 3.2 becomes to the value obtained for the Gaussian case in Example 3.1.
Next, motivated by the fact that in both the Gaussian and gamma cases the eigenfunction is quadratic, we use properties of quadratic functions to deduce an upper bound on involving third and fourth moments.
Lemma 3.3.
For i.i.d. with mean 0 and variance , define the scale-invariant quantity (kurtosis minus squared skewness minus ), which does not depend on . Then
[TABLE]
In particular, taking in (19) we deduce
[TABLE]
Proof.
Consider the function , where taking ensures that as required. Direct calculation shows that . Further, expanding the square we can show and (this expression as the expectation of a square ensures that holds). Since it is expressed as an infimum over all functions,
[TABLE]
as required. ∎
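The moment calculation behind this proof can be checked exactly on a small example. The sketch below rests on two reconstructions that should be read as assumptions, since the corresponding symbols are not reproduced above: that the scale-invariant quantity is kurtosis minus squared skewness minus 1, and that the test function is the quadratic g(x) = x² − a x − σ² with a equal to the third moment divided by the variance. It verifies that g has mean zero, is uncorrelated with x, and has second moment σ⁴ times the quantity, and it evaluates the quantity for the Gaussian and gamma cases from their standard central moments.

```python
from fractions import Fraction

def central_moments(pmf):
    """Return (variance, third, fourth) central moments of a finite pmf."""
    mean = sum(p * x for x, p in pmf.items())
    return tuple(sum(p * (x - mean) ** k for x, p in pmf.items()) for k in (2, 3, 4))

def pearson_gap(var, third, fourth):
    """kurtosis - skewness^2 - 1 (assumed form of the scale-invariant quantity)."""
    return fourth / var ** 2 - third ** 2 / var ** 3 - 1

# Hypothetical centred toy law: the mean is 0 by construction.
pmf = {-2: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}
var, third, fourth = central_moments(pmf)

a = third / var
g = lambda x: x * x - a * x - var          # quadratic test function (assumed form)
mean_g = sum(p * g(x) for x, p in pmf.items())
cross = sum(p * x * g(x) for x, p in pmf.items())
second = sum(p * g(x) ** 2 for x, p in pmf.items())

gaussian_gap = pearson_gap(1, 0, 3)        # N(0,1): central moments 1, 0, 3
# gamma(alpha, scale 1): central moments alpha, 2*alpha, 3*alpha^2 + 6*alpha
gamma_gap = lambda alpha: pearson_gap(alpha, 2 * alpha, 3 * alpha ** 2 + 6 * alpha)
print(mean_g, cross, second, var ** 2 * pearson_gap(var, third, fourth))
print(gaussian_gap, gamma_gap(Fraction(5)))
```

Under these assumptions the gamma value is 2 + 2/α, which approaches the Gaussian value 2 as α grows.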
Remark 3.4.
We observe that:
1. Equation (20) shows that if then . Equivalently if , we know (and the Poincaré constant is infinite).
2. Note that the values of found in Examples 3.1 and 3.2 both satisfy (20) with equality, because the relevant eigenfunction is quadratic. In the Gaussian case Example 3.1, , consistent with the value . In the gamma case Example 3.2, , consistent with the value .
3. Note also that (20) means that if (which, roughly speaking, corresponds to having heavier tails than the Gaussian) then by (19) the (smaller than the value in the Gaussian case, Example 3.1).
Indeed, we can prove similar (if more involved) bounds which show that positivity of implies finiteness of all moments. As before, the following proposition implies that if and the th moment of is finite then the th moment of must be finite.
Proposition 3.5.
Writing for the th moment of and for its variance, there exist functions and (depending on moments of lower orders) such that
[TABLE]
Proof.
Write
[TABLE]
for the th moment of . As in Lemma 3.3, consider the function , where taking ensures that . Using (21) we can expand
[TABLE]
Substituting and taking expectations, we deduce that
[TABLE]
meaning that we can rewrite (22) as
[TABLE]
where
[TABLE]
Since by construction , we can deduce by independence of and that the cross terms vanish so that
[TABLE]
so that
[TABLE]
We deduce the result using (24) and the facts that and
[TABLE]
∎
Lemma 3.6.
Assuming , the finiteness of the Poincaré constant implies that . Indeed:
[TABLE]
Proof.
We can deduce this using [23, Proposition 2.1] which, for and i.i.d., gives that for any with and taking then
[TABLE]
for some , . The proof of [23, Proposition 2.1] states that . Further, by symmetry, the condition implies that , so the RHS of (26) is . Rearranging, we deduce that
[TABLE]
and the result follows on rearranging. ∎
4 Behaviour of the Fisher information on convolution
We now consider how the standardized Fisher information behaves on convolution, under a standard Central Limit Theorem scaling. That is, as in [13], we write . Note that in the i.i.d. regime, since (see e.g. [11, Eq. (2.3)]) we know that (scale-invariance of ).
Proposition 4.1.
For i.i.d. the standardized Fisher information satisfies
[TABLE]
Proof.
Observe (see for example [13, Eq. (3)], [33]) that the score function of the sum satisfies
[TABLE]
which we can rewrite as . Hence if we expand the score function as a sum of eigenfunctions
[TABLE]
then Definition 2.2 gives that:
[TABLE]
Further, direct calculation using integration by parts gives that
[TABLE]
This means that, using the fact that (see Remark 2.3.2) the , with we can write the standardized score functions of and from (4) as sums of eigenfunctions starting at index 2, as:
[TABLE]
Then, direct calculation using the orthonormality of and gives that:
[TABLE]
using the fact that for by (10). ∎
We can use a similar argument to prove the lower bound on Fisher information, Lemma 1.3:
Proof of Lemma 1.3.
As in Lemma 3.3 consider the function where . As above, since
[TABLE]
Now considering the LHS of (34) using Cauchy-Schwarz, we deduce that
[TABLE]
since as before , and the result follows by rearrangement. ∎
This lower bound tightens [23, Lemma 1.4], which (in our notation) can be expressed as
[TABLE]
where the original result is expressed in terms of the excess kurtosis .
5 Higher order Dembo–Kagan–Shepp terms
Proposition 4.1 gives one part of the proof of Theorem 1.2. However, this result as stated is not particularly helpful, since the form of the dependence of on is not immediately clear. We complete the proof of Theorem 1.2 by proving Proposition 5.2 below, which allows us to control .
The key observation is that we can analyse higher order terms in the Dembo–Kagan–Shepp argument, following the proof of [17, Lemma 2].
Lemma 5.1.
Fix , and consider a function with . Then
[TABLE]
where and .
Proof.
We adopt the same notation as [17, Section 2]. As in [17, Eq. (14), (15)], we can perform an Efron–Stein (ANOVA) expansion [18] of and (using the same functions in each case) to obtain
[TABLE]
The key observation is that for any and any , direct comparison of the two terms gives
[TABLE]
with equality if and only if . Applying this to the Efron–Stein decompositions (36) and (37) we obtain
[TABLE]
as required. ∎
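The Efron–Stein (ANOVA) decomposition used in this proof can be made concrete in the simplest case of two coordinates. The sketch below is only the n = 2 instance with a hypothetical three-point law and a hypothetical test function of the sum; it is not the general argument of [17]. It splits the function into a constant, two main effects and an interaction, then checks exactly that the pieces are orthogonal and that the variance splits accordingly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy law for two i.i.d. coordinates.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 3: Fraction(1, 6)}
points = list(product(pmf, repeat=2))
prob = {(x, y): pmf[x] * pmf[y] for x, y in points}

def expect(h):
    """Expectation of h(X_1, X_2) under the product law."""
    return sum(prob[pt] * h(*pt) for pt in points)

f = lambda x, y: (x + y) ** 3          # a function of the sum S_2 = X_1 + X_2

# Efron-Stein (ANOVA) components: constant, main effects, interaction.
f0 = expect(f)
# E[f | X_1 = x] - f0; by symmetry of f the same map serves for X_2.
main = {x: sum(pmf[y] * f(x, y) for y in pmf) - f0 for x in pmf}
inter = {(x, y): f(x, y) - f0 - main[x] - main[y] for x, y in points}

total_var = expect(lambda x, y: (f(x, y) - f0) ** 2)
var_main = 2 * sum(pmf[x] * main[x] ** 2 for x in pmf)   # two identical main effects
var_inter = expect(lambda x, y: inter[(x, y)] ** 2)
cross = expect(lambda x, y: main[x] * inter[(x, y)])
print(total_var, var_main + var_inter, cross)
```

The orthogonality seen here is what allows terms of different orders to be compared separately.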
We now deduce a result which, when combined with Proposition 4.1 above, allows us to deduce the proof of Theorem 1.2:
Proposition 5.2.
The quantity is non-decreasing in . Specifically for any :
[TABLE]
Proof.
The key fact is that the function arising in Lemma 5.1 can be understood as the conditional expectation of both and (this is remarked at the foot of [17, P.345], and is due to orthogonality of the Efron–Stein decomposition). That is, for we can write
[TABLE]
since . Hence for any (and hence and ) we can write
[TABLE]
so the RHS of (38) becomes
[TABLE]
or dividing by and taking the optimal :
[TABLE]
and the result (39) follows on taking and . ∎
Note that we can weaken the assumption that to ensure convergence of Fisher information, to simply require that for some . If this is true, we can simply replace (39) by a bound of the form and substitute this in Proposition 4.1 instead.
Example 5.3.
In the Gaussian and gamma cases of Examples 3.1 and 3.2 the result of Proposition 5.2 is sharp. That is, if then recall that and hence . Similarly if then and .
This sharpness holds because in both Example 3.1 and 3.2 the optimal eigenfunction is quadratic, so in the Efron–Stein decomposition the .
Remark 5.4.
By combining Proposition 5.2 with Equation (19) we can deduce that is bounded above and below by linear functions in , assuming and are non-zero, as
[TABLE]
Remark 5.5.
Although not mentioned in [17], similar arguments show that under regularity conditions there should be a unique eigenfunction achieving eigenvalue (we know from Remark 2.3.3 above that the linear functions achieve this). That is, assuming there is equality in
[TABLE]
if and only if . Hence there is equality in [17, Lemma 2] if and only if . Hence except on a set of measure 0 we know that
[TABLE]
Assuming is twice differentiable then taking a derivative with respect to and this implies for all , so is linear.
6 Strengthened monotonicity
We can extend the arguments above to deduce a stronger form of monotonicity of Fisher information than that obtained by [4] and [13], at least in the i.i.d. case:
Definition 6.1.
For , define by , and write for the ordered eigenvalues of , where and (by DKS [17] (3)) . Again, write .
Define a generalization of as
[TABLE]
As before, the Dembo–Kagan–Shepp identity [17] ensures that . Note we recover Definition 2.4 by taking . We now give a result which generalizes Proposition 4.1.
Proposition 6.2.
For i.i.d. the standardized Fisher information satisfies
[TABLE]
Proof.
We repeat the steps of the proof of Proposition 4.1. Again (see for example [13, Eq. (3)], [33]) the score function of the sum satisfies
[TABLE]
which we can rewrite as . Hence if we expand the score function
[TABLE]
then
[TABLE]
As before, direct calculation using integration by parts gives that
[TABLE]
Again (as in Remark 2.3.2) the , with so we can write the standardized score functions of and from (4) as sums of eigenfunctions starting at index 2, as:
[TABLE]
Just as before, we can use the orthonormality of and to deduce
[TABLE]
using the fact that for . ∎
As in [13], taking the Dembo–Kagan–Shepp bound in Proposition 6.2 we recover the monotonicity of standardized Fisher information [4]. However, we can obtain better bounds by taking and in Lemma 5.1 to obtain
[TABLE]
Rearranging, and optimizing over we deduce that
[TABLE]
Since this is an increasing function of , we can replace by the lower bound from (39) to obtain
[TABLE]
which, in Proposition 6.2 allows us to deduce the stronger form of monotonicity that:
Theorem 6.3.
Consider i.i.d. with mean 0 and variance and smooth densities on . Writing for the quantity from Definition 2.4, the standardized Fisher information has the property that
[TABLE]
Note that this is a simultaneous strengthening of Theorem 1.2 and of the monotonicity of Fisher information proved in the i.i.d. case by Artstein et al. [4] and [13].
7 Future work
We briefly mention some future directions for research. Note that some progress is made towards 1. and 2. in Appendix A below:
1. In order to increase the value of these results, it is a natural question to ask for sufficient conditions (in terms of the density or other related quantities) under which , and indeed to give explicit bounds of the form for some .
2. Additionally, it would be of value to give conditions on under which we can bound uniformly away from 0 for all , for random variables of the form , where is an independent Gaussian perturbation. Such a result would allow us to derive convergence of relative entropy using the de Bruijn identity [33].
3. Since the monotonicity of entropy is equivalent to strengthened forms of Shannon’s Entropy Power Inequality (see [4, 26, 34]), it would be of interest to know if the strengthened monotonicity result Theorem 6.3 implies a stronger Entropy Power Inequality.
4. The results of this paper very much rely on the i.i.d. assumption. It is of interest to weaken this to the independent, but not identical setting, and indeed to dependent random variables, for example in the exchangeable setting. For example, Peccati [29] shows that a decomposition of the Efron–Stein type used to establish the Dembo–Kagan–Shepp identity holds if an exchangeable sequence has the ‘weak independence’ property. It is a natural question whether the results of this paper hold in that setting.
5. Following recent trends in information-theoretic Central Limit Theorems, it would be of interest to extend the results of this paper to the setting of , and to understand the behaviour of the eigenfunctions of in this setting, where the equivalent of (29) still holds (see e.g. [22, Lemma 3.4]).
Appendix A Proof of Theorem 2.5
The following argument was provided by an anonymous referee, for which the author is extremely grateful.
We write for a Gaussian density centred at with covariance matrix . As before, we write for the -divergence. We first state two lemmas:
Lemma A.1.
For any coupling of and :
[TABLE]
Proof.
Follows immediately from the joint convexity of -divergences (see for example [16, Lemma 4.1]). ∎
Lemma A.2.
For any , write for the two-dimensional identity matrix, and define the positive semi-definite matrix
[TABLE]
then for any and :
[TABLE]
Proof.
The key is that for we can express the ratio
[TABLE]
as a product of Gaussian densities, which integrate to 1. ∎
Proof of Theorem 2.5.
Write for i.i.d. copies of and independent , and define regularized . Define and . Further, define and to be independent copies of and respectively, write , and define and .
Using the invariance of -divergences under mappings we can write
[TABLE]
For any , we can consider the coupling between and given by , and note that . This means that we can express the -divergence arising in the formula for as
[TABLE]
where we apply Lemma A.1 followed by Lemma A.2, and where .
If is sub-Gaussian, then (see for example [35, Proposition 2.6.1]) so is , and hence (see [35, Proposition 2.5.2]) there exists a constant such that the moment generating function of satisfies
[TABLE]
when . Hence taking we deduce that (49) is bounded above by
[TABLE]
assuming that , or equivalently . ∎
Observe that we can use (49) to deduce asymptotic bounds on for sub-Gaussian random variables under sufficiently large amounts of Gaussian regularization. That is, if we write for the eigenvalues of the operator , then since we can express the trace as
[TABLE]
so that Taylor expanding the exponential
[TABLE]
using the fact that (see [35, P.26]) . Hence we deduce:
Corollary A.3.
Writing where is sub-Gaussian and , and writing for the eigenvalues of the operator we deduce:
[TABLE]
and hence for sufficiently large for sub-Gaussian random variables regularized by a sufficiently large amount.
Note that this allows us to deduce convergence of Fisher information for random variables of this type. Indeed, it allows us to deduce convergence of relative entropy using the de Bruijn identity (see for example [22, Eq. (1.110)]) which expresses the relative entropy of a random variable with density to a standard Gaussian density as the integral of standardized Fisher information
[TABLE]
Using (50), since adding provides extra regularization we can deduce bounds for all on the second–largest eigenvalue of the form and combining Theorem 1.2 with (51) we deduce convergence of relative entropy, using the fact that
[TABLE]
for .
Acknowledgements
The author would like to thank Professor Thomas Courtade of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley for extremely helpful discussions regarding this work, and for numerous pointers to relevant papers in the literature. I would also like to thank Professor Venkat Anantharam of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley for valuable suggestions concerning the maximal correlation. The idea to consider the eigenfunctions in the maximal correlation problem grew out of a Twitter conversation with Dr James V Stone, Honorary Reader in Vision and Computational Neuroscience at the University of Sheffield. The author would like to thank the Associate Editor, and three anonymous referees for their close reading of this paper and extremely helpful suggestions.
- [1] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55 of National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, 1964.
- [2] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover, 2013. See: arXiv:1304.6133.
- [3] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. On the rate of convergence in the entropic central limit theorem. Probab. Theory Related Fields, 129(3):381–390, 2004.
- [4] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on the monotonicity of entropy. J. Amer. Math. Soc., 17(4):975–982 (electronic), 2004.
- [5] K. Ball and V. H. Nguyen. Entropy jumps for isotropic log-concave random vectors and spectral gap. Studia Mathematica, 213:81–96, 2012.
- [6] A. R. Barron. Entropy and the Central Limit Theorem. Ann. Probab., 14(1):336–342, 1986.
- [7] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Rate of convergence and Edgeworth-type expansion in the entropic central limit theorem. Ann. Probab., 41(4):2479–2512, 2013.
- [8] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Berry–Esseen bounds in the entropic central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
