Maximal correlation and the rate of Fisher information convergence in the Central Limit Theorem

Oliver Johnson

arXiv:1905.11913·cs.IT·September 19, 2023

TL;DR

This paper investigates how the Fisher information of scaled sums of i.i.d. variables converges in the CLT, linking it to maximal correlation eigenvalues and establishing convergence rates under certain spectral conditions.

Contribution

It introduces a novel connection between Fisher information convergence in the CLT and the spectral properties of maximal correlation eigenvalues, providing new convergence rate results.

Findings

1. The Fisher information of scaled sums converges at an $O(1/n)$ rate under a spectral condition.

2. The behaviour of the Fisher information is related to the second-largest non-trivial eigenvalue associated with the maximal correlation.

3. A strengthened form of monotonicity of Fisher information holds, assuming this eigenvalue satisfies a strict inequality.

Abstract

We consider the behaviour of the Fisher information of scaled sums of independent and identically distributed random variables in the Central Limit Theorem regime. We show how this behaviour can be related to the second-largest non-trivial eigenvalue associated with the Hirschfeld–Gebelein–Rényi maximal correlation. We prove that assuming this eigenvalue satisfies a strict inequality, an $O(1/n)$ rate of convergence and a strengthened form of monotonicity hold.

Equations

$$J_{\rm st}(X):=({\rm Var\;}X)\,\mathbb{E}\left(\varrho_{X}(X)+\frac{X}{{\rm Var\;}X}\right)^{2}=({\rm Var\;}X)J(X)-1.$$

$$\rho_{\max}(U,V):=\sup_{f,g}\rho_{PCC}(f(U),g(V)).$$

$$\rho_{\max}(S_{m},S_{n})=\sqrt{\frac{m}{n}}\quad\mbox{for all $m\leq n$}.$$

$$J_{\rm st}\left(\frac{Y_{1}+\dots+Y_{n}}{\sqrt{n}}\right)\leq\frac{J_{\rm st}(Y)}{1+\Theta^{(2)}(n-1)}.$$

$$J_{\rm st}(U_{n})\geq\frac{\gamma_{3}^{2}}{\Sigma+2(n-1)},$$

$$\frac{J_{\rm st}(Y)}{1+\Theta^{(2)}(n-1)}\geq\frac{J_{\rm st}(Y)}{1+(2/\Sigma)(n-1)}=\frac{\Sigma J_{\rm st}(Y)}{\Sigma+2(n-1)}\geq\frac{\gamma_{3}^{2}}{\Sigma+2(n-1)},$$

$$(C_{(n)}f)(s)$$

$$(C_{(n)}^{*}g)(y)$$

$$\langle g,C_{(n)}f\rangle_{\mathbb{P}_{S_{n}}}=\langle C_{(n)}^{*}g,f\rangle_{\mathbb{P}_{Y}}=\mathbb{E}\left[f(Y_{i})g(S_{n})\right].$$

$$1=\lambda^{(n)}_{0}\geq\lambda^{(n)}_{1}\geq\dots\geq 0.$$

$$C_{(n)}^{*}(g^{(n)}_{k})=\frac{1}{\mu^{(n)}_{k}}\left(C_{(n)}^{*}C_{(n)}f^{(1)}_{k}\right)=\mu^{(n)}_{k}f^{(1)}_{k}.$$

$$\mathbb{E}\left(f^{(1)}_{1}(Y_{i})g^{(n)}_{1}(S_{n})\right)=\langle g^{(n)}_{1},C_{(n)}f^{(1)}_{1}\rangle_{\mathbb{P}_{S_{n}}}=\langle g^{(n)}_{1},\mu^{(n)}_{1}g^{(n)}_{1}\rangle_{\mathbb{P}_{S_{n}}}=\mu^{(n)}_{1}.$$

$$(C_{(n)}f^{(1)}_{1})(s)=\frac{1}{\sigma}\mathbb{E}(Y_{1}\mid S_{n}=s)=\frac{s}{\sigma n}=\mu^{(n)}_{1}g^{(n)}_{1}(s),$$

$$(C_{(n)}^{*}g^{(n)}_{1})(y)=\frac{1}{\sigma\sqrt{n}}\mathbb{E}(y+Y_{2}+\dots+Y_{n})=\frac{1}{\sigma\sqrt{n}}y=\mu^{(n)}_{1}f^{(1)}_{1}(y).$$

$$\Theta^{(n)}:=\frac{1}{n\lambda^{(n)}_{2}}-1=\inf_{h:\,\mathbb{E}h(S_{n})=\mathbb{E}S_{n}h(S_{n})=0}\frac{1}{n}\frac{\mathbb{E}\left(h(S_{n})^{2}\right)}{\mathbb{E}\left((C_{(n)}^{*}h)(Y)^{2}\right)}-1.$$

$$\tau_{n}(y,s):=\frac{p_{Y_{1},S_{n}}(y,s)}{p_{Y_{1}}(y)p_{S_{n}}(s)}=\sum_{k=0}^{\infty}\mu^{(n)}_{k}f^{(1)}_{k}(y)g^{(n)}_{k}(s).$$

$$(C_{(n)}^{*}C_{(n)}f)(y)=\int f(z)p_{Y}(z)L_{n}(z,y)dz,$$

$$T_{n}(Y):=\int p_{Y}(y)L_{n}(y,y)dy=\iint p_{Y}(y)p_{S_{n}}(s)\tau_{n}(y,s)^{2}dyds<\infty$$

$$D_{\chi^{2}}(p_{Y_{1},S_{n}}\|p_{Y}\times p_{S_{n}})\leq\frac{1}{1-1/n}\,\mathbb{E}\left(\exp\left(\frac{(X_{1}-X_{1}^{\prime})^{2}}{(n-1)\delta^{2}}\right)\right).$$

$$T_{2}=\iint p_{Y}(y)p_{S_{2}}(s)\tau_{2}(y,s)^{2}dyds=2.$$

$$C_{(n)}\overline{f}(s)=\int\tau_{2}(z,s)p_{Y}(z)\overline{f}(z)dz=\int\tau_{2}(s-z,s)p_{Y}(s-z)f(s-z)dz=C_{(n)}f(s).$$

$$(2C_{(n)}f(s))^{2}$$

$$H_{m}^{(n\tau^{2})}(x+y)=\sum_{k=0}^{m}\binom{m}{k}\left(\frac{n-1}{n}\right)^{k/2}\left(\frac{1}{n}\right)^{(m-k)/2}H_{m-k}^{(\tau^{2})}(x)H_{k}^{((n-1)\tau^{2})}(y).$$

$$C_{(n)}^{*}H_{m}^{(n\sigma^{2})}(x)=\mathbb{E}H_{m}^{(n\sigma^{2})}(x+Z)=\frac{1}{n^{m/2}}H_{m}^{(\sigma^{2})}(x).$$

$$C_{(n)}H_{m}^{(\sigma^{2})}(s)=\mathbb{E}H_{m}^{(\sigma^{2})}(s/n+Z)=\frac{1}{n^{m/2}}H_{m}^{(\sigma^{2}/n)}(s/n)=\frac{1}{n^{m/2}}H_{m}^{(n\sigma^{2})}(s),$$

$$L_{m}^{(\alpha+\beta+1)}(x+y)=\sum_{i=0}^{m}L_{i}^{(\alpha)}(x)L_{m-i}^{(\beta)}(y).$$

$$\Gamma(\beta n+k)\int_{0}^{z}(z-y)^{\beta(n-1)-1}y^{\beta-1}L_{k}^{(\beta-1)}(y)dy=\Gamma(\beta+k)\Gamma(\beta(n-1))z^{\beta n-1}L_{k}^{(\beta n-1)}(z),$$

$$\Theta^{(n)}\leq\frac{2(n-1)}{\Sigma},$$

$$\Theta^{(2)}\leq\frac{2}{\Sigma}.$$

$$\Theta^{(n)}\leq\frac{\mathbb{E}h(S_{n})^{2}}{n\mathbb{E}(C_{(n)}^{*}h)(Y)^{2}}-1=\frac{2(n-1)}{\Sigma},$$

$$\Theta^{(2)}\leq\frac{\sigma^{2}m_{2k-2}+B_{1,k}(m_{1},\dots,m_{2k-1})}{2m_{2k}+B_{2,k}(m_{1},\dots,m_{2k-2})}.$$

Full text

Abstract

We consider the behaviour of the Fisher information of scaled sums of independent and identically distributed random variables in the Central Limit Theorem regime. We show how this behaviour can be related to the second-largest non-trivial eigenvalue of the operator associated with the Hirschfeld–Gebelein–Rényi maximal correlation. We prove that assuming this eigenvalue satisfies a strict inequality, an $O(1/n)$ rate of convergence and a strengthened form of monotonicity hold.

1 Introduction

Consider independent and identically distributed (i.i.d.) random variables $Y_{i}\sim Y$ taking values in $\mathbb{R}$, with mean $0$ and variance $\sigma^{2}<\infty$, and write $S_{n}:=Y_{1}+Y_{2}+\ldots+Y_{n}$ for their sum. We assume that the $Y_{i}$ have smooth densities, and consider the behaviour of the Fisher information in the Central Limit Theorem regime.

Definition 1.1.

For any random variable $X\in\mathbb{R}$ with absolutely continuous density $p_{X}$ we define the Fisher score function (with respect to location parameter) $\varrho_{X}(x):=p_{X}^{\prime}(x)/p_{X}(x)$ and Fisher information $J(X):=\mathbb{E}\varrho_{X}(X)^{2}=\int p_{X}(x)\varrho_{X}(x)^{2}dx$ . Further, as in [23], we write the standardized Fisher information (standardized Fisher divergence)

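$$J_{\rm st}(X):=({\rm Var\;}X)\,\mathbb{E}\left(\varrho_{X}(X)+\frac{X}{{\rm Var\;}X}\right)^{2}=({\rm Var\;}X)J(X)-1.$$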

The quantity $J_{\rm st}(X)$ is scale–invariant, and is $({\rm Var\;}X)$ times the quantity sometimes referred to as Fisher divergence or as Fisher information distance. The non-negativity of $J_{\rm st}$ is equivalent to the standard Cramér-Rao lower bound (see for example [33, Eq. (2.1)]), with equality holding if and only if $X$ is Gaussian. Hence, if $J_{\rm st}(X)$ is ‘small’, then intuitively $X$ should be ‘close to Gaussian’. In fact, controlling the standardized Fisher information gives a strong sense of convergence to Gaussian, with control of $J_{\rm st}$ implying control of total variation distance, Hellinger distance and the supremum distance between densities (see [23, Lemma 1.5]) and relative entropy (see [23, p409]). The fact that absolute continuity is a sufficient condition for the existence of Fisher information is discussed for example in [21, Section 4.4].
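The definition can be checked numerically. The following is a minimal sketch, not part of the paper: it approximates $J_{\rm st}(X)=({\rm Var\;}X)J(X)-1$ by a midpoint rule, using a Gaussian (where the value should be $0$) and a $\Gamma(\beta,1)$ law with $\beta=5$. The helper name `standardized_fisher`, the integration ranges and the step count are illustrative assumptions, and the closed form $J=1/(\beta-2)$ for the gamma case is a standard fact assumed here rather than taken from the text.

```python
import math

def standardized_fisher(pdf, score, variance, lo, hi, steps=200_000):
    """Midpoint-rule approximation of J_st(X) = Var(X) * J(X) - 1."""
    h = (hi - lo) / steps
    fisher = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        fisher += pdf(x) * score(x) ** 2 * h  # J(X) = E[score(X)^2]
    return variance * fisher - 1.0

SIGMA = 1.7   # illustrative Gaussian standard deviation
BETA = 5.0    # illustrative gamma shape parameter (needs beta > 2)

def gauss_pdf(x):
    return math.exp(-x * x / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def gauss_score(x):
    return -x / SIGMA**2          # p'(x) / p(x) for N(0, sigma^2)

def gamma_pdf(x):
    return x ** (BETA - 1) * math.exp(-x) / math.gamma(BETA)

def gamma_score(x):
    return (BETA - 1) / x - 1     # p'(x) / p(x) for Gamma(beta, 1)

jst_gauss = standardized_fisher(gauss_pdf, gauss_score, SIGMA**2, -12 * SIGMA, 12 * SIGMA)
jst_gamma = standardized_fisher(gamma_pdf, gamma_score, BETA, 0.0, 80.0)

print(jst_gauss)  # close to 0: equality in the Cramér-Rao bound
print(jst_gamma)  # close to 2 / (BETA - 2), assuming J = 1 / (BETA - 2)
```

The Gaussian value vanishes up to quadrature error, while the gamma value is strictly positive, matching the reading of $J_{\rm st}$ as a distance from Gaussianity.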

We follow Courtade [13] in analysing Fisher information using quantities related to the (Hirschfeld–Gebelein–Rényi) maximal correlation $\rho_{\max}$ [19, 20, 30]. It is well-known that the standard (Pearson) correlation coefficient $\rho_{PCC}(U,V)$ only captures linear relationships between random variables, and hence can be zero even when $U$ and $V$ are dependent. In contrast, the maximal correlation between random variables $U$ , $V$ is the largest correlation between non-constant well-behaved functions of them

$$\rho_{\max}(U,V):=\sup_{f,g}\rho_{PCC}(f(U),g(V)).$$

Like the mutual information, $\rho_{\max}(U,V)$ is zero if and only if $U$ and $V$ are independent, see [19, 20, 30]. The maximal correlation has found application in information theory partly because of its relation to hypercontractivity and the strong data processing constant [2, 24].
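The gap between Pearson and maximal correlation can be seen in a toy case. The sketch below is an illustration under assumed data, not an example from the paper: $U$ is uniform on $\{-1,0,1\}$ and $V=U^{2}$, so the two are dependent but uncorrelated, while the pair $f(u)=u^{2}$, $g(v)=v$ has correlation $1$. The helper `pearson_squared` is a hypothetical name.

```python
from fractions import Fraction

def pearson_squared(atoms):
    """Exact covariance and squared Pearson correlation for atoms (prob, u, v)."""
    eu = sum(p * u for p, u, v in atoms)
    ev = sum(p * v for p, u, v in atoms)
    cov = sum(p * (u - eu) * (v - ev) for p, u, v in atoms)
    var_u = sum(p * (u - eu) ** 2 for p, u, v in atoms)
    var_v = sum(p * (v - ev) ** 2 for p, u, v in atoms)
    return cov, cov * cov / (var_u * var_v)

third = Fraction(1, 3)
# U uniform on {-1, 0, 1}; V = U^2 is a deterministic function of U.
joint = [(third, Fraction(u), Fraction(u * u)) for u in (-1, 0, 1)]

cov_linear, rho2_linear = pearson_squared(joint)

# Apply f(u) = u^2 to the first coordinate and g(v) = v to the second.
transformed = [(p, u * u, v) for p, u, v in joint]
cov_f, rho2_f = pearson_squared(transformed)

print(rho2_linear)  # 0: Pearson correlation misses the dependence
print(rho2_f)       # 1: a nonlinear f recovers it, so rho_max(U, V) = 1
```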

Courtade [13] gave a direct and simple proof of the monotonicity of Fisher information in the Central Limit Theorem regime, using the fact that for i.i.d. $Y_{i}$ the maximal correlation between sums of different sizes satisfies

$$\rho_{\max}(S_{m},S_{n})=\sqrt{\frac{m}{n}}\quad\mbox{for all $m\leq n$}.\tag{3}$$

This fact, which we call the Dembo–Kagan–Shepp (DKS) identity [17], can be understood through an equivalent formulation of $\rho_{\max}$ as the largest non-trivial singular value of conditional expectation operators (see Section 2). This identity was originally proved in [17] under the assumption that $Y_{i}$ have finite variance, a condition subsequently relaxed in [12]. We note that Courtade’s proof [13] of monotonicity via the DKS identity only recovers the result along i.i.d. sequences, which is less general than the ‘leave-one-out’ inequality proved by Artstein, Ball, Barthe and Naor [4]. However, Courtade has subsequently shown that many monotonicity results, including the DKS identity and the general subset inequalities of Madiman and Barron [26] can be seen as immediate consequences of Shearer’s lemma [14].
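One half of the DKS identity is elementary: linear functions already achieve correlation $\sqrt{m/n}$. The sketch below checks this exactly by enumeration for a hypothetical three-point law (a discrete stand-in for the smooth densities assumed in the paper); it does not verify the harder half, namely that no pair of functions does better.

```python
from fractions import Fraction
from itertools import product

# Hypothetical three-point law for Y (value -> probability), chosen for illustration.
LAW = {-1: Fraction(1, 2), 0: Fraction(1, 4), 2: Fraction(1, 4)}

def squared_corr_partial_sums(law, m, n):
    """Exact squared Pearson correlation of S_m and S_n for i.i.d. draws from law."""
    e_m = e_n = e_mm = e_nn = e_mn = Fraction(0)
    for ys in product(law, repeat=n):
        prob = Fraction(1)
        for y in ys:
            prob *= law[y]
        s_m, s_n = sum(ys[:m]), sum(ys)
        e_m += prob * s_m
        e_n += prob * s_n
        e_mm += prob * s_m * s_m
        e_nn += prob * s_n * s_n
        e_mn += prob * s_m * s_n
    cov = e_mn - e_m * e_n
    return cov * cov / ((e_mm - e_m**2) * (e_nn - e_n**2))

rho_sq = squared_corr_partial_sums(LAW, m=2, n=3)
print(rho_sq)  # 2/3, i.e. the Pearson correlation is sqrt(m/n)
```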

In this paper we work with a quantity $\Theta^{(n)}$ defined in terms of the second-largest non-trivial singular value of the same conditional expectation operators, defined in Definition 2.4 below, and satisfying $\Theta^{(n)}\geq 0$ by the Dembo–Kagan–Shepp identity [17]. Under a technical diagonalizability condition (Assumption 1 below, which is assumed to hold throughout) a more detailed analysis of $\Theta^{(n)}$ using the Efron–Stein (ANOVA) decomposition [18] allows us to deduce the following result:

Theorem 1.2.

Consider i.i.d. $Y_{i}\sim Y$ with mean $0$ and variance $\sigma^{2}<\infty$ and smooth densities on $\mathbb{R}$. For any $n$, writing $\Theta^{(2)}$ for the quantity from Definition 2.4 below, we have

$$J_{\rm st}\left(\frac{Y_{1}+\dots+Y_{n}}{\sqrt{n}}\right)\leq\frac{J_{\rm st}(Y)}{1+\Theta^{(2)}(n-1)}.\tag{4}$$

In other words, if $\Theta^{(2)}>0$ then we achieve an $O(1/n)$ convergence rate of standardized Fisher information.

Theorem 1.2 follows directly by combining Propositions 4.1 and 5.2 below. Note that Artstein, Ball, Barthe and Naor [3] and Johnson and Barron [23] both proved an $O(1/n)$ rate of convergence of standardized Fisher information (and hence of relative entropy) for one-dimensional random variables assuming finiteness of the Poincaré constant (this was extended to the $\mathbb{R}^{d}$ case by [5] under a stronger assumption of log-concavity). However, since finiteness of the Poincaré constant implies $\Theta^{(2)}>0$ (see Lemma 3.6 below), we can regard our condition as weaker. As with Poincaré constants, the positivity condition $\Theta^{(2)}>0$ implies finiteness of moments of all orders (see Proposition 3.5 below). However, unlike finiteness of Poincaré constants, positivity of $\Theta^{(2)}$ does not directly require that the support of $Y$ is connected.

To illustrate the relationship between moments and $J_{\rm st}$ , we further prove a lower bound on the Fisher information which tightens the lower bound of [23, Lemma 1.4], and which complements the upper bound in Theorem 1.2:

Lemma 1.3.

For i.i.d. $Y_{1},\ldots,Y_{n}\sim Y$ the standardized Fisher information satisfies

$$J_{\rm st}(U_{n})\geq\frac{\gamma_{3}^{2}}{\Sigma+2(n-1)},\tag{5}$$

where $\gamma_{3}=\mathbb{E}Y^{3}/\sigma^{3}$ is the skewness of $Y$ and $\Sigma=\mathbb{E}Y^{4}/\sigma^{4}-(\mathbb{E}Y^{3}/\sigma^{3})^{2}-1\geq 0$ .

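The upper and lower bounds on $J_{\rm st}(U_{n})$ given by (4) and (5) are compatible in the sense that (since $\Theta^{(2)}\leq\frac{2}{\Sigma}$ by (20)) we know

$$\frac{J_{\rm st}(Y)}{1+\Theta^{(2)}(n-1)}\geq\frac{J_{\rm st}(Y)}{1+(2/\Sigma)(n-1)}=\frac{\Sigma J_{\rm st}(Y)}{\Sigma+2(n-1)}\geq\frac{\gamma_{3}^{2}}{\Sigma+2(n-1)},$$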

where the final inequality simply follows from the case $n=1$ of (5).
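To see the two bounds at work, the gamma family gives a case where everything is explicit. The sketch below is illustrative only and rests on stated assumptions: it takes $\Theta^{(2)}=\beta/(\beta+1)$ and $\Sigma=2+2/\beta$ from the gamma example later in the paper (Example 3.2 and Remark 3.4), and it additionally assumes the standard facts that a sum of $n$ independent $\Gamma(\beta,1)$ variables is $\Gamma(n\beta,1)$ and that $J(\Gamma(\beta,1))=1/(\beta-2)$ for $\beta>2$. The function name `gamma_bounds` is hypothetical.

```python
from fractions import Fraction

def gamma_bounds(beta, n):
    """Lower bound (5), exact J_st(U_n), and upper bound (4) for Y ~ Gamma(beta, 1)."""
    beta = Fraction(beta)
    jst_y = 2 / (beta - 2)          # J_st(Y) = beta * J(Y) - 1, assuming J(Y) = 1/(beta - 2)
    theta2 = beta / (beta + 1)      # Theta^(2) as reported for the gamma case
    skew_sq = 4 / beta              # squared skewness gamma_3^2
    big_sigma = 2 + 2 / beta        # Sigma = kurtosis - skewness^2 - 1
    lower = skew_sq / (big_sigma + 2 * (n - 1))
    exact = 2 / (n * beta - 2)      # S_n ~ Gamma(n * beta, 1); J_st is scale-invariant
    upper = jst_y / (1 + theta2 * (n - 1))
    return lower, exact, upper

for n in (1, 2, 10, 100):
    lower, exact, upper = gamma_bounds(5, n)
    print(n, float(lower), float(exact), float(upper))
```

All three quantities decay like $1/n$, with the exact value sandwiched between the lower and upper bounds, and the upper bound is attained at $n=1$.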

The need for finiteness of the Poincaré constant to ensure $O(1/n)$ convergence of Fisher information and of relative entropy was removed in subsequent work of Bobkov, Chistyakov and Götze (see for example [9] for Fisher information and [7, 8] for relative entropy). These papers proved this rate of convergence under the assumption of finite fourth moment, as well as a variety of related results under a moment-matching assumption. Note that (by Lemma 3.3 below) if the fourth moment is infinite, our methods do not give $O(1/n)$ convergence, so our results should be regarded as weaker. However, papers [7, 8] used a detailed argument involving Edgeworth expansions, truncation of densities and analysis of the characteristic function to derive their results. We believe our results are obtained in a more straightforward way, and the connection to maximal correlation in this context may be of independent interest. Further, we prove a novel strengthened form of monotonicity, Theorem 6.3, which places monotonicity and convergence results in the same framework, whereas they have often historically been treated separately.

An alternative perspective was provided by Courtade, Fathi and Pananjady [15], who weakened the Poincaré constant assumption to require only the existence of a Stein kernel $\tau$ (which holds for any centered random variable with connected support). Using this, they proved an $O(1/n)$ rate of convergence in Wasserstein distance and an $O(\log n/n)$ rate of convergence in relative entropy, with the speed of convergence being dictated by the Stein discrepancy (squared distance from the Stein kernel $\tau$ to the identity). This work has the considerable advantage of holding in more general spaces $\mathbb{R}^{d}$ for $d\geq 1$ . It would be of interest to understand the relationship between our $\Theta^{(2)}>0$ condition and the Stein condition of [15].

The problem of proving information–theoretic versions of the Central Limit Theorem is a long-standing one, the early history of which is reviewed in [22]. In particular, we mention work of Linnik [25] and Shimizu [31]. However, our work follows the idea of studying projections of score functions, and follows a path first set out by Stam [33], Brown [11], Barron [6], as well as exploiting subsequent developments. In particular, the analysis of [23] exploited the fact that in the limit the score function of the limit must simultaneously be both a ridge function (a function $f(x_{1}+\ldots+x_{n})$ ) and close to being the sum $f_{1}(x_{1})+\ldots+f_{n}(x_{n})$ , and hence must be close to being linear.

This analysis generalized a key step in the work of Brown (and later in Barron [6]), which was an inequality [11, Lemma 3.1] concerning properties of Hermite polynomials, which are orthogonal in the Gaussian case. Our work can be seen as giving an alternative generalization of this, using an orthogonal function expansion based on the Singular Value Decomposition. The link between these two ideas is the fact that the Hermite polynomials provide the Singular Value Decomposition of conditional expectations in the Gaussian case (see [28, Theorem 3] and Example 3.1).

The structure of the remainder of the paper is as follows: in Section 2 we formally define the conditional expectation operators and the eigenvalue–related quantity $\Theta^{(n)}$ . In Section 3 we give examples where we can calculate $\Theta^{(n)}$ explicitly, discuss properties of $\Theta^{(n)}$ and show how it relates to other quantities. In Section 4 we discuss how standard results allow us to control the value of the standardized Fisher information $J_{\rm st}$ on convolution, in terms of $\Theta^{(n)}$ . In Section 5 we discuss how to control higher order terms in the Dembo–Kagan–Shepp argument, and hence bound $\Theta^{(n)}$ in terms of $\Theta^{(2)}$ . In Section 6 we show how these arguments imply a stronger form of monotonicity of Fisher information. We conclude with some suggestions for future work in Section 7.

2 Conditional expectation operator definitions

We introduce notation based on [28]. For any probability measure $\mathbb{P}$ we write $L^{2}(\mathbb{P})$ for the Hilbert space endowed with inner product $\langle f,g\rangle_{\mathbb{P}}=\int f(x)g(x)d\mathbb{P}(x)$ . Write $\mathbb{P}_{Y}$ and $\mathbb{P}_{S_{n}}$ for the law of the relevant random variables, where as before $S_{n}=Y_{1}+\ldots+Y_{n}$ .

Definition 2.1.

Define conditional expectation operator $C_{(n)}:L^{2}(\mathbb{P}_{Y})\mapsto L^{2}(\mathbb{P}_{S_{n}})$ and its adjoint $C_{(n)}^{*}:L^{2}(\mathbb{P}_{S_{n}})\mapsto L^{2}(\mathbb{P}_{Y})$ by:

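$$(C_{(n)}f)(s):=\mathbb{E}\left[f(Y_{1})\mid S_{n}=s\right],\qquad(C_{(n)}^{*}g)(y):=\mathbb{E}\left[g(S_{n})\mid Y_{1}=y\right]=\mathbb{E}\,g(y+Y_{2}+\dots+Y_{n}).$$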

These maps are adjoint in the sense that (by direct calculation, or the tower law) for all $f$ and $g$ :

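$$\langle g,C_{(n)}f\rangle_{\mathbb{P}_{S_{n}}}=\langle C_{(n)}^{*}g,f\rangle_{\mathbb{P}_{Y}}=\mathbb{E}\left[f(Y_{i})g(S_{n})\right].\tag{8}$$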

Assumption 1.

We assume throughout this paper that the self-adjoint map $C_{(n)}^{*}C_{(n)}$ is diagonalizable.

Definition 2.2.

Under Assumption 1 write $(f^{(1)}_{k})_{k=\{0,1,\ldots\}}\in L^{2}(\mathbb{P}_{Y})$ for the basis of orthonormal eigenfunctions of $C_{(n)}^{*}C_{(n)}$ , with corresponding eigenvalues $\lambda^{(n)}_{k}$ and singular values $\mu^{(n)}_{k}=\sqrt{\lambda^{(n)}_{k}}$ . Here, without loss of generality, we assume that

$$1=\lambda^{(n)}_{0}\geq\lambda^{(n)}_{1}\geq\dots\geq 0.$$

We write $g^{(n)}_{k}=(C_{(n)}f^{(1)}_{k})/\mu^{(n)}_{k}$ for the scaled images of these eigenfunctions.

Remark 2.3.

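1. Note that by (8) the functions $g^{(n)}_{k}$ are orthonormal in $L^{2}(\mathbb{P}_{S_{n}})$. Further, note that

$$C_{(n)}^{*}(g^{(n)}_{k})=\frac{1}{\mu^{(n)}_{k}}\left(C_{(n)}^{*}C_{(n)}f^{(1)}_{k}\right)=\mu^{(n)}_{k}f^{(1)}_{k}.$$

2. Note that $f^{(1)}_{0}=g^{(n)}_{0}\equiv 1$ and the pair $(f^{(1)}_{1},g^{(n)}_{1})$ achieves the maximum correlation since by (8) we know

$$\mathbb{E}\left(f^{(1)}_{1}(Y_{i})g^{(n)}_{1}(S_{n})\right)=\langle g^{(n)}_{1},C_{(n)}f^{(1)}_{1}\rangle_{\mathbb{P}_{S_{n}}}=\langle g^{(n)}_{1},\mu^{(n)}_{1}g^{(n)}_{1}\rangle_{\mathbb{P}_{S_{n}}}=\mu^{(n)}_{1}.$$

3. In this i.i.d. case, we can take $f^{(1)}_{1}(y)=y/\sigma$, $g^{(n)}_{1}(s)=s/(\sigma\sqrt{n})$ with $\mu^{(n)}_{1}=1/\sqrt{n}$. This choice of functions has the relevant properties since by symmetry (or the fact that averages of i.i.d. random variables form a reverse martingale)

$$(C_{(n)}f^{(1)}_{1})(s)=\frac{1}{\sigma}\mathbb{E}(Y_{1}\mid S_{n}=s)=\frac{s}{\sigma n}=\mu^{(n)}_{1}g^{(n)}_{1}(s),$$

and

$$(C_{(n)}^{*}g^{(n)}_{1})(y)=\frac{1}{\sigma\sqrt{n}}\mathbb{E}(y+Y_{2}+\dots+Y_{n})=\frac{1}{\sigma\sqrt{n}}y=\mu^{(n)}_{1}f^{(1)}_{1}(y).$$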

The DKS identity (3) tells us that no larger value of $\mu^{(n)}_{1}$ is possible.

The focus of this paper will be the quantity $\Theta^{(n)}$ defined in terms of the second-highest non-trivial eigenvalue of the self-adjoint map $C_{(n)}^{*}C_{(n)}$ as:

Definition 2.4.

Using the notation above, write

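$$\Theta^{(n)}:=\frac{1}{n\lambda^{(n)}_{2}}-1=\inf_{h:\,\mathbb{E}h(S_{n})=\mathbb{E}S_{n}h(S_{n})=0}\frac{1}{n}\frac{\mathbb{E}\left(h(S_{n})^{2}\right)}{\mathbb{E}\left((C_{(n)}^{*}h)(Y)^{2}\right)}-1.$$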

The Dembo–Kagan–Shepp identity [17] means that for $k\geq 2$ , eigenvalues $\lambda^{(n)}_{k}$ are $\leq 1/n$ , which ensures that $\Theta^{(n)}\geq 0$ . While we are not aware of existing results in the literature that bound $\lambda^{(n)}_{k}$ for $k\geq 2$ , we remark that the higher order eigenfunctions $f^{(1)}_{k}$ and $g^{(n)}_{k}$ (for $1\leq k\leq K$ , for some fixed $K$ ) have been used in a manner similar to Principal Components Analysis to capture significant high-order features of datasets [27].

One possible strategy to show that $\Theta^{(2)}>0$ is to show that $C_{(n)}$ and $C_{(n)}^{*}$ are compact operators (recall from e.g. [32, Section 3.1] that a compact linear operator is one for which the image of any bounded subset has compact closure). The Riesz–Schauder Theorem [32, Theorem 3.3.1] states that the only possible accumulation point of eigenvalues of a compact operator is at 0, so if the eigenspace corresponding to $\lambda=1/2$ has dimension 1 then we can deduce $\lambda^{(2)}_{2}<1/2$ . We consider the second point in Remark 5.5 below, and discuss the question of compactness now.

This compactness is stated as [10, Assumption 5.2], which states that it ‘is satisfied in most cases of interest’ and in particular if a sufficient condition [10, Eq. (5.4)] holds; we derive this condition here for completeness. As in [28, Eq. (40)] we can expand the Radon-Nikodym derivative between joint and marginal densities using the Singular Value Decomposition as:

$$\tau_{n}(y,s):=\frac{p_{Y_{1},S_{n}}(y,s)}{p_{Y_{1}}(y)p_{S_{n}}(s)}=\sum_{k=0}^{\infty}\mu^{(n)}_{k}f^{(1)}_{k}(y)g^{(n)}_{k}(s).$$

Note that $(C_{(n)}f)(s)=\int\tau_{n}(z,s)p_{Y}(z)f(z)dz$ and $(C_{(n)}^{*}g)(y)=\int\tau_{n}(y,s)p_{S_{n}}(s)g(s)ds$ , so

$$(C_{(n)}^{*}C_{(n)}f)(y)=\int f(z)p_{Y}(z)L_{n}(z,y)dz,$$

where $L_{n}(z,y):=\int p_{S_{n}}(s)\tau_{n}(z,s)\tau_{n}(y,s)ds$ is symmetric, as expected. Then $(C_{(n)}^{*}C_{(n)})$ is compact if it is a trace-class operator (see [32, Section 3.6]); in other words, it suffices to use Mercer’s Theorem ([32, Theorem 3.11.9]) to verify that

$$T_{n}(Y):=\int p_{Y}(y)L_{n}(y,y)dy=\iint p_{Y}(y)p_{S_{n}}(s)\tau_{n}(y,s)^{2}dyds<\infty$$

(this is [10, Eq. (5.4)]).

Note that this quantity $T_{n}(Y)$ has the property that $T_{n}(Y)-1=D_{\chi^{2}}(p_{Y_{1},S_{n}}\|p_{Y}\times p_{S_{n}})$, where $Y$ is an independent copy of $Y_{1}$ and we write $D_{\chi^{2}}(f\|g)=\int(f(x)/g(x)-1)^{2}g(x)dx=\int f(x)^{2}/g(x)dx-1$ for the $\chi^{2}$-divergence. Using this, an anonymous referee provided a proof of the following theorem, which shows that arbitrarily small Gaussian regularizations of sub-Gaussian random variables have the trace-class property:

Theorem 2.5.

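For any $\delta>0$, taking $Z$ Gaussian with mean $0$ and variance $\delta^{2}$ and $Y\sim X+Z$, then writing $X^{\prime}$ for an independent copy of $X$:

$$D_{\chi^{2}}(p_{Y_{1},S_{n}}\|p_{Y}\times p_{S_{n}})\leq\frac{1}{1-1/n}\,\mathbb{E}\left(\exp\left(\frac{(X_{1}-X_{1}^{\prime})^{2}}{(n-1)\delta^{2}}\right)\right).$$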

Hence if $X$ is sub-Gaussian then for $n$ sufficiently large $T_{n}(Y)<\infty$ and hence $(C_{(n)}^{*}C_{(n)})$ is compact.

Proof.

See Appendix A. ∎

We briefly mention that by linearizing the logarithm, we can bound the mutual information $I(Y_{1};S_{n})=D(P_{Y_{1},S_{n}}\|P_{Y_{1}}P_{S_{n}})=\int p_{Y_{1}}(y)p_{S_{n}}(s)\tau_{n}(y,s)\ln\left(\tau_{n}(y,s)\right)dyds\leq T_{n}(Y)-1$. Further, for $n=2$, $I(Y_{1};S_{n})=H(Y_{1})+H(S_{n})-H(Y_{1},S_{n})=H(Y_{1})+H(S_{n})-H(Y_{1},Y_{2})=H(S_{n})-H(Y)\geq(\log 2)/2$, where the lower bound on $H(S_{n})-H(Y)$ follows by Shannon’s Entropy Power Inequality. In other words, finiteness of $T_{2}:=T_{2}(Y)$ ensures a reverse Entropy Power Inequality of the form $\exp(2h(S_{n}))\leq C\exp(2h(Y))$, where $C=\exp(2(T_{2}-1))$.

Example 2.6.

In the case where $Y\sim N(0,1)$, we can explicitly write down $\tau_{2}(y,s)=\sqrt{2}\exp\left(-\frac{1}{4}(s^{2}-4sy+2y^{2})\right)$, and direct calculation gives that

$$T_{2}=\iint p_{Y}(y)p_{S_{2}}(s)\tau_{2}(y,s)^{2}dyds=2.$$

This confirms the values in Example 3.1 below, which gives that the eigenvalues are $\lambda^{(2)}_{k}=2^{-k}$ , and so $\sum_{k=0}^{\infty}\lambda^{(2)}_{k}=2$ , confirming the value of the trace by Lidskii’s Theorem [32, Corollary 3.12.3].
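The same trace argument can be run in a finite setting. The sketch below is an illustration under assumptions not made in the paper: $Y$ is uniform on $\{0,1,2\}$ (a discrete law instead of a smooth density), so $L^{2}(\mathbb{P}_{Y})$ is three-dimensional and $C_{(2)}^{*}C_{(2)}$ is a $3\times 3$ matrix whose trace is the discrete analogue of $T_{2}$. Since the constant and linear functions account for the eigenvalues $1$ and $1/2$, the remaining eigenvalue $\lambda^{(2)}_{2}$ can be read off from the trace. The name `trace_t2` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def trace_t2(law):
    """Discrete analogue of T_2(Y): sum over (y, s) of P(y, s)^2 / (P(y) P(s))."""
    p_s = {}
    for y1, y2 in product(law, repeat=2):
        p_s[y1 + y2] = p_s.get(y1 + y2, Fraction(0)) + law[y1] * law[y2]
    total = Fraction(0)
    for y1, y2 in product(law, repeat=2):
        joint = law[y1] * law[y2]          # P(Y_1 = y1, S_2 = y1 + y2)
        total += joint**2 / (law[y1] * p_s[y1 + y2])
    return total

uniform3 = {y: Fraction(1, 3) for y in (0, 1, 2)}   # assumed toy law
t2 = trace_t2(uniform3)
lambda2 = t2 - 1 - Fraction(1, 2)   # trace = lambda_0 + lambda_1 + lambda_2 = 1 + 1/2 + lambda_2
theta2 = 1 / (2 * lambda2) - 1      # Theta^(2) = 1 / (2 * lambda_2) - 1

print(t2, lambda2, theta2)  # 5/3 1/6 2
```

Here $\lambda^{(2)}_{2}=1/6$ is strictly below $1/2$, so $\Theta^{(2)}=2>0$ in this toy case.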

Remark 2.7.

This formulation gives an alternative proof of the Dembo–Kagan–Shepp identity for $n=2$, using the fact that $\tau_{2}(z,s)p_{Y}(z)=p_{Y}(z)p_{Y}(s-z)/p_{S_{2}}(s)=\tau_{2}(s-z,s)p_{Y}(s-z)$. Fix $s$ and for a function $f(z)$, write $\overline{f}(z)=f(s-z)$. Then

$$C_{(n)}\overline{f}(s)=\int\tau_{2}(z,s)p_{Y}(z)\overline{f}(z)dz=\int\tau_{2}(s-z,s)p_{Y}(s-z)f(s-z)dz=C_{(n)}f(s).$$

Hence, for any $f$ with $\int f(z)p_{Y}(z)dz=0$ , Cauchy-Schwarz gives

[TABLE]

since $\int p_{Y}(z)\tau_{2}(z,s)dz\equiv 1$ . The result follows on multiplying by $p_{S}(s)$ and integrating, to deduce that $4\mathbb{E}(C_{(n)}f(S))^{2}\leq 2\mathbb{E}f(Y)^{2}$ , or $\lambda^{(2)}_{2}\leq 1/2$ .

Note that (see also Remark 5.5 below) equality holds in (2.7) if and only if $f(z)+\overline{f}(z)$ is constant in $z$. Taking a derivative with respect to $z$, we deduce that $f^{\prime}$ must be constant, so that a linear $f$ is the unique eigenfunction achieving $\lambda=1/2$.

3 Conditional expectation operator properties

We now review two examples where we can explicitly calculate the eigenfunctions and eigenvalues of $C_{(n)}^{*}C_{(n)}$ , using properties of orthogonal polynomials [1], and hence deduce the value of $\Theta^{(n)}$ . Instead of orthogonal polynomials, these calculations can alternatively be performed using properties of the associated semigroups (Ornstein–Uhlenbeck and Laguerre semigroups, respectively). First, the Gaussian case (see also [28]):

Example 3.1.

If $Y_{i}$ are Gaussian with variance $\sigma^{2}$ , then $f^{(1)}$ and $g^{(n)}$ are orthonormalized Hermite polynomials. For any $\alpha$ we define $H_{n}^{(\alpha)}(x)=H_{n}(x/\sqrt{\alpha})$ (where $H_{n}$ are the Hermite polynomials, which are orthogonal with respect to standard Gaussian weights). By adapting the addition formula [1, Eq. (22.12.8)] or by direct calculation using the generating function we know that for any $\tau^{2}$ , $n$ and $m$ :

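$$H_{m}^{(n\tau^{2})}(x+y)=\sum_{k=0}^{m}\binom{m}{k}\left(\frac{n-1}{n}\right)^{k/2}\left(\frac{1}{n}\right)^{(m-k)/2}H_{m-k}^{(\tau^{2})}(x)H_{k}^{((n-1)\tau^{2})}(y).\tag{15}$$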

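Taking $\tau^{2}=\sigma^{2}$ in (15), and since for $Z$ Gaussian with mean $0$ and variance $\sigma^{2}(n-1)$ we know $\mathbb{E}H_{k}^{((n-1)\sigma^{2})}(Z)=0$ for $k\geq 1$, we can deduce that

$$C_{(n)}^{*}H_{m}^{(n\sigma^{2})}(x)=\mathbb{E}H_{m}^{(n\sigma^{2})}(x+Z)=\frac{1}{n^{m/2}}H_{m}^{(\sigma^{2})}(x).$$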

Taking $f^{(1)}_{k}=H_{k}^{(\sigma^{2})}/\sqrt{k!}$ with $g^{(n)}_{k}=H_{k}^{(n\sigma^{2})}/\sqrt{k!}$ and $\mu^{(n)}_{k}=n^{-k/2}$ we have $C_{(n)}^{*}g^{(n)}_{k}=\mu^{(n)}_{k}f^{(1)}_{k}$ as required.
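As a quick sanity check of this eigenvalue relation, the case $m=2$ can be worked by hand. This sketch assumes the probabilists' normalization $H_{2}(u)=u^{2}-1$ (consistent with orthogonality under standard Gaussian weights, though the normalization is not spelled out above), so that $H_{2}^{(\alpha)}(x)=x^{2}/\alpha-1$. With $Z$ Gaussian with mean $0$ and variance $(n-1)\sigma^{2}$:

$$\mathbb{E}H_{2}^{(n\sigma^{2})}(x+Z)=\frac{x^{2}+(n-1)\sigma^{2}}{n\sigma^{2}}-1=\frac{1}{n}\left(\frac{x^{2}}{\sigma^{2}}-1\right)=\frac{1}{n}H_{2}^{(\sigma^{2})}(x).$$

Under that assumption $\mu^{(n)}_{2}=1/n$, in line with $\mu^{(n)}_{k}=n^{-k/2}$, and hence $\lambda^{(n)}_{2}=1/n^{2}$.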

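For completeness, the property that $C_{(n)}f^{(1)}_{k}=\mu^{(n)}_{k}g^{(n)}_{k}$ follows since for fixed $s$ we have $Y|(S_{n}=s)\sim s/n+\widetilde{Z}$, where $\widetilde{Z}$ is Gaussian with mean $0$ and variance $(n-1)\sigma^{2}/n$. Hence taking $\tau^{2}=\sigma^{2}/n$ in the addition formula (15) we obtain

$$C_{(n)}H_{m}^{(\sigma^{2})}(s)=\mathbb{E}H_{m}^{(\sigma^{2})}(s/n+\widetilde{Z})=\frac{1}{n^{m/2}}H_{m}^{(\sigma^{2}/n)}(s/n)=\frac{1}{n^{m/2}}H_{m}^{(n\sigma^{2})}(s),$$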

where the final identity follows by definition of $H_{m}^{(\alpha)}$ .

We deduce that $\lambda^{(n)}_{2}=1/n^{2}$ and so $\Theta^{(n)}=n-1$ , with $\Theta^{(2)}=1$ in particular.

Next, we give a similar argument in the gamma distributed case. Note that although the $Y_{i}$ do not have mean 0, the argument carries through essentially unchanged on centering.

Example 3.2.

If $Y_{i}$ are $\Gamma(\beta,1)$ distributed then, writing $L^{(\alpha)}$ for the generalized Laguerre polynomials (orthogonal with respect to $\Gamma(\alpha+1,1)$ ), a similar addition formula [1, Eq. (22.12.6)] holds:

$$L_{m}^{(\alpha+\beta+1)}(x+y)=\sum_{i=0}^{m}L_{i}^{(\alpha)}(x)L_{m-i}^{(\beta)}(y).$$

For $f^{(1)}_{k}=L^{(\beta-1)}_{k}/\sqrt{\binom{k+\beta-1}{k}}$ with $g^{(n)}_{k}=L^{(\beta n-1)}_{k}/\sqrt{\binom{k+\beta n-1}{k}}$ and $\mu^{(n)}_{k}=\sqrt{\binom{k+\beta-1}{k}/\binom{k+\beta n-1}{k}}$ we deduce $C_{(n)}^{*}g^{(n)}_{k}=\mu^{(n)}_{k}f^{(1)}_{k}$ as required.

The property that $C_{(n)}f^{(1)}_{k}=\mu^{(n)}_{k}g^{(n)}_{k}$ follows by expressing the conditional density of $Y|S_{n}$ in terms of a beta function and using [1, Eq. (22.13.13)]:

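$$\Gamma(\beta n+k)\int_{0}^{z}(z-y)^{\beta(n-1)-1}y^{\beta-1}L_{k}^{(\beta-1)}(y)dy=\Gamma(\beta+k)\Gamma(\beta(n-1))z^{\beta n-1}L_{k}^{(\beta n-1)}(z),$$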

to deduce that $C_{(n)}L_{k}^{(\beta-1)}(s)=\left(\mu^{(n)}_{k}\right)^{2}L_{k}^{(\beta n-1)}(s)$ , and rescaling.

Hence $\lambda^{(n)}_{2}=\binom{\beta+1}{2}/\binom{\beta n+1}{2}=(\beta+1)/(n(\beta n+1))$ and so $\Theta^{(n)}=\beta(n-1)/(\beta+1)$ , with $\Theta^{(2)}=\beta/(\beta+1)$ in particular.

Note that (as we might expect) the larger the value of $\beta$ , the closer the value of $\Theta^{(2)}=\beta/(\beta+1)$ obtained in Example 3.2 becomes to the value $\Theta^{(2)}=1$ obtained for the Gaussian case in Example 3.1.

Next, motivated by the fact that in both the Gaussian and gamma cases the eigenfunction $f^{(1)}_{2}$ is quadratic, we use properties of quadratic functions to deduce an upper bound on $\Theta^{(n)}$ involving third and fourth moments.

Lemma 3.3.

For $Y_{i}$ i.i.d. $\sim Y$ with mean 0 and variance $\sigma^{2}$ , define the scale-invariant quantity $\Sigma=\mathbb{E}Y^{4}/\sigma^{4}-(\mathbb{E}Y^{3}/\sigma^{3})^{2}-1\geq 0$ (kurtosis minus squared skewness minus $1$ ), which does not depend on $n$ . Then

$$\Theta^{(n)}\leq\frac{2(n-1)}{\Sigma},\tag{19}$$

In particular, taking $n=2$ in (19) we deduce

$$\Theta^{(2)}\leq\frac{2}{\Sigma}.\tag{20}$$

Proof.

Consider the function $h(s)=s^{2}-as-n\sigma^{2}$ , where taking $a=\mathbb{E}Y^{3}/\sigma^{2}$ ensures that $\mathbb{E}h(S_{n})S_{n}=\mathbb{E}S_{n}^{3}-a\mathbb{E}S_{n}^{2}=n\mathbb{E}Y^{3}-an\sigma^{2}=0$ as required. Direct calculation shows that $C_{(n)}^{*}h(y)=y^{2}-ay-\sigma^{2}$ . Further, expanding the square we can show $\mathbb{E}h(S_{n})^{2}=n\sigma^{4}\Sigma+2n(n-1)\sigma^{4}$ and $\mathbb{E}\left((C_{(n)}^{*}h)(Y)^{2}\right)=\sigma^{4}\Sigma$ (this expression as the expectation of a square ensures that $\Sigma\geq 0$ holds). Since it is expressed as an infimum over all functions,

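$$\Theta^{(n)}\leq\frac{\mathbb{E}h(S_{n})^{2}}{n\mathbb{E}(C_{(n)}^{*}h)(Y)^{2}}-1=\frac{2(n-1)}{\Sigma},$$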

as required. ∎
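The two second-moment identities used in this proof can be checked exactly by enumeration. The sketch below does so for a hypothetical mean-zero three-point law (a discrete stand-in, chosen only so that the arithmetic is exact; a two-point law would give $\Sigma=0$) with $n=3$. All names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical mean-zero law for Y (value -> probability).
LAW = {-1: Fraction(1, 2), 0: Fraction(1, 4), 2: Fraction(1, 4)}

def moment(k):
    return sum(p * y**k for y, p in LAW.items())

var, m3, m4 = moment(2), moment(3), moment(4)
big_sigma = m4 / var**2 - m3**2 / var**3 - 1   # Sigma = kurtosis - skewness^2 - 1
a = m3 / var                                   # makes E[h(S_n) S_n] = 0

def second_moment_h(n):
    """E[h(S_n)^2] with h(s) = s^2 - a s - n sigma^2, by full enumeration."""
    total = Fraction(0)
    for ys in product(LAW, repeat=n):
        prob = Fraction(1)
        for y in ys:
            prob *= LAW[y]
        s = sum(ys)
        total += prob * (s * s - a * s - n * var) ** 2
    return total

def second_moment_adjoint():
    """E[(C* h)(Y)^2], using (C* h)(y) = y^2 - a y - sigma^2."""
    return sum(p * (y * y - a * y - var) ** 2 for y, p in LAW.items())

n = 3
ratio_bound = second_moment_h(n) / (n * second_moment_adjoint()) - 1
print(big_sigma, ratio_bound)  # 1/3 12
```

For this law $\Sigma=1/3$, and the ratio evaluates to $12=2(n-1)/\Sigma$, matching the final display of the proof.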

Remark 3.4.

We observe that:

1. Equation (20) shows that if $\Theta^{(2)}>0$ then $\Sigma<\infty$. Equivalently if $\Sigma=\infty$, we know $\Theta^{(2)}=0$ (and the Poincaré constant is infinite).

2. Note that the values of $\Theta^{(2)}$ found in Examples 3.1 and 3.2 both satisfy (20) with equality, because the relevant eigenfunction $f^{(1)}_{2}$ is quadratic. In the Gaussian case Example 3.1, $\Sigma=3-0-1=2$, consistent with the value $\Theta^{(2)}=1$. In the gamma case Example 3.2, $\Sigma=(3+6/\beta)-4/\beta-1=2+2/\beta$, consistent with the value $\Theta^{(2)}=\beta/(\beta+1)$.

3. Note also that (20) means that if $\Sigma>2$ (which, roughly speaking, corresponds to $Y$ having heavier tails than the Gaussian) then by (19) we have $\Theta^{(2)}<1$ (smaller than the value in the Gaussian case, Example 3.1).
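The equality cases in this remark can be reproduced from raw moments. The sketch below is illustrative: it computes $\Sigma$ from the raw moments of the standard Gaussian and of $\Gamma(\beta,1)$ (using the standard recursion $\mathbb{E}Y^{k+1}=(\beta+k)\mathbb{E}Y^{k}$, an assumed fact not stated in the text) and compares $2/\Sigma$ with the values of $\Theta^{(2)}$ quoted from Examples 3.1 and 3.2. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def big_sigma_from_raw(raw):
    """Sigma = kurtosis - skewness^2 - 1 from raw moments raw[k] = E[Y^k], k = 0..4."""
    mean = raw[1]
    # Central moments via the binomial expansion of (Y - mean)^k.
    central = [
        sum(comb(k, j) * raw[j] * (-mean) ** (k - j) for j in range(k + 1))
        for k in range(5)
    ]
    var, c3, c4 = central[2], central[3], central[4]
    return c4 / var**2 - c3**2 / var**3 - 1

def gamma_raw(beta):
    """Raw moments of Gamma(beta, 1): E[Y^(k+1)] = (beta + k) * E[Y^k]."""
    raw = [Fraction(1)]
    for k in range(4):
        raw.append(raw[-1] * (beta + k))
    return raw

gaussian_raw = [Fraction(v) for v in (1, 0, 1, 0, 3)]   # standard normal moments

print(2 / big_sigma_from_raw(gaussian_raw))             # 1
print(2 / big_sigma_from_raw(gamma_raw(Fraction(3))))   # 3/4
```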

Indeed, we can prove similar (if more involved) bounds which show that positivity of $\Theta^{(2)}$ implies finiteness of all moments. As before, the following proposition implies that if $\Theta^{(2)}>0$ and the $(2k-2)$th moment of $Y$ is finite then the $(2k)$th moment of $Y$ must be finite.

Proposition 3.5.

Writing $m_{k}=\mathbb{E}Y^{k}$ for the $k$ th moment of $Y$ and $\sigma^{2}$ for its variance, there exist functions $B_{1,k}$ and $B_{2,k}$ (depending on moments of lower orders) such that

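$$\Theta^{(2)}\leq\frac{\sigma^{2}m_{2k-2}+B_{1,k}(m_{1},\dots,m_{2k-1})}{2m_{2k}+B_{2,k}(m_{1},\dots,m_{2k-2})}.$$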

Proof.

Write

[TABLE]

for the $k$ th moment of $S_{2}$ . As in Lemma 3.3, consider the function $h(s)=s^{k}-as-M_{k}$ , where taking $a=M_{k+1}/(2\sigma^{2})$ ensures that $\mathbb{E}h(S_{2})S_{2}=0$ . Using (21) we can expand

[TABLE]

Substituting $y_{2}=Y_{2}$ and taking expectations, we deduce that

[TABLE]

meaning that we can rewrite (22) as

[TABLE]

where

[TABLE]

Since by construction $\mathbb{E}C_{(n)}^{*}h(Y)=0$ , we can deduce by independence of $Y_{1}$ and $Y_{2}$ that the cross terms vanish so that

[TABLE]

so that

[TABLE]

We deduce the result using (24) and the facts that $\mathbb{E}h(Y_{1}+Y_{2})^{2}=M_{2k}-M_{k}^{2}-M_{k+1}^{2}/\sigma^{2}$ and

[TABLE]

∎

Lemma 3.6.

Assuming $J(Y)<\infty$ , the finiteness of the Poincaré constant $C_{P}:=C_{P}(Y)$ implies that $\Theta^{(2)}>0$ . Indeed:

[TABLE]

Proof.

We can deduce this using [23, Proposition 2.1] which, for $Y_{1}$ and $Y_{2}$ i.i.d., gives that for any $f$ with $\mathbb{E}f(S_{2})=\mathbb{E}f(S_{2})S_{2}=0$ and taking $g(u)=\mathbb{E}f(u+Y)$ then

[TABLE]

for some $\mu$ , $\nu$ . The proof of [23, Proposition 2.1] states that $\nu=\mu\mathbb{E}Y_{1}=0$ . Further, by symmetry, the condition $\mathbb{E}f(S_{2})S_{2}=0$ implies that $0=\mathbb{E}f(Y_{1}+Y_{2})Y_{1}=\mathbb{E}g(Y_{1})Y_{1}$ , so the RHS of (26) is $\geq\frac{1}{J(Y)C_{P}}\mathbb{E}\left(g(Y)\right)^{2}$ . Rearranging, we deduce that

[TABLE]

and the result follows on rearranging. ∎

4 Behaviour of the Fisher information on convolution

We now consider how the standardized Fisher information behaves on convolution, under a standard Central Limit Theorem scaling. That is, as in [13], we write $U_{n}=S_{n}/\sqrt{n}$ . Note that in the i.i.d. regime, since $J(cX)=J(X)/c^{2}$ (see e.g. [11, Eq. (2.3)]) we know that $J_{\rm st}(U_{n})=\sigma^{2}J(U_{n})-1=\sigma^{2}J(S_{n}/\sqrt{n})-1=\sigma^{2}nJ(S_{n})-1$ (scale-invariance of $J_{\rm st}$ ).
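The scaling relation $J(cX)=J(X)/c^{2}$ and the resulting scale-invariance of $J_{\rm st}$ can be checked numerically. The sketch below uses the logistic distribution purely as an illustrative non-Gaussian example (it does not appear in the paper), for which the score is $-\tanh(x/2)$ and $J(X)=1/3$ at unit scale; the integration range and step count are arbitrary choices.

```python
import math

def logistic_pdf(x, scale):
    """Density of scale * L, where L is standard logistic."""
    e = math.exp(-abs(x) / scale)  # the density is symmetric; abs() avoids overflow
    return e / (scale * (1.0 + e) ** 2)

def logistic_score(x, scale):
    """Score function d/dx log p(x) of the same density."""
    return -math.tanh(x / (2.0 * scale)) / scale

def fisher_information(scale, half_width=40.0, steps=20_000):
    """Midpoint-rule approximation of J = integral of score^2 * density."""
    lo = -half_width * scale
    h = 2.0 * half_width * scale / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += logistic_score(x, scale) ** 2 * logistic_pdf(x, scale)
    return total * h

def standardized_fisher_information(scale):
    """J_st = Var * J - 1, using the logistic variance (pi * scale)^2 / 3."""
    variance = (math.pi * scale) ** 2 / 3.0
    return variance * fisher_information(scale) - 1.0

c = 2.5
J_X, J_cX = fisher_information(1.0), fisher_information(c)
print(J_X, J_cX * c ** 2)  # both approximate J(X)
print(standardized_fisher_information(1.0), standardized_fisher_information(c))
```

The two values printed on each line agree up to quadrature error, illustrating that $J$ scales as $1/c^{2}$ while $J_{\rm st}$ is unchanged.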

Proposition 4.1.

For i.i.d. $Y_{1},\ldots,Y_{n}$ the standardized Fisher information satisfies

$$J_{\rm st}(U_{n})\leq\frac{J_{\rm st}(Y)}{1+\Theta^{(n)}}.$$

Proof.

Observe (see for example [13, Eq. (3)], [33]) that the score function of the sum satisfies

[TABLE]

which we can rewrite as $\varrho_{S_{n}}=(C_{(n)}\varrho_{Y})$ . Hence if we expand the score function as a sum of eigenfunctions

[TABLE]

then Definition 2.2 gives that:

[TABLE]

Further, direct calculation using integration by parts gives that

[TABLE]

This means that, using the fact that (see Remark 2.3.2) $f^{(1)}_{1}(x)=x/\sigma$ and $g^{(n)}_{1}(y)=y/(\sigma\sqrt{n})$ with $\mu^{(n)}_{1}=1/\sqrt{n}$ , we can write the standardized score functions of $Y$ and $S_{n}$ from (4) as sums of eigenfunctions starting at index 2, as:

[TABLE]

Then, direct calculation using the orthonormality of $f^{(1)}$ and $g^{(n)}$ gives that:

[TABLE]

using the fact that $n\left(\mu^{(n)}_{k}\right)^{2}=n\lambda^{(n)}_{k}\leq 1/(1+\Theta^{(n)})$ for $k\geq 2$ by (10). ∎

We can use a similar argument to prove the lower bound on Fisher information, Lemma 1.3:

Proof of Lemma 1.3.

As in Lemma 3.3 consider the function $h(s)=s^{2}-as-n\sigma^{2}$ where $a=\mathbb{E}Y^{3}/\sigma^{2}=\gamma_{3}\sigma$ . As above, since $\mathbb{E}h(S_{n})S_{n}=0$

[TABLE]

Now considering the LHS of (34) using Cauchy-Schwarz, we deduce that

[TABLE]

since as before $\mathbb{E}h(S_{n})^{2}=n\sigma^{4}\Sigma+2n(n-1)\sigma^{4}$ , and the result follows by rearrangement. ∎

This lower bound tightens [23, Lemma 1.4], which (in our notation) can be expressed as

[TABLE]

where the original result is expressed in terms of the excess kurtosis $k=\mathbb{E}Y^{4}/\sigma^{4}-3=\Sigma+\gamma_{3}^{2}-2$ .

5 Higher order Dembo–Kagan–Shepp terms

Proposition 4.1 gives one part of the proof of Theorem 1.2. However, this result as stated is not particularly helpful, since the form of the dependence of $\Theta^{(n)}$ on $n$ is not immediately clear. We complete the proof of Theorem 1.2 by proving Proposition 5.2 below, which allows us to control $\Theta^{(n)}$ .

The key observation is that we can analyse higher order terms in the Dembo–Kagan–Shepp argument, following the proof of [17, Lemma 2].

Lemma 5.1.

Fix $k>\ell\geq 2$ , and consider a function $h$ with $\mathbb{E}h(S_{k})=0$ . Then

[TABLE]

where $h_{1}(u)=\mathbb{E}h(u+Y_{2}+\ldots+Y_{k})$ and $\widehat{h}(v)=\mathbb{E}h(v+Y_{\ell+1}+\ldots+Y_{k})$ .

Proof.

We adopt the same notation as [17, Section 2]. As in [17, Eq. (14), (15)], we can perform an Efron–Stein (ANOVA) expansion [18] of $h$ and $\widehat{h}$ (using the same functions $h_{i}$ in each case) to obtain

[TABLE]

The key observation is that for any $k>\ell\geq 2$ and any $r\geq 2$ , direct comparison of the two terms gives

[TABLE]

with equality if and only if $r=2$ . Applying this to the Efron–Stein decompositions (36) and (37) we obtain

[TABLE]

as required. ∎

We now deduce a result which, when combined with Proposition 4.1 above, completes the proof of Theorem 1.2:

Proposition 5.2.

The quantity $\Theta^{(k)}/(k-1)$ is non-decreasing in $k$ . Specifically for any $n\geq 2$ :

$$\Theta^{(n)}\geq(n-1)\Theta^{(2)}.\qquad(39)$$

Proof.

The key fact is that the function $h_{1}$ arising in Lemma 5.1 can be understood as the conditional expectation of both $h$ and $\widehat{h}$ (this is remarked at the foot of [17, P.345], and is due to orthogonality of the Efron–Stein decomposition). That is, for $k>\ell\geq 2$ we can write

[TABLE]

since $\widehat{h}(v)=\mathbb{E}h(v+Y_{\ell+1}+\ldots+Y_{k})$ . Hence for any $h$ (and hence $\widehat{h}$ and $h_{1}$ ) we can write

[TABLE]

so the RHS of (38) becomes

[TABLE]

or dividing by $k\mathbb{E}h_{1}(Y)^{2}$ and taking the optimal $h$ :

[TABLE]

and the result (39) follows on taking $k=n$ and $\ell=2$ . ∎

Note that we can weaken the assumption that $\Theta^{(2)}>0$ to ensure $O(1/n)$ convergence of Fisher information, to simply require that $\Theta^{(m)}>0$ for some $m$ . If this is true, we can simply replace (39) by a bound of the form $\Theta^{(n)}\geq(n-1)\Theta^{(m)}/(m-1)$ and substitute this in Proposition 4.1 instead.
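To spell out this substitution, here is a sketch under the assumption that Proposition 4.1 takes the form $J_{\rm st}(U_{n})\leq J_{\rm st}(Y)/(1+\Theta^{(n)})$ (the form consistent with Theorem 1.2 and (39)) and that $J_{\rm st}(Y)<\infty$. For $n\geq m\geq 2$ with $\Theta^{(m)}>0$, the monotonicity of $\Theta^{(k)}/(k-1)$ gives:

```latex
\[
J_{\rm st}(U_{n})
\;\leq\; \frac{J_{\rm st}(Y)}{1+\Theta^{(n)}}
\;\leq\; \frac{J_{\rm st}(Y)}{1+\frac{n-1}{m-1}\,\Theta^{(m)}}
\;=\; O\!\left(\frac{1}{n}\right).
\]
```

The only change from the case $m=2$ is the constant $\Theta^{(m)}/(m-1)$ in place of $\Theta^{(2)}$.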

Example 5.3.

In the Gaussian and gamma cases of Examples 3.1 and 3.2 the result of Proposition 5.2 is sharp. That is, if $Y_{i}\sim N(0,\sigma^{2})$ then recall that $\Theta^{(k)}=k-1$ and hence $\Theta^{(k)}/(k-1)\equiv 1$ . Similarly if $Y_{i}\sim\Gamma(\beta,1)$ then $\Theta^{(k)}=(k-1)\beta/(\beta+1)$ and $\Theta^{(k)}/(k-1)\equiv\beta/(\beta+1)$ .

This sharpness holds because in both Examples 3.1 and 3.2 the optimal eigenfunction is quadratic, so in the Efron–Stein decomposition we have $h_{3}=h_{4}=\ldots=0$ .
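The gamma case also gives a closed-form check of the bound of Theorem 1.2, $J_{\rm st}(U_{n})\leq J_{\rm st}(Y)/(1+\Theta^{(2)}(n-1))$. The sketch below relies on two standard facts that are not derived in this section and should be read as assumptions here: $J(\Gamma(\beta,1))=1/(\beta-2)$ for $\beta>2$, and $S_{n}\sim\Gamma(n\beta,1)$. The function names are illustrative.

```python
from fractions import Fraction

def gamma_standardized_fisher(shape):
    """J_st of Gamma(shape, 1): Var * J - 1 with Var = shape and J = 1/(shape - 2)."""
    if shape <= 2:
        raise ValueError("Fisher information of Gamma(shape, 1) is finite only for shape > 2")
    return Fraction(shape) / (shape - 2) - 1

def theorem_bound(beta, n):
    """Right-hand side J_st(Y) / (1 + Theta2 * (n - 1)) with Theta2 = beta / (beta + 1)."""
    theta2 = Fraction(beta) / (beta + 1)
    return gamma_standardized_fisher(beta) / (1 + theta2 * (n - 1))

beta = Fraction(3)
for n in (1, 2, 5, 50):
    # S_n ~ Gamma(n * beta, 1), and J_st is unchanged by the 1/sqrt(n) scaling
    exact = gamma_standardized_fisher(n * beta)
    bound = theorem_bound(beta, n)
    print(n, exact, bound, exact <= bound)
```

Here the exact value is $2/(n\beta-2)$, so both sides decay at rate $1/n$ and the bound holds with equality at $n=1$.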

Remark 5.4.

By combining Proposition 5.2 with Equation (19) we can deduce that $\Theta^{(n)}$ is bounded above and below by linear functions in $(n-1)$ , assuming $\Theta^{(2)}$ and $\Sigma$ are non-zero, as

[TABLE]

Remark 5.5.

Although not mentioned in [17], similar arguments show that under regularity conditions there should be a unique eigenfunction achieving eigenvalue $1/n$ (we know from Remark 2.3.3 above that the linear functions achieve this). That is, assuming $k>\ell$ there is equality in

[TABLE]

if and only if $r=1$ . Hence there is equality in [17, Lemma 2] if and only if $\mathbb{E}h_{2}^{2}=\mathbb{E}h_{3}^{2}=\ldots=0$ . Hence except on a set of measure 0 we know that

[TABLE]

Assuming $h$ is twice differentiable then taking a derivative with respect to $y_{1}$ and $y_{2}$ this implies $h^{\prime\prime}(z)\equiv 0$ for all $z$ , so $h$ is linear.

6 Strengthened monotonicity

We can extend the arguments above to deduce a stronger form of monotonicity of Fisher information than that obtained by [4] and [13], at least in the i.i.d. case:

Definition 6.1.

For $m\leq n$ , define $C_{(n,m)}^{*}$ by $\left(C_{(n,m)}^{*}g\right)(y)=\mathbb{E}\left[g(S_{n})|S_{m}=y\right]$ , and write $\lambda^{(n,m)}_{k}$ for the ordered eigenvalues of $C_{(n,m)}^{*}C_{(n,m)}$ , where $\lambda^{(n,m)}_{0}=1$ and (by DKS [17] (3)) $\lambda^{(n,m)}_{1}=m/n$ . Again, write $\mu^{(n,m)}_{k}=\sqrt{\lambda^{(n,m)}_{k}}$ .

Define a generalization of $\Theta^{(n)}$ as

[TABLE]

As before, the Dembo–Kagan–Shepp identity [17] ensures that $\Theta^{(n,m)}\geq 0$ . Note we recover Definition 2.4 by taking $m=1$ . We now give a result which generalizes Proposition 4.1.

Proposition 6.2.

For i.i.d. $Y_{1},\ldots,Y_{n}$ the standardized Fisher information satisfies

[TABLE]

Proof.

We repeat the steps of the proof of Proposition 4.1. Again (see for example [13, Eq. (3)], [33]) the score function of the sum satisfies

[TABLE]

which we can rewrite as $\varrho_{S_{n}}=(C_{(n,m)}\varrho_{S_{m}})$ . Hence if we expand the score function

[TABLE]

then

[TABLE]

As before, direct calculation using integration by parts gives that

[TABLE]

Again (as in Remark 2.3.2) we have $f^{(m)}_{1}(y)=y/(\sigma\sqrt{m})$ and $g^{(n)}_{1}(s)=s/(\sigma\sqrt{n})$ with $\mu^{(n,m)}_{1}=\sqrt{m/n}$ , so we can write the standardized score functions of $S_{m}$ and $S_{n}$ from (4) as sums of eigenfunctions starting at index 2, as:

[TABLE]

Just as before, we can use the orthonormality of $f^{(m)}$ and $g^{(n)}$ to deduce

[TABLE]

using the fact that $n\lambda^{(n,m)}_{k}\leq m/(1+\Theta^{(n,m)})$ for $k\geq 2$ . ∎

As in [13], taking the Dembo–Kagan–Shepp bound $\Theta^{(n,m)}\geq 0$ in Proposition 6.2 we recover the monotonicity of standardized Fisher information [4]. However, we can obtain better bounds by taking $k=n$ and $\ell=m$ in Lemma 5.1 to obtain

[TABLE]

Rearranging, and optimizing over $h$ we deduce that

[TABLE]

Since this is an increasing function of $\Theta^{(m)}$ , we can replace $\Theta^{(m)}$ by the lower bound $(m-1)\Theta^{(2)}$ from (39) to obtain

[TABLE]

which, in Proposition 6.2 allows us to deduce the stronger form of monotonicity that:

Theorem 6.3.

Consider i.i.d. $Y_{i}\sim Y$ with mean $0$ and variance $\sigma^{2}<\infty$ and smooth densities on $\mathbb{R}$ . Writing $\Theta^{(2)}$ for the quantity from Definition 2.4, the standardized Fisher information has the property that

[TABLE]

Note that this is a simultaneous strengthening of Theorem 1.2 and of the monotonicity of Fisher information proved in the i.i.d. case by Artstein et al. [4] and [13].

7 Future work

We briefly mention some future directions for research. Note that some progress is made towards 1. and 2. in Appendix A below:

1. In order to increase the value of these results, it is a natural question to ask for sufficient conditions (in terms of the density $p_{Y}$ or other related quantities) under which $\Theta^{(2)}>0$ , and indeed to give explicit bounds of the form $\Theta^{(2)}\geq c$ for some $c>0$ .

2. Additionally, it would be of value to give conditions on $U$ under which we can bound $\Theta^{(2)}$ uniformly away from $0$ for all $t>0$ , for random variables of the form $Y=U+Z_{t}$ , where $Z_{t}$ is an independent Gaussian perturbation. Such a result would allow us to derive $O(1/n)$ convergence of relative entropy using the de Bruijn identity [33].

3. Since the monotonicity of entropy is equivalent to strengthened forms of Shannon’s Entropy Power Inequality (see [4, 26, 34]), it would be of interest to know if the strengthened monotonicity result Theorem 6.3 implies a stronger Entropy Power Inequality.

4. The results of this paper very much rely on the i.i.d. assumption. It is of interest to weaken this to the independent, but not identical setting, and indeed to dependent random variables, for example in the exchangeable setting. For example, Peccati [29] shows that a decomposition of the Efron–Stein type used to establish the Dembo–Kagan–Shepp identity holds if an exchangeable sequence has the ‘weak independence’ property. It is a natural question whether the results of this paper hold in that setting.

5. Following recent trends in information-theoretic Central Limit Theorems, it would be of interest to extend the results of this paper to the setting of $\mathbb{R}^{d}$ , and to understand the behaviour of the eigenfunctions of $C_{(n)}^{*}C_{(n)}$ in this setting, where the equivalent of (29) still holds (see e.g. [22, Lemma 3.4]).

Appendix A Proof of Theorem 2.5

The following argument was provided by an anonymous referee, for which the author is extremely grateful.

We write $\gamma_{\boldsymbol{\mu};\Sigma}$ for a Gaussian density centred at $\boldsymbol{\mu}$ with covariance matrix $\Sigma$ . As before, we write $D_{\chi^{2}}(f\|g)=\int(f(x)/g(x)-1)^{2}g(x)dx=\int f(x)^{2}/g(x)dx-1$ for the $\chi^{2}$ -divergence. We first state two lemmas:

Lemma A.1.

For any coupling of ${\mathbf{V}}\sim p$ and ${\mathbf{W}}\sim q$ :

[TABLE]

Proof.

Follows immediately from the joint convexity of $f$ -divergences (see for example [16, Lemma 4.1]). ∎

Lemma A.2.

For any $\rho\in(-1,1)$ , write $I_{2}$ for the two-dimensional identity matrix, and define the positive semi-definite matrix

[TABLE]

then for any ${\mathbf{x}},{\mathbf{y}}\in\mathbb{R}^{2}$ and $\delta>0$ :

[TABLE]

Proof.

The key is that for ${\mathbf{u}}\in\mathbb{R}^{2}$ we can express the ratio

[TABLE]

as a product of Gaussian densities, which integrate to 1. ∎
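The mechanism behind this proof, that a squared Gaussian density divided by a Gaussian density is again proportional to a Gaussian density, can be seen in a one-dimensional analogue. The sketch below is an illustration rather than the statement of Lemma A.2: it checks numerically that the two expressions for $D_{\chi^{2}}$ agree for $f=\gamma_{\mu;\sigma^{2}}$ and $g=\gamma_{0;\sigma^{2}}$, and that both match the closed form $e^{\mu^{2}/\sigma^{2}}-1$ obtained by completing the square. The parameter values and integration range are arbitrary.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def chi2_divergence(f, g, lo, hi, steps=40_000):
    """Midpoint-rule values of both expressions for D_chi2(f || g)."""
    h = (hi - lo) / steps
    first = second = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        fx, gx = f(x), g(x)
        first += (fx / gx - 1.0) ** 2 * gx  # integrand of (f/g - 1)^2 g
        second += fx * fx / gx              # integrand of f^2 / g
    return first * h, second * h - 1.0

mu, var = 0.7, 1.5
f = lambda x: gaussian_pdf(x, mu, var)
g = lambda x: gaussian_pdf(x, 0.0, var)

form_a, form_b = chi2_divergence(f, g, -25.0, 25.0)
closed_form = math.exp(mu ** 2 / var) - 1.0
print(form_a, form_b, closed_form)
```

All three printed values agree up to quadrature error.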

Proof of Theorem 2.5.

Write $X_{i}$ for i.i.d. copies of $X\sim p_{X}$ and independent $Z_{i}\sim\gamma_{0;\delta^{2}}$ , and define regularized $Y_{i}=X_{i}+Z_{i}$ . Define $V_{n}=n^{-1/2}\sum_{i=1}^{n}X_{i}$ and $U_{n}=n^{-1/2}\sum_{i=1}^{n}Y_{i}$ . Further, define $X_{i}^{\prime}$ and $Z_{i}^{\prime}$ to be independent copies of $X_{i}$ and $Z_{i}$ respectively, write $Y_{i}^{\prime}=X_{i}^{\prime}+Z_{i}^{\prime}$ , and define $V_{n}^{\prime}=n^{-1/2}(X_{1}^{\prime}+X_{2}+\ldots+X_{n})$ and $U_{n}^{\prime}=n^{-1/2}(Y_{1}^{\prime}+Y_{2}+\ldots+Y_{n})$ .

Using the invariance of $f$ -divergences under $1-1$ mappings we can write

[TABLE]

For any $n$ , we can consider the coupling between $p_{X,V_{n}}$ and $p_{X}\times p_{V_{n}}$ given by $\left((X_{1},V_{n}),(X_{1},V_{n}^{\prime})\right)$ , and note that $(X_{1},V_{n})-(X_{1},V_{n}^{\prime})=(0,n^{-1/2}(X_{1}-X_{1}^{\prime}))$ . This means that we can express the $\chi^{2}$ -divergence arising in the formula for $T_{n}(Y)-1$ as

[TABLE]

where we apply Lemma A.1 followed by Lemma A.2, and where $\rho=1/\sqrt{n}$ .

If $X$ is sub-Gaussian, then (see for example [35, Proposition 2.6.1]) so is $(X-X^{\prime})$ , and hence (see [35, Proposition 2.5.2]) there exists a constant $\theta$ such that the moment generating function of $(X-X^{\prime})^{2}$ satisfies

[TABLE]

when $|\lambda|\leq 1/\theta$ . Hence taking $\lambda^{2}=\frac{1}{(n-1)\delta^{2}}$ we deduce that (49) is bounded above by

[TABLE]

assuming that $\theta^{2}\leq 1/\lambda^{2}$ , or equivalently $n\geq 1+\theta^{2}/\delta^{2}$ . ∎

Observe that we can use (49) to deduce asymptotic bounds on $\Theta^{(n)}$ for sub-Gaussian random variables under sufficiently large amounts of Gaussian regularization. That is, if we write $\lambda_{k}^{n,\delta}$ for the eigenvalues of the operator $C_{(n)}^{*}C_{(n)}$ , then we can express the trace as

[TABLE]

so that Taylor expanding the exponential

[TABLE]

using the fact that (see [35, P.26]) $\mathbb{E}|X_{1}-X_{1}^{\prime}|^{2r}\leq\phi^{2r}(2r)\Gamma(r)=2\phi^{2r}r!$ . Hence we deduce:

Corollary A.3.

Writing $Y_{i}=X_{i}+Z_{i}$ where $X_{i}$ is sub-Gaussian and $Z_{i}\sim\gamma_{0;\delta^{2}}$ , and writing $\lambda_{k}^{n,\delta}$ for the eigenvalues of the operator $C_{(n)}^{*}C_{(n)}$ we deduce:

[TABLE]

and hence $\Theta^{(n)}>0$ for $n$ sufficiently large for sub-Gaussian random variables regularized by a sufficiently large amount.

Note that this allows us to deduce $O(1/n)$ convergence of Fisher information for random variables of this type. Indeed, it allows us to deduce $O(1/n)$ convergence of relative entropy using the de Bruijn identity (see for example [22, Eq. (1.110)]) which expresses the relative entropy of a random variable $U$ with density $f$ to a standard Gaussian density $\gamma_{0,1}$ as the integral of standardized Fisher information

[TABLE]

Using (50), since adding $Z_{\tau}$ provides extra regularization we can deduce bounds for all $\tau$ on the second-largest eigenvalue of the form $\frac{2{\rm Var\;}(X)}{\delta^{2}+\tau}$ , and combining Theorem 1.2 with (51) we deduce $O(1/n)$ convergence of relative entropy, using the fact that

[TABLE]

for $C>D$ .

Acknowledgements

The author would like to thank Professor Thomas Courtade of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley for extremely helpful discussions regarding this work, and for numerous pointers to relevant papers in the literature. I would also like to thank Professor Venkat Anantharam of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley for valuable suggestions concerning the maximal correlation. The idea to consider the eigenfunctions in the maximal correlation problem grew out of a Twitter conversation with Dr James V Stone, Honorary Reader in Vision and Computational Neuroscience at the University of Sheffield. The author would like to thank the Associate Editor, and three anonymous referees for their close reading of this paper and extremely helpful suggestions.

Bibliography

[1] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55 of National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, 1964.
[2] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover, 2013. See: arXiv:1304.6133.
[3] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. On the rate of convergence in the entropic central limit theorem. Probab. Theory Related Fields, 129(3):381–390, 2004.
[4] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on the monotonicity of entropy. J. Amer. Math. Soc., 17(4):975–982 (electronic), 2004.
[5] K. Ball and V. H. Nguyen. Entropy jumps for isotropic log-concave random vectors and spectral gap. Studia Mathematica, 213:81–96, 2012.
[6] A. R. Barron. Entropy and the Central Limit Theorem. Ann. Probab., 14(1):336–342, 1986.
[7] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Rate of convergence and Edgeworth-type expansion in the entropic central limit theorem. Ann. Probab., 41(4):2479–2512, 2013.
[8] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Berry–Esseen bounds in the entropic central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
