Chentsov's theorem for exponential families

James G. Dowty

arXiv:1701.08895·math.ST·May 23, 2017

TL;DR

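This paper proves a version of Chentsov's theorem for exponential families: up to rescaling, the Fisher information metric is the only Riemannian metric invariant under IID extensions and canonical sufficient statistics. The proof is based on the central limit theorem, so it treats discrete and continuous families together and is less technical than previous approaches.

Contribution

It proves a version of Chentsov's theorem for exponential families, characterizing the Fisher information metric as the only Riemannian metric (up to rescaling) on an exponential family and its derived families that is invariant under independent and identically distributed (IID) extensions and canonical sufficient statistics.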

Findings

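01. Up to rescaling, the Fisher information metric is the only invariant Riemannian metric on an exponential family and its derived families

02. The proof is unified for discrete and continuous exponential families

03. The proof rests on the central limit theorem and is less technical than previous approaches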

Equations

$$su+tv=(\theta,sa+tb)$$

$$P=p_{\theta}\mu\quad\text{and}\quad A=\sum_{i=1}^{d}a_{i}\frac{\partial p_{\theta}}{\partial\theta_{i}}\mu.$$

$$g^{F}(\tilde{u},\tilde{v})=\int\frac{dA}{dP}\frac{dB}{dP}\,dP$$

$$\Theta=\left\{\theta\in\mathbb{R}^{d}\;\middle|\;\int\exp(\theta\cdot T)\,d\mu<\infty\right\},$$

$$p_{\theta}(x)=\exp(\theta\cdot T(x))/Z(\theta)$$

$$p_{\theta}^{(n)}=\exp(n\theta\cdot T_{n}-n\log Z(\theta)),$$

$$(\phi_{*}P)(U)=P(\phi^{-1}(U))$$

$$Y\sim P\quad\text{implies}\quad\phi(Y)\sim\phi_{*}P.$$

$$q_{\theta}^{n}(y)=\exp(n\theta\cdot y-n\log Z(\theta))$$

$$g^{n}(\tilde{u}^{n},\tilde{v}^{n})=n\,g(\tilde{u},\tilde{v})$$

$$Q_{n}=q_{\theta}^{n}\nu_{n}\quad\text{and}\quad A_{n}=\sum_{i=1}^{d}a_{i}(\partial q_{\theta}^{n}/\partial\theta_{i})\nu_{n}.$$

$$g_{n}(\tilde{u}_{n},\tilde{v}_{n})=g^{n}(\tilde{u}^{n},\tilde{v}^{n})$$

$$g(\tilde{u},\tilde{v})=\left[h^{2}(\tilde{u}+\tilde{v})-h^{2}(\tilde{u}-\tilde{v})\right]/4$$

$$h_{n}(\tilde{u}_{n})=H(\tilde{u}_{n})$$

$$H(P_{n},fP_{n})=H(\Phi,f\Phi)$$

$$H(L_{**}(P,A))=H(P,A)$$

$$L_{*}(fP)=(f\circ L^{-1})L_{*}P$$

$$H^{F}(Q,fQ)^{2}=\int(c\cdot y)^{2}\,dQ(y)=c^{T}\left(\int yy^{T}\,dQ(y)\right)c=c^{T}Ic=\|c\|^{2},$$

$$\Delta=\left\{(P_{1},\dots,P_{n})\in\prod^{n}\mathcal{M}\;\middle|\;P_{1}=\dots=P_{n}\right\}$$

$$(Y_{1}+\dots+Y_{n})/n\sim Q_{n}.$$

$$\tau_{\theta}=\int y\,dQ_{1}(y)=\int y\,dQ_{n}(y),$$

$$\Sigma_{\theta}=\int(y-\tau_{\theta})(y-\tau_{\theta})^{T}\,dQ_{1}(y)=n\int(y-\tau_{\theta})(y-\tau_{\theta})^{T}\,dQ_{n}(y).$$

$$A_{n}=\sum_{i=1}^{d}a_{i}\frac{\partial q_{\theta}^{n}}{\partial\theta_{i}}\nu_{n}=\sum_{i=1}^{d}a_{i}\,n\left(\iota_{i}-\frac{\partial\log Z}{\partial\theta_{i}}\right)q_{\theta}^{n}\nu_{n}=na\cdot(\iota-\tau_{\theta})Q_{n},$$

$$L_{*}A_{n}=na\cdot(\iota\circ L^{-1}-\tau_{\theta})L_{*}Q_{n}=nfL_{*}Q_{n},$$

$$\begin{aligned}h(\tilde{u})&=h_{n}(n^{-1/2}\tilde{u}_{n})&&\text{by the bilinearity of }g_{n}\\&=H(n^{-1/2}\tilde{u}_{n})&&\text{by (\ref{E:calcH})}\\&=H(n^{-1/2}L_{**}\tilde{u}_{n})&&\text{by (\ref{E:affH})}\\&=H(L_{*}Q_{n},fL_{*}Q_{n})&&\text{by (\ref{E:vecsp}) and (\ref{E:AnL}).}\end{aligned}$$

$$h(\tilde{u})=H(\Phi,f\Phi)\quad\text{by (\ref{E:contH})},$$

$$M\Sigma_{\theta}^{1/2}a=\Sigma_{\phi}^{1/2}b.$$

$$M_{**}(\Phi,f\Phi)=(M_{*}\Phi,M_{*}(f\Phi))=(M_{*}\Phi,(f\circ M^{-1})M_{*}\Phi)=(\Phi,e\Phi)$$
Full text

Abstract

Chentsov’s theorem characterizes the Fisher information metric on statistical models as essentially the only Riemannian metric that is invariant under sufficient statistics. This implies that each statistical model is naturally equipped with a geometry, so Chentsov’s theorem explains why many statistical properties can be described in geometric terms. However, despite being one of the foundational theorems of statistics, Chentsov’s theorem has only been proved previously in very restricted settings or under relatively strong regularity and invariance assumptions. We therefore prove a version of this theorem for the important case of exponential families. In particular, we characterise the Fisher information metric as the only Riemannian metric (up to rescaling) on an exponential family and its derived families that is invariant under independent and identically distributed extensions and canonical sufficient statistics. Our approach is based on the central limit theorem, so it gives a unified proof for both discrete and continuous exponential families, and it is less technical than previous approaches.

1 Introduction

Chentsov’s theorem is a foundational theorem in statistics that characterizes the Fisher information metric on statistical models as the only Riemannian metric (up to rescaling) that is invariant under certain, statistically important transformations [10, 16, 9, 2, 7]. This effectively means that the Fisher information metric is the only natural metric on a statistical model, so many statistical properties of these models should be describable in terms of this metric. Known examples of this correspondence between statistical and geometric properties include: the Cramér-Rao lower bound for the variance of an unbiased estimator in terms of the inverse of the Fisher information metric [1, Thm. 2.2]; orthogonality as a criterion for first-order efficiency of estimators [1, Thm. 4.3]; the central role of statistical curvature in the information loss of an efficient estimator [12, §3.3] and in second-order efficiency [12, §3.4]; and the spontaneous emergence of the Fisher information volume [15] in the minimum description length (MDL) approach to statistical model selection [6].

The original version of Chentsov’s theorem [10, 16, 9] only applied in the restricted setting of statistical models with finite data spaces. This version of the theorem says that the Fisher information metric is the only metric (up to a multiplicative constant) that is defined on all models with finite data spaces and is invariant under all sufficient statistics. Recall that a statistical model $\mathcal{M}$ is a (sufficiently regular) set of probability measures on the same measurable space $\mathcal{X}$ , which we call the data space of $\mathcal{M}$ , and that a sufficient statistic for $\mathcal{M}$ is a function on $\mathcal{X}$ for which the conditional distribution of any measure $P$ in $\mathcal{M}$ , given the sufficient statistic, is the same for all $P$ . Sufficient statistics induce corresponding maps on statistical models (the measure-theoretic push-forward maps) and the invariance assumption above is that all of these maps are isometries (i.e., distance-preserving maps).

Since the assumption of finite data spaces is very restrictive, Ay et al. [2] proved a version of Chentsov’s theorem that applies to models whose data spaces $\mathcal{X}$ are smooth manifolds. Their version says that the Fisher information metric is the only metric (up to rescaling) that is defined on all statistical models with a given data space $\mathcal{X}$ and is invariant under all sufficient statistics, including discontinuous ones. This version of Chentsov’s theorem applies to many interesting statistical models but it makes very strong assumptions about both the breadth of the models on which the metrics are defined and the invariance properties of these metrics. Therefore Bauer et al. [7] proved a version of Chentsov’s theorem which says the Fisher information metric is the only metric (up to rescaling) that firstly is defined on the space of all smooth, positive densities on a compact manifold $\mathcal{X}$ of dimension $2$ or higher and secondly is invariant under all diffeomorphisms from $\mathcal{X}$ to itself (where diffeomorphisms are smooth maps with smooth inverses, so they are a special type of sufficient statistic). The proof of Bauer et al. [7] was based on results from the theory of generalized functions, especially the Schwartz kernel theorem [11, §6.1], and it made far weaker invariance assumptions than that of Ay et al. [2]. The assumption that $\mathcal{X}$ is a compact manifold without boundary excludes many cases of interest to statisticians, though Bauer et al. [7] say this assumption can be weakened.

Despite their beauty and generality, the results of Ay et al. [2] and Bauer et al. [7] leave open the possibility that there might exist a natural metric other than the Fisher information metric on an individual statistical model $\mathcal{M}$ . This could occur, for example, if there is a natural metric on $\mathcal{M}$ that does not (invariantly) extend to a metric on the infinite-dimensional models of [2] and [7] that contain $\mathcal{M}$ and many unrelated models. Also, exponential families have a distinguished, finite-dimensional set of sufficient statistics, called the canonical sufficient statistics, which are related to their natural affine structures ([1, Thm. 2.4] and [3, Lemma 8.1]). Therefore, the invariance assumptions of [2] and [7] are arguably too strong for exponential families, and instead it would be more natural to consider invariance under canonical sufficient statistics rather than all sufficient statistics.

In this paper, we prove a refined version of Chentsov’s theorem in the important case of exponential families. Instead of considering metrics defined on an infinite-dimensional statistical model, as in [2] and [7], we consider metrics defined only on a given exponential family $\mathcal{M}$ and some of its derived families, namely its independent and identically distributed (IID) extensions and their corresponding natural exponential families. Instead of assuming these metrics are invariant under all sufficient statistics or all diffeomorphisms, we assume invariance under canonical sufficient statistics and IID extensions. This assumption of invariance under IID extensions has no analogue in previous work, but IID extensions are natural and important transformations between statistical models (perhaps more so than sufficient statistics), so this invariance assumption is arguably more natural than invariance under sufficient statistics. Also, this extra invariance assumption is offset by the fact that we restrict our sufficient statistics to the canonical ones. Then, under a mild regularity condition, we prove that metrics with these invariance properties are multiples of the Fisher information metric (see Theorem 1 in Section 5). This result therefore gives a new characterisation of the Fisher information metric as the only metric on an exponential family and its derived families that is invariant under canonical sufficient statistics and IID extensions.

Our approach has a number of advantages: as discussed above, we only assume that the metric is defined on an individual model and its related models, and our invariance assumptions respect the natural affine structures of exponential families; we only consider metrics on a collection of finite-dimensional models (similar to the original version of Chentsov’s theorem [10, 16, 9]), which allows us to avoid the technicalities encountered in [2] and [7] because of the infinite-dimensionality of their statistical models; our proof is unified for discrete and continuous distributions, unlike the proofs of [10, 16, 9] and [7], so there is some hope of extending our proof to general statistical models; our proof shows that Chentsov’s theorem is a corollary of the central limit theorem, which makes this result more understandable and intuitive; and our results complement those of [7], since (curved) exponential families are essentially the only statistical models with smooth sufficient statistics that are not diffeomorphisms, by the Pitman–Koopman–Darmois theorem [5].

The rest of this paper is set out as follows. In Section 2 we define the Fisher information metric and some relevant notions from differential geometry, as they apply in our main case of interest. In Section 3 we briefly recall the definition of an exponential family and some of its derived families. We then give precise descriptions of our assumptions in Section 4, before using these assumptions and the central limit theorem to prove our characterisation of the Fisher information metric in Section 5. We then describe an extension of our proof to higher-order symmetric tensors in Section 6, before finishing with a discussion of our results in Section 7. Section 7 also begins with a non-technical summary of our proof.

2 The Fisher information metric

This section briefly recalls the definitions of tangent vectors and the Fisher information metric of a statistical model. A general reference for the notions from Riemannian geometry described here is [12, Appendix C].

In all later sections of this paper, we will take $\mathcal{M}$ to be a regular exponential family with natural parameter space $\Theta$ , but in this section we let $\mathcal{M}$ be a more general statistical model and let $\Theta$ be any smooth parameter space for $\mathcal{M}$ . More precisely, suppose $\Theta$ is an open subset of $\mathbb{R}^{d}$ and that $\mu$ is a measure on $\mathbb{R}^{m}$ with support $\mathcal{X}$ . Then our statistical model is $\mathcal{M}=\{p_{\theta}\mu\mid\theta\in\Theta\}$ , where each $p_{\theta}:\mathcal{X}\to\mathbb{R}_{>0}$ is a $\mu$ -integrable, strictly positive function that is normalized, meaning $1=\int p_{\theta}d\mu$ . Note that $\mathcal{M}$ is a set of probability measures on $\mathbb{R}^{m}$ . We assume that the parameterisation of $\mathcal{M}$ by $\Theta$ is smooth, in the sense that $\theta\mapsto p_{\theta}(x)$ is a smooth (i.e., infinitely differentiable) function for $\mu$ -almost all $x$ . We also assume that the parameterisation is non-singular, meaning that the parameterisation map $\Theta\to\mathcal{M}$ given by $\theta\mapsto p_{\theta}\mu$ is injective and that it maps non-zero tangent vectors to non-zero tangent vectors, in a sense which will become clear below.

Because $\Theta$ is an open subset of $\mathbb{R}^{d}$ , any tangent vector $u$ to $\Theta$ is a pair $u=(\theta,a)$ for some $\theta\in\Theta$ and some $a\in\mathbb{R}^{d}$ , where $\theta$ is called the base-point of $u$ . The set of all such tangent vectors, which is denoted $T\Theta$ and is called the tangent bundle of $\Theta$ , is therefore $T\Theta=\Theta\times\mathbb{R}^{d}$ . The tangent bundle is not a vector space in general, but the set of all tangent vectors with the same base-point is. The vector space $T_{\theta}\Theta$ consisting of all vectors with base-point $\theta$ is called the tangent space to $\Theta$ at $\theta$ . Addition and scalar multiplication in this vector space are given by

$$su+tv=(\theta,sa+tb)\tag{1}$$

for any $u,v\in T_{\theta}\Theta$ and any $s,t\in\mathbb{R}$ , where $u=(\theta,a)$ and $v=(\theta,b)$ . Note that addition and scalar multiplication in $T_{\theta}\Theta$ effectively ignore the shared base-point $\theta$ .

Similarly, we can view each tangent vector to the statistical model $\mathcal{M}$ as a pair $(P,A)$ , where the base-point $P$ is an element of the model $\mathcal{M}$ and $A$ is essentially the score in a particular direction [14, §3.3]. More precisely, for each tangent vector $u=(\theta,a)$ to $\Theta$ , there is a corresponding tangent vector $\tilde{u}=(P,A)$ to $\mathcal{M}$ given by

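$$P=p_{\theta}\mu\quad\text{and}\quad A=\sum_{i=1}^{d}a_{i}\frac{\partial p_{\theta}}{\partial\theta_{i}}\mu.\tag{2}$$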

(The function taking $u$ to $\tilde{u}$ is the differential of the parameterisation $\theta\mapsto p_{\theta}\mu$ [12, Def. C.3.4].) Let the tangent bundle $T\mathcal{M}$ of $\mathcal{M}$ be the set of all such tangent vectors, i.e., let $T\mathcal{M}=\{\tilde{u}\mid u\in T\Theta\}$ . Also, let the tangent space $T_{P}\mathcal{M}$ to $\mathcal{M}$ at $P\in\mathcal{M}$ be the vector space consisting of all tangent vectors $(P,A)\in T\mathcal{M}$ with base-point $P$ . Even though we have used a particular parameterisation of $\mathcal{M}$ to define $T_{P}\mathcal{M}$ , this tangent space is natural, in the sense that $T_{P}\mathcal{M}$ is the same for all smooth parameterisations.

The Fisher information metric $g^{F}$ on $\mathcal{M}$ is given by

$$g^{F}(\tilde{u},\tilde{v})=\int\frac{dA}{dP}\frac{dB}{dP}\,dP\tag{3}$$

for any tangent vectors $\tilde{u}=(P,A)$ and $\tilde{v}=(P,B)$ in the tangent space $T_{P}\mathcal{M}$ [7, §3], where $dA/dP$ and $dB/dP$ are Radon-Nikodym derivatives [8, §3.2]. It is straightforward (see Appendix A) to show that definition (3) for the Fisher information metric reduces to the usual, parameterisation-dependent definition [1, eq. 2.6]. However, the formulation (3) will be more useful to us than the usual definition. Also, because (3) is phrased only in terms of natural constructions, this formula makes it clear that $g^{F}$ does not depend on arbitrary choices, such as the choice of parameterisation.
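To make formula (3) concrete, the following minimal sketch evaluates it for a Bernoulli model in its natural parameter, where the integral of the product of the Radon-Nikodym derivatives $dA/dP$ and $dB/dP$ against $P$ reduces to a two-term sum. The Bernoulli example, the function names and the finite-difference step `eps` are illustrative assumptions and do not come from the paper.

```python
import math

def bernoulli_density(theta):
    """Density p_theta w.r.t. counting measure on {0, 1}, natural parameter theta."""
    z = 1.0 + math.exp(theta)  # partition function Z(theta)
    return {x: math.exp(theta * x) / z for x in (0, 1)}

def tangent_vector(theta, a, eps=1e-6):
    """Return (P, A) as in (2), with dp_theta/dtheta taken by central differences."""
    p = bernoulli_density(theta)
    hi = bernoulli_density(theta + eps)
    lo = bernoulli_density(theta - eps)
    A = {x: a * (hi[x] - lo[x]) / (2.0 * eps) for x in p}
    return p, A

def fisher_metric(P, A, B):
    """Formula (3): sum of (dA/dP)(dB/dP) weighted by P over the finite data space."""
    return sum((A[x] / P[x]) * (B[x] / P[x]) * P[x] for x in P)

theta = 0.3
P, A = tangent_vector(theta, a=2.0)
_, B = tangent_vector(theta, a=-1.5)

# Usual parameter-dependent form a * I(theta) * b, with I(theta) = s (1 - s).
s = 1.0 / (1.0 + math.exp(-theta))
closed_form = 2.0 * (-1.5) * s * (1.0 - s)
print(fisher_metric(P, A, B), closed_form)
```

The two printed values should agree up to finite-difference error, which illustrates in this one example how (3) reduces to the usual parameterisation-dependent expression.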

A Riemannian metric on a set is just a function that puts an inner product on each of the set’s tangent spaces (if the set is suitably regular and the inner products vary smoothly with the base-point). For example, a Riemannian metric on $\Theta$ can be thought of as a smooth, matrix-valued function on $\Theta$ whose value at $\theta\in\Theta$ is a $d\times d$ , symmetric, positive definite matrix $\bar{g}_{\theta}$ , since this defines an inner product on each $T_{\theta}\Theta$ with the inner product of any $u,v\in T_{\theta}\Theta$ being $g(u,v)=a^{T}\bar{g}_{\theta}b$ , where $u=(\theta,a)$ and $v=(\theta,b)$ .

In our main case of interest, where $\mathcal{M}$ is an exponential family, the integral in (3) always converges [12, Thm. 2.2.5]. Then it is not hard to see that (3) defines an inner product on each tangent space to $\mathcal{M}$ (and this varies smoothly with the base-point), so the Fisher information metric $g^{F}$ is a Riemannian metric on $\mathcal{M}$ .

3 Exponential families and their derived families

Partly to establish our notation, this section briefly recalls the definitions of an exponential family, its IID extensions and their corresponding natural exponential families.

3.1 Exponential families

Let $\mu$ be a measure on $\mathbb{R}^{m}$ and let $T:\mathcal{X}\to\mathbb{R}^{d}$ be a measurable function, where $\mathcal{X}\subseteq\mathbb{R}^{m}$ is the support of $\mu$ . Let

$$\Theta=\left\{\theta\in\mathbb{R}^{d}\;\middle|\;\int\exp(\theta\cdot T)\,d\mu<\infty\right\},$$

where the dot ( $\cdot$ ) denotes the Euclidean inner product on $\mathbb{R}^{d}$ . For each $\theta\in\Theta$ , define $p_{\theta}:\mathcal{X}\to\mathbb{R}_{>0}$ by

$$p_{\theta}(x)=\exp(\theta\cdot T(x))/Z(\theta)\tag{4}$$

for any $x\in\mathcal{X}$ , where $Z:\Theta\to\mathbb{R}$ is the partition function $Z(\theta)=\int\exp(\theta\cdot T)d\mu$ . Assume that $\Theta$ is a non-empty, open subset of $\mathbb{R}^{d}$ and that $T$ is full rank, in the sense that the image of $T$ is not contained in any $(d-1)$ -dimensional hyperplane in $\mathbb{R}^{d}$ . Then $\mathcal{M}=\{p_{\theta}\mu\mid\theta\in\Theta\}$ is a regular exponential family of order $d$ with dominating measure $\mu$ and canonical sufficient statistic $T$ , and all regular exponential families are of this form [3, §8.1]. Note that each element of $\mathcal{M}$ is a probability measure on $\mathbb{R}^{m}$ .
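As an illustration of this definition, the sketch below builds a small exponential family of order $d=2$ on a four-point data space and checks normalisation and the full-rank condition. The support, the statistic $T(x)=(x,x^{2})$ and all function names are hypothetical choices made only for this example.

```python
import math
from itertools import combinations

support = [0, 1, 2, 3]           # data space X (hypothetical example)
mu = {x: 1.0 for x in support}   # counting measure as dominating measure

def T(x):
    """Canonical sufficient statistic T: X -> R^2 (illustrative choice)."""
    return (float(x), float(x * x))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def Z(theta):
    """Partition function: integral of exp(theta . T) with respect to mu."""
    return sum(mu[x] * math.exp(dot(theta, T(x))) for x in support)

def p(theta, x):
    """Density p_theta(x) = exp(theta . T(x)) / Z(theta)."""
    return math.exp(dot(theta, T(x))) / Z(theta)

def is_full_rank():
    """For d = 2: the image of T is not contained in any line of R^2."""
    base = T(support[0])
    diffs = [(T(x)[0] - base[0], T(x)[1] - base[1]) for x in support[1:]]
    return any(abs(u[0] * v[1] - u[1] * v[0]) > 1e-12
               for u, v in combinations(diffs, 2))

theta = (0.4, -0.2)
total_mass = sum(p(theta, x) * mu[x] for x in support)
print(total_mass, is_full_rank())
```

Because the support is finite, $Z(\theta)$ is finite for every $\theta$, so in this toy case $\Theta=\mathbb{R}^{2}$ and is automatically open.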

3.2 IID extensions

The $n$ -fold IID extension $\mathcal{M}^{n}$ of $\mathcal{M}$ is the set $\mathcal{M}^{n}=\{P^{n}\mid P\in\mathcal{M}\}$ of all measures of the form $P^{n}$ for some $P\in\mathcal{M}$ , where $P^{n}=P\times\dots\times P$ (with $n$ copies of $P$ ) is the product measure on $\mathcal{X}^{n}$ [8, §3.3]. In terms of the parameterisation (4), $\mathcal{M}^{n}$ is the set of all measures of the form $p_{\theta}^{(n)}\mu^{n}$ for some $\theta\in\Theta$ , where $p_{\theta}^{(n)}:\mathcal{X}^{n}\to\mathbb{R}_{>0}$ is given by $p_{\theta}^{(n)}(x_{1},\dots,x_{n})=p_{\theta}(x_{1})\dots p_{\theta}(x_{n})$ and $\mu^{n}=\mu\times\dots\times\mu$ is the product measure on $\mathcal{X}^{n}$ [3, Example 8.12(ii)]. So by (4),

$$p_{\theta}^{(n)}=\exp(n\theta\cdot T_{n}-n\log Z(\theta)),\tag{5}$$

where $T_{n}:\mathcal{X}^{n}\to\mathbb{R}^{d}$ is given by $T_{n}(x_{1},\dots,x_{n})=(T(x_{1})+\dots+T(x_{n}))/n$ for any $x_{1},\dots,x_{n}\in\mathcal{X}$ . Therefore $\mathcal{M}^{n}$ is an exponential family with dominating measure $\mu^{n}$ and sufficient statistic $T_{n}$ (and natural parameter $n\theta$ , see [12, Thm. 2.2.6]). Note that $\mathcal{M}^{1}=\mathcal{M}$ , $T_{1}=T$ and $p_{\theta}^{(1)}=p_{\theta}$ .
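The identity behind the IID extension, namely that the product density equals $\exp(n\theta\cdot T_{n}-n\log Z(\theta))$, can be checked numerically. The following sketch does so for a Bernoulli family with $n=3$; the choice of family, the parameter value and the helper names are assumptions for illustration only.

```python
import math
from itertools import product

def Z(theta):
    """Partition function of the Bernoulli family w.r.t. counting measure on {0, 1}."""
    return 1.0 + math.exp(theta)

def p(theta, x):
    return math.exp(theta * x) / Z(theta)

def T_n(xs):
    """Averaged sufficient statistic (T(x_1) + ... + T(x_n)) / n with T(x) = x."""
    return sum(float(x) for x in xs) / len(xs)

def product_density(theta, xs):
    """p_theta(x_1) ... p_theta(x_n), computed factor by factor."""
    out = 1.0
    for x in xs:
        out *= p(theta, x)
    return out

def exponential_form(theta, xs):
    """exp(n theta T_n - n log Z(theta)), the exponential-family form of the product."""
    n = len(xs)
    return math.exp(n * theta * T_n(xs) - n * math.log(Z(theta)))

theta, n = -0.7, 3
max_gap = max(abs(product_density(theta, xs) - exponential_form(theta, xs))
              for xs in product((0, 1), repeat=n))
print(max_gap)
```

The gap should be at the level of floating-point rounding for every point of $\mathcal{X}^{n}$.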

3.3 Natural exponential families

Recall that if $\mathcal{Y}$ and $\mathcal{Z}$ are measurable spaces, $\phi:\mathcal{Y}\to\mathcal{Z}$ is a measurable function and $P$ is a measure on $\mathcal{Y}$ then the push-forward of $P$ via $\phi$ is the measure $\phi_{*}P$ on $\mathcal{Z}$ given by

$$(\phi_{*}P)(U)=P(\phi^{-1}(U))\tag{6}$$

for any measurable set $U$ in $\mathcal{Z}$ [8, §3.6]. This immediately implies that if $Y$ is a $\mathcal{Y}$ -valued random variable with distribution $P$ then $\phi(Y)$ is a $\mathcal{Z}$ -valued random variable with distribution $\phi_{*}P$ , which in symbols we write as

$$Y\sim P\quad\text{implies}\quad\phi(Y)\sim\phi_{*}P.\tag{7}$$

Then the natural exponential family corresponding to $\mathcal{M}^{n}$ and $T_{n}$ is the set $\mathcal{N}_{n}=\{{T_{n}}_{*}P^{n}\mid P^{n}\in\mathcal{M}^{n}\}$ of measures on $\mathbb{R}^{d}$ . By [3, Examples 8.12(ii) and 8.12(iii)], $\mathcal{N}_{n}=\{q_{\theta}^{n}\nu_{n}\mid\theta\in\Theta\}$ , where $\nu_{n}$ is a measure on $\mathbb{R}^{d}$ which does not depend on $\theta$ and $q_{\theta}^{n}:\mathbb{R}^{d}\to\mathbb{R}_{>0}$ is given by

$$q_{\theta}^{n}(y)=\exp(n\theta\cdot y-n\log Z(\theta))\tag{8}$$

for any $y\in\mathbb{R}^{d}$ . The formula (8) shows that the superscript in $q_{\theta}^{n}$ is actually an exponent, so we will write $q_{\theta}$ for $q_{\theta}^{1}$ (and then the notation $q_{\theta}^{n}$ is unambiguous).
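The push-forward construction can be made concrete in the Bernoulli case, where ${T_{n}}_{*}P^{n}$ is the distribution of a sample mean. The sketch below computes the push-forward by direct enumeration and compares it with $q_{\theta}^{n}\nu_{n}$, where $q_{\theta}^{n}(y)=\exp(n\theta\cdot y-n\log Z(\theta))$. Taking $\nu_{n}$ to be the push-forward of the product counting measure is an assumption of this example (the text only says that $\nu_{n}$ does not depend on $\theta$), as are the helper names.

```python
import math
from fractions import Fraction
from itertools import product

def Z(theta):
    return 1.0 + math.exp(theta)

def p(theta, x):
    return math.exp(theta * x) / Z(theta)

def T_n(xs):
    """Sample mean, kept as an exact Fraction so it can be used as a dict key."""
    return Fraction(sum(xs), len(xs))

def push_forward_of_product(weight, n):
    """Push-forward under T_n of the n-fold product of the measure x -> weight(x) on {0, 1}."""
    out = {}
    for xs in product((0, 1), repeat=n):
        mass = 1.0
        for x in xs:
            mass *= weight(x)
        y = T_n(xs)
        out[y] = out.get(y, 0.0) + mass
    return out

def q(theta, n, y):
    """Density of the natural exponential family with respect to nu_n."""
    return math.exp(n * theta * float(y) - n * math.log(Z(theta)))

theta, n = 0.5, 4
Q_n = push_forward_of_product(lambda x: p(theta, x), n)   # (T_n)_* P^n
nu_n = push_forward_of_product(lambda x: 1.0, n)          # assumed choice of nu_n
max_gap = max(abs(Q_n[y] - q(theta, n, y) * nu_n[y]) for y in Q_n)
print(max_gap)
```

Here `nu_n` assigns a binomial coefficient to each attainable sample mean, so the check amounts to recovering the binomial distribution in exponential-family form.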

Note that even though $\mathcal{M},\mathcal{M}^{2},\mathcal{M}^{3},\dots$ and $\mathcal{N}_{1},\mathcal{N}_{2},\mathcal{N}_{3},\dots$ are families of measures on different spaces (namely, $\mathcal{X},\mathcal{X}^{2},\mathcal{X}^{3},\dots$ and $\mathbb{R}^{d},\mathbb{R}^{d},\mathbb{R}^{d},\dots$ , respectively), they are all parameterised by $\Theta\subseteq\mathbb{R}^{d}$ so they are all $d$ -dimensional families of measures.

4 Invariance and regularity conditions

Let $\mathcal{M}$ , $\mathcal{M}^{n}$ and $\mathcal{N}_{n}$ be as in Section 3 and suppose now that these spaces have been equipped with Riemannian metrics $g$ , $g^{n}$ and $g_{n}$ , respectively. In this section, we will give precise conditions that formalize the notion of these metrics being invariant under IID extensions and canonical sufficient statistics, as well as giving a mild regularity condition. These conditions will then be used in Section 5 to prove our main theorem. See Section 4.4 for a number of remarks about these assumptions.

Assumptions.

We make the following assumptions, which are described precisely in the subsections below:

A1. The metrics $g$ and $g^{n}$ are invariant under IID extensions (up to a factor of $n$).

A2. The metrics $g^{n}$ and $g_{n}$ are invariant under canonical sufficient statistics.

A3. The norms corresponding to the metrics $g_{n}$ can all be calculated by a function that satisfies a weak continuity condition.

4.1 A1: Invariance under IID extensions

Let $IID_{n}:\mathcal{M}\to\mathcal{M}^{n}$ be the function which maps each $P\in\mathcal{M}$ to the product measure $P^{n}=P\times\dots\times P$ (see Section 3.2). Then our first assumption is that this map is an isometry (i.e., distance-preserving map) up to a factor of $n$ .

More precisely, let $u=(\theta,a)\in T\Theta$ be any tangent vector to $\Theta$ , as in Section 2. Then similarly to (2), $u$ corresponds under the smooth parameterisation (5) to a tangent vector $\tilde{u}^{n}$ to $\mathcal{M}^{n}$ , where $\tilde{u}^{n}=(P^{n},A^{(n)})$ , $P^{n}=p_{\theta}^{(n)}\mu^{n}$ and $A^{(n)}=\sum_{i=1}^{d}a_{i}(\partial p_{\theta}^{(n)}/\partial\theta_{i})\mu^{n}$ . Let $T\mathcal{M}^{n}=\{\tilde{u}^{n}\mid u\in T\Theta\}$ be the set of all such tangent vectors to $\mathcal{M}^{n}$ . Then our first assumption is that

$$g^{n}(\tilde{u}^{n},\tilde{v}^{n})=n\,g(\tilde{u},\tilde{v})\tag{9}$$

for all tangent vectors $u,v\in T\Theta$ with the same base-point. Here, $\tilde{v}$ and $\tilde{v}^{n}$ are the tangent vectors to $\mathcal{M}$ and $\mathcal{M}^{n}$ (respectively) corresponding to $v\in T\Theta$ , as for $u$ above. Note that (9) just says that $g^{n}=ng$ under the identification of $\mathcal{M}$ with $\mathcal{M}^{n}$ via $IID_{n}$ .

The Fisher information metric is invariant under IID extensions in the sense of (9) by [1, eq. 4.2], so assumptions (A1)–(A3) cannot characterize the Fisher information metric unless the factor of $n$ is included in (9) (though see Remark 4).
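The factor of $n$ in assumption (A1) can be seen directly for the Fisher information metric in a small example. The sketch below compares the Fisher metric on a Bernoulli family with the Fisher metric on its $n$-fold IID extension, computed by enumerating $\mathcal{X}^{n}$. It uses the fact that, for a Bernoulli family in its natural parameter, the score at $x$ is $x-s$ with $s$ the success probability; the example family and the function names are illustrative assumptions.

```python
import math
from itertools import product

def success_prob(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def fisher_single(theta, a, b):
    """Fisher metric on M: sum of (dA/dP)(dB/dP) P over {0, 1}."""
    s = success_prob(theta)
    P = {0: 1.0 - s, 1: s}
    return sum(a * (x - s) * b * (x - s) * P[x] for x in P)

def fisher_iid(theta, a, b, n):
    """Fisher metric on M^n: same formula, summed over all of {0, 1}^n."""
    s = success_prob(theta)
    total = 0.0
    for xs in product((0, 1), repeat=n):
        mass = math.prod(s if x else 1.0 - s for x in xs)   # P^n({xs})
        score = sum(x - s for x in xs)                      # d/dtheta of log p^(n)
        total += a * score * b * score * mass
    return total

theta, a, b, n = 0.2, 1.5, -0.5, 5
lhs = fisher_iid(theta, a, b, n)
rhs = n * fisher_single(theta, a, b)
print(lhs, rhs)
```

The two values should coincide up to rounding, which is the statement $g^{n}=ng$ for this particular metric and family.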

4.2 A2: Invariance under canonical sufficient statistics

Let $T_{n}:\mathcal{X}^{n}\to\mathbb{R}^{d}$ be the canonical sufficient statistic from Section 3.2 and let ${T_{n}}_{*}:\mathcal{M}^{n}\to\mathcal{N}_{n}$ be the corresponding (measure-theoretic) push-forward map of $T_{n}$ , see Section 3.3. Then our second assumption is that this map ${T_{n}}_{*}$ is an isometry (and that all other canonical sufficient statistics are isometries, in a sense which will be made precise in Section 4.3).

More precisely, let $u=(\theta,a)\in T\Theta$ be any tangent vector to $\Theta$ , as in Section 2. Then similarly to (2), $u$ corresponds under the smooth parameterisation (8) to a tangent vector $\tilde{u}_{n}=(Q_{n},A_{n})$ to $\mathcal{N}_{n}$ , where

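$$Q_{n}=q_{\theta}^{n}\nu_{n}\quad\text{and}\quad A_{n}=\sum_{i=1}^{d}a_{i}(\partial q_{\theta}^{n}/\partial\theta_{i})\nu_{n}.\tag{10}$$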

Let $T\mathcal{N}_{n}=\{\tilde{u}_{n}\mid u\in T\Theta\}$ be the set of all such tangent vectors. Then our second assumption is that

$$g_{n}(\tilde{u}_{n},\tilde{v}_{n})=g^{n}(\tilde{u}^{n},\tilde{v}^{n})\tag{11}$$

for all tangent vectors $u,v\in T\Theta$ with the same base-point. Here, $\tilde{v}^{n}$ and $\tilde{v}_{n}$ are the tangent vectors to $\mathcal{M}^{n}$ and $\mathcal{N}_{n}$ (respectively) corresponding to $v\in T\Theta$ , as for $u$ above. Note that (11) just says that $g_{n}=g^{n}$ under the identification of $\mathcal{M}^{n}$ with $\mathcal{N}_{n}$ via ${T_{n}}_{*}$ .
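For the Fisher information metric, the isometry in assumption (A2) can be checked by hand in the Bernoulli case, where $\mathcal{N}_{n}$ consists of distributions of sample means. The following sketch computes the Fisher metric once on $\mathcal{M}^{n}$ (summing over $\{0,1\}^{n}$) and once on $\mathcal{N}_{n}$ (summing over the $n+1$ attainable means) and compares the results. The family, parameter values and function names are hypothetical choices for illustration.

```python
import math
from itertools import product

def success_prob(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def fisher_on_M_n(theta, a, b, n):
    """Fisher metric on the IID extension M^n, by enumeration of {0, 1}^n."""
    s = success_prob(theta)
    total = 0.0
    for xs in product((0, 1), repeat=n):
        mass = math.prod(s if x else 1.0 - s for x in xs)
        score = sum(x - s for x in xs)
        total += a * score * b * score * mass
    return total

def fisher_on_N_n(theta, a, b, n):
    """Fisher metric on N_n, whose elements are distributions of the sample mean y = k/n."""
    s = success_prob(theta)
    total = 0.0
    for k in range(n + 1):
        y = k / n
        mass = math.comb(n, k) * s ** k * (1.0 - s) ** (n - k)   # Q_n({y})
        score = n * (y - s)   # d/dtheta of log q_theta^n(y), since dlogZ/dtheta = s
        total += a * score * b * score * mass
    return total

theta, a, b, n = -0.4, 2.0, 0.75, 6
on_M_n = fisher_on_M_n(theta, a, b, n)
on_N_n = fisher_on_N_n(theta, a, b, n)
print(on_M_n, on_N_n)
```

Agreement of the two numbers reflects the sufficiency of $T_{n}$: no Fisher information is lost by passing from the full sample to its mean.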

4.3 A3: Calculability of norms by a function that satisfies a weak continuity condition

Let $h$ be the norm corresponding to $g$ , so $h(\tilde{u})=\sqrt{g(\tilde{u},\tilde{u})}$ for any $\tilde{u}\in T\mathcal{M}$ . Note that $h$ determines $g$ by the polarisation formula,

$$g(\tilde{u},\tilde{v})=\left[h^{2}(\tilde{u}+\tilde{v})-h^{2}(\tilde{u}-\tilde{v})\right]/4$$

for any $\tilde{u},\tilde{v}\in T\mathcal{M}$ with the same base-point (which follows from the bilinearity of $g$ ), so any question about $g$ can be phrased in terms of $h$ . However, it will be more convenient to work with $h$ than $g$ , because $h$ is a function defined on $T\mathcal{M}$ , whereas $g$ is only defined on certain pairs of tangent vectors (those with the same base-point). Similarly, let $h_{n}$ be the norm corresponding to $g_{n}$ , so $h_{n}(\tilde{u}_{n})=\sqrt{g_{n}(\tilde{u}_{n},\tilde{u}_{n})}$ for any $\tilde{u}_{n}\in T\mathcal{N}_{n}$ .
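The claim that the norm determines the metric can be illustrated on a single tangent space, where an inner product is represented by a symmetric positive definite matrix. The sketch below recovers $g(u,v)$ from the norm alone via the polarisation formula $[h^{2}(u+v)-h^{2}(u-v)]/4$; the matrix `G` and the test vectors are arbitrary, hypothetical values.

```python
import math

# Hypothetical symmetric positive definite matrix representing g at one base-point.
G = [[2.0, 0.5],
     [0.5, 1.0]]

def g(a, b):
    """Inner product a^T G b on a two-dimensional tangent space."""
    return sum(a[i] * G[i][j] * b[j] for i in range(2) for j in range(2))

def h(a):
    """Norm corresponding to g."""
    return math.sqrt(g(a, a))

def g_from_h(a, b):
    """Polarisation: recover g(a, b) using only the norm h."""
    plus = [x + y for x, y in zip(a, b)]
    minus = [x - y for x, y in zip(a, b)]
    return (h(plus) ** 2 - h(minus) ** 2) / 4.0

a, b = [1.0, -2.0], [0.5, 3.0]
print(g(a, b), g_from_h(a, b))
```

Both printed values should agree up to rounding, for any choice of `a` and `b`.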

Let $\mathcal{T}^{\prime}$ be the set of all pairs $(P,A)$, where $P$ is a probability measure on $\mathbb{R}^{d}$ and $A$ is a signed measure on $\mathbb{R}^{d}$, and note that $T\mathcal{N}_{n}\subseteq\mathcal{T}^{\prime}$ for every $n$. Then our regularity condition (A3) is, firstly, that there is a subset $\mathcal{T}$ of $\mathcal{T}^{\prime}$ and a function $H:\mathcal{T}\to\mathbb{R}$ so that, for each $n$, $T\mathcal{N}_{n}\subseteq\mathcal{T}$ (i.e., $H$ is defined on each $T\mathcal{N}_{n}$) and

$$h_{n}(\tilde{u}_{n})=H(\tilde{u}_{n})\tag{12}$$

for every $\tilde{u}_{n}\in T\mathcal{N}_{n}$ . In other words, we assume that there is some function $H$ whose restriction to each $T\mathcal{N}_{n}$ is the norm $h_{n}$ . For instance, we could take $\mathcal{T}=\cup_{n=1}^{\infty}T\mathcal{N}_{n}$ and then define $H$ by the requirement that (12) holds, which gives a well-defined $H$ whenever the functions $h_{n}$ agree on any overlaps between the spaces $T\mathcal{N}_{n}$ .

Further, we assume that $H$ has the following weak continuity property. Firstly, we require that $H$ is defined on all pairs of the form $(\Phi,f\Phi)$ , where $\Phi$ is the probability measure for the standard normal distribution on $\mathbb{R}^{d}$ and $f:\mathbb{R}^{d}\to\mathbb{R}$ is a linear function (with $f(0)=0$ ). Secondly, we require that

$$H(P_{n},fP_{n})=H(\Phi,f\Phi)$$

for any sequence $P_{n}$ of probability measures on $\mathbb{R}^{d}$ for which $H(P_{n},fP_{n})$ is constant in $n$, $P_{n}\Rightarrow\Phi$ and each $P_{n}$ is standardized (i.e., $P_{n}$ has zero mean and identity variance-covariance matrix). Here $H(P_{n},fP_{n})$ is the value of the function $H$ at $(P_{n},fP_{n})\in\mathcal{T}$ and $P_{n}\Rightarrow\Phi$ means that $P_{n}$ converges to $\Phi$ in the sense of the weak convergence of measures [13, Def. 1.2.1]. This condition is an extremely weak form of continuity, see Remark 1.

Lastly, as a consequence of our assumption (A2) that the metrics should be invariant under all canonical sufficient statistics, we assume that $H$ is affine invariant (see Remark 6). Here, an invertible affine transformation of $\mathbb{R}^{d}$ is a map $L:\mathbb{R}^{d}\to\mathbb{R}^{d}$ of the form $L(x)=Mx+c$ for some invertible $d\times d$ matrix $M$ and some $c\in\mathbb{R}^{d}$ . The push-forward $L_{*}A$ of any signed measure $A$ on $\mathbb{R}^{d}$ is defined in a similar way to the push-forward of an (unsigned) measure, see (6). We define the push-forward $L_{**}(P,A)$ of any $(P,A)\in\mathcal{T}$ to be $L_{**}(P,A)=(L_{*}P,L_{*}A)$ . (In this notation, $L_{*}$ is the measure-theoretic push-forward, which is a map from the space of signed measures on $\mathbb{R}^{d}$ to itself, and $L_{**}$ is the differential of this map if $(P,A)$ is interpreted as a tangent vector.) Then our condition that $H$ is affine invariant means that $L_{**}(P,A)\in\mathcal{T}$ and

$$H(L_{**}(P,A))=H(P,A)$$

for every $(P,A)\in\mathcal{T}$ and every invertible affine transformation $L$ of $\mathbb{R}^{d}$ .

For future reference, we note that if $L$ is an invertible affine transformation, $P$ is a probability measure and $f$ is a $P$ -integrable, real-valued function then

$$L_{*}(fP)=(f\circ L^{-1})L_{*}P$$

by the change of variables formula [8, Thm. 3.6.1].
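The change-of-variables identity $L_{*}(fP)=(f\circ L^{-1})L_{*}P$ is easy to verify for a measure with finite support, where a push-forward simply relocates point masses. The sketch below does this in dimension one with exact rational arithmetic; the measure `P`, the function `f` and the affine map `L` are hypothetical examples.

```python
from fractions import Fraction

def push_forward(measure, phi):
    """Push-forward of a finitely supported (possibly signed) measure {point: mass}."""
    out = {}
    for x, mass in measure.items():
        y = phi(x)
        out[y] = out.get(y, Fraction(0)) + mass
    return out

def L(x):
    """Invertible affine map L(x) = 2x + 1 on R."""
    return 2 * x + 1

def L_inv(y):
    return (y - 1) / 2

def f(x):
    """A P-integrable real-valued function (here linear)."""
    return 3 * x

# Probability measure P supported on three points of R.
P = {Fraction(-1): Fraction(1, 4), Fraction(0): Fraction(1, 4), Fraction(2): Fraction(1, 2)}

fP = {x: f(x) * mass for x, mass in P.items()}   # signed measure f P
lhs = push_forward(fP, L)                        # L_*(f P)
rhs = {y: f(L_inv(y)) * mass                     # (f o L^{-1}) L_* P
       for y, mass in push_forward(P, L).items()}
print(lhs == rhs)
```

Because `L` is injective, no point masses merge under the push-forward, so the two sides can be compared point by point.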

4.4 Remarks on the assumptions

Remark 1.

Assumptions (A1) and (A2) say that the metrics on $\mathcal{M}$ , $\mathcal{M}^{n}$ and $\mathcal{N}_{n}$ are invariant under a countable set of transformations and, in a certain sense, under the finite-dimensional group of affine transformations of $\mathbb{R}^{d}$ . The third assumption (A3) is an extremely weak form of continuity. Firstly, this condition says that the norms $h_{n}$ agree on any overlaps between the spaces $T\mathcal{N}_{n}$ , so that these functions can be pieced together into a single function $H$ . Secondly, this condition says that if $f$ is linear and $P_{n}\Rightarrow\Phi$ is a sequence for which $(P_{n},fP_{n})$ all have the same norms then this shared norm must be $H(\Phi,f\Phi)$ . By comparison, full continuity of $H$ would require that $\lim_{n\to\infty}H(P_{n},f_{n}P_{n})=H(P,fP)$ for every sequence $(P_{n},f_{n}P_{n})$ in $\mathcal{T}$ that converges to $(P,fP)$ (with respect to some notion of convergence). So our third assumption is the condition for the continuity of $H$ in the very special case where $P=\Phi$ , $H(P_{n},f_{n}P_{n})$ is constant in $n$ , $f_{n}=f$ for every $n$ and $f$ is a linear function.

Remark 2.

Recent versions of Chentsov’s theorem [2, 7] consider metrics on infinite-dimensional statistical models that are invariant under infinite-dimensional sets of transformations. This infinite dimensionality introduces technical complications and it makes strong assumptions about both the space on which the metric is defined and its symmetries. By contrast, our approach allows us to only consider metrics on a collection of finite-dimensional models, as in the original version of Chentsov’s theorem [10, 16, 9]. This allows our characterisation of the Fisher information metric to be relatively free from technicalities and it allows us to make relatively weak invariance and regularity assumptions.

Remark 3.

It is not hard to see that the Fisher information metric satisfies assumptions (A1)–(A3). For it is well known that the Fisher information metric is invariant under both IID extensions (in the sense of (9)) and sufficient statistics [1, eq. 4.2 and Thm. 2.1]. Also, given any probability measure $P$ on $\mathbb{R}^{d}$ , let $\mathcal{T}_{P}=\{(P,fP)\mid f\in L^{2}(\mathbb{R}^{d},P)\}$ , and let $\mathcal{T}$ be the union of these spaces $\mathcal{T}_{P}$ as $P$ ranges over the set of all probability measures on $\mathbb{R}^{d}$ . Then by (3), the Fisher information norm $H^{F}(P,fP)$ of any $(P,fP)\in\mathcal{T}$ is just the $L^{2}(\mathbb{R}^{d},P)$ -norm of $f$ . So if $f$ is a linear function on $\mathbb{R}^{d}$ , say $f(y)=c\cdot y$ for some $c\in\mathbb{R}^{d}$ , and $Q$ is any standardized probability measure on $\mathbb{R}^{d}$ then

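$$H^{F}(Q,fQ)=\left(\int(c\cdot y)^{2}\,dQ(y)\right)^{1/2}=\left(c^{T}\Big(\int yy^{T}\,dQ(y)\Big)c\right)^{1/2}=\sqrt{c^{T}c}=\|c\|,$$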

where $\|c\|$ is the Euclidean norm of $c\in\mathbb{R}^{d}$ . So for any sequence $P_{n}$ of standardized probability measures (whether weakly convergent to $\Phi$ or not), $H^{F}(P_{n},fP_{n})=\|c\|=H^{F}(\Phi,f\Phi)$ , so $H^{F}$ satisfies the weak continuity condition (13). Lastly, this function $H^{F}$ is affine invariant (14) by the change of variables formula (15).
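The computation in Remark 3 can be illustrated numerically. The following sketch evaluates the $L^{2}(\mathbb{R}^{d},Q)$-norm of $f(y)=c\cdot y$ for one particular standardized measure $Q$ that is far from normal, namely independent fair $\pm 1$ coordinates on $\mathbb{R}^{2}$; this choice of $Q$ and of $c$ is an assumption made purely for illustration.

```python
import math
from itertools import product

def fisher_norm_linear(c, support, probs):
    """L2(Q)-norm of f(y) = c . y for the discrete measure Q = sum_k probs[k] * delta(support[k])."""
    second_moment = sum(
        p * sum(ci * yi for ci, yi in zip(c, y)) ** 2
        for y, p in zip(support, probs)
    )
    return math.sqrt(second_moment)

# Hypothetical standardized measure on R^2: zero mean, identity variance-covariance matrix.
support = list(product([-1.0, 1.0], repeat=2))
probs = [0.25] * 4
c = (3.0, 4.0)

norm_Q = fisher_norm_linear(c, support, probs)
euclidean = math.hypot(*c)
print(norm_Q, euclidean)
```

The two printed numbers agree even though $Q$ is discrete, which reflects the observation that only the first two moments of $Q$ enter the Fisher information norm of a linear function.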

Remark 4.

In some ways the factor of $n$ in (9) is not essential, since we could instead formulate our assumptions and theorems in terms of the metrics $\dot{g}^{n}=g^{n}/n$ and $\dot{g}_{n}=g_{n}/n$ , in which case (9) would be equivalent to the equation that describes exact invariance under the map $IID_{n}$ , rather than invariance up to a factor of $n$ (though $H$ as in (12) might not exist without the factor of $n$ ). However, it is natural to include the factor of $n$ in our formulation of IID invariance, firstly because the Fisher information metric is IID invariant in the sense of (9) [1, eq. 4.2], so assumptions (A1)–(A3) would not characterise the Fisher information metric without this factor, and secondly because the factor of $n$ arises from a natural construction from differential geometry (see Remark 5).

Remark 5.

Given an arbitrary Riemannian metric $g$ on $\mathcal{M}$ , a natural construction from differential geometry gives a metric on the $n$ -fold IID extension $\mathcal{M}^{n}$ of $\mathcal{M}$ equal to the metric $g^{n}$ satisfying (9), as follows. The Cartesian product $\prod^{n}\mathcal{M}$ of $\mathcal{M}$ with itself $n$ times is the space whose points are $n$ -tuples $(P_{1},\dots,P_{n})$ of measures $P_{1},\dots,P_{n}\in\mathcal{M}$ on $\mathcal{X}$ . Given such an $n$ -tuple, there is a corresponding product measure $P_{1}\times\dots\times P_{n}$ on $\mathcal{X}^{n}$ , and conversely we can recover each $P_{i}$ from $P_{1}\times\dots\times P_{n}$ by marginalizing, so we can identify $(P_{1},\dots,P_{n})$ with the product measure $P_{1}\times\dots\times P_{n}$ on $\mathcal{X}^{n}$ . This product measure is the joint distribution of independent random variables $X_{1},\dots,X_{n}$ whose marginal distributions are $P_{1},\dots,P_{n}$ , respectively. So if $(P_{1},\dots,P_{n})\in\prod^{n}\mathcal{M}$ satisfies $P_{1}=\dots=P_{n}$ then $P_{1}\times\dots\times P_{n}$ is the joint distribution of IID random variables $X_{1},\dots,X_{n}$ . Therefore we can identify the diagonal

$$\Delta=\Big\{(P_{1},\dots,P_{n})\in\prod\nolimits^{n}\mathcal{M}\ \Big|\ P_{1}=\dots=P_{n}\Big\}$$

of $\prod^{n}\mathcal{M}$ with the $n$ -fold IID extension $\mathcal{M}^{n}$ of $\mathcal{M}$ . But a Riemannian metric on $\mathcal{M}$ induces a Riemannian metric on the Cartesian product $\prod^{n}\mathcal{M}$ , and then $\Delta$ inherits a metric from its super-manifold $\prod^{n}\mathcal{M}$ . Under the above identification between $\Delta$ and $\mathcal{M}^{n}$ , this metric is the metric $g^{n}$ on $\mathcal{M}^{n}$ that satisfies (9).
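The factor of $n$ in Remark 5 can be seen in a few lines. This is a minimal sketch, assuming the product metric on $\prod^{n}\mathcal{M}$ is the sum of $g$ over the $n$ factors and that a tangent vector to the diagonal is an $n$-tuple of identical tangent vectors; the matrix `G` and the vectors `a`, `b` are hypothetical coordinates at a single point.

```python
def g(G, a, b):
    """Bilinear form a^T G b for a d x d matrix G given as nested lists."""
    return sum(a[i] * G[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def product_metric(G, tuple_a, tuple_b):
    """Product metric on the n-fold Cartesian product: the sum of g over the factors."""
    return sum(g(G, a, b) for a, b in zip(tuple_a, tuple_b))

G = [[2.0, 0.5], [0.5, 1.0]]     # hypothetical metric at one point of M
a, b = [1.0, -2.0], [0.5, 3.0]   # hypothetical tangent vectors at that point
n = 7

# Tangent vectors to the diagonal repeat the same vector in every factor.
diag_value = product_metric(G, [a] * n, [b] * n)
print(diag_value, n * g(G, a, b))
```

Restricting the product metric to the diagonal therefore multiplies $g$ by $n$, which is the scaling that appears in the IID invariance condition.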

Remark 6.

The canonical sufficient statistics for an exponential family are only unique up to affine transformations [3, Lemma 8.1], meaning that if $L$ is an invertible affine transformation of $\mathbb{R}^{d}$ and $T_{n}:\mathcal{X}^{n}\to\mathbb{R}^{d}$ is a canonical sufficient statistic then $L\circ T_{n}$ is also a canonical sufficient statistic (and every canonical sufficient statistic is of this form). Replacing $T_{n}$ by $L\circ T_{n}$ effectively replaces each tangent vector $\tilde{u}_{n}\in T\mathcal{N}_{n}$ by $L_{**}\tilde{u}_{n}$ , so (11), (12) and the analogous equations for $L\circ T_{n}$ imply $H(L_{**}\tilde{u}_{n})=H(\tilde{u}_{n})$ for every $\tilde{u}_{n}\in T\mathcal{N}_{n}$ . So since $L$ is arbitrary, $H$ is affine invariant.

5 The main theorem

We can now prove our version of Chentsov’s theorem. This theorem characterises the Fisher information metric as the only metric (up to rescaling) on an exponential family that is invariant under IID extensions and canonical sufficient statistics.

Let $g^{F}$ , $g^{nF}$ and $g_{n}^{F}$ be the Fisher information metrics on $\mathcal{M}$ , $\mathcal{M}^{n}$ and $\mathcal{N}_{n}$ , respectively.

Theorem 1.

Suppose that assumptions (A1)–(A3) of Section 4 hold. Then there is some $c>0$ so that $g=cg^{F}$ , $g^{n}=cg^{nF}$ and $g_{n}=cg_{n}^{F}$ for every integer $n\geq 1$ .

Proof.

Let any integer $n\geq 1$ and any $\theta\in\Theta$ be given, and let $Q_{1}=q_{\theta}\nu_{1}\in\mathcal{N}_{1}$ and $Q_{n}=q_{\theta}^{n}\nu_{n}\in\mathcal{N}_{n}$ be the corresponding distributions in $\mathcal{N}_{1}$ and $\mathcal{N}_{n}$ . By Theorem 2.2.6 of [12] and the comments preceding it, if $Y_{1},\dots,Y_{n}$ are independent random variables all distributed according to $Q_{1}$ then their mean is distributed as $Q_{n}$ , which we write as

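$$Y_{1},\dots,Y_{n}\overset{\text{IID}}{\sim}Q_{1}\quad\Longrightarrow\quad\frac{1}{n}(Y_{1}+\dots+Y_{n})\sim Q_{n}.\tag{16}$$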

Alternatively, it is not hard to prove (16), since if $X_{1},\dots,X_{n}\sim p_{\theta}\mu$ are IID and $Y_{i}^{\prime}=T(X_{i})$ then $Y_{1}^{\prime},\dots,Y_{n}^{\prime}\sim Q_{1}$ are IID and $(Y_{1}^{\prime}+\dots+Y_{n}^{\prime})/n=T_{n}(X_{1},\dots,X_{n})\sim Q_{n}$ , by (7) and since $Q_{1}=T_{*}P$ and $Q_{n}=T_{n*}P^{n}$ (by definition), where $P=p_{\theta}\mu$ . This proves (16) because $Y_{1},\dots,Y_{n}$ and $Y_{1}^{\prime},\dots,Y_{n}^{\prime}$ have the same joint distribution so their means have the same distribution, by another application of (7).

By (16), the mean $\tau_{\theta}$ for $Q_{1}$ is the same as that for $Q_{n}$ , i.e.

$$\tau_{\theta}=\int y\,dQ_{1}(y)=\int y\,dQ_{n}(y),\tag{17}$$

and the variance-covariance matrix $\Sigma_{\theta}$ for $Q_{1}$ is $n$ times that for $Q_{n}$ , i.e.

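$$\Sigma_{\theta}=\int(y-\tau_{\theta})(y-\tau_{\theta})^{T}\,dQ_{1}(y)=n\int(y-\tau_{\theta})(y-\tau_{\theta})^{T}\,dQ_{n}(y).\tag{18}$$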
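The relations between the moments of $Q_{1}$ and $Q_{n}$ can be checked exactly in a simple case. The sketch below assumes a Bernoulli family with sufficient statistic $T(x)=x$, so that $Q_{1}$ is the Bernoulli distribution and $Q_{n}$ is the distribution of the mean of $n$ IID Bernoulli variables; the value of `p` is an arbitrary illustrative choice.

```python
from fractions import Fraction
from math import comb

def mean_distribution(p, n):
    """Distribution Q_n of the mean of n IID Bernoulli(p) variables, as {value: probability}."""
    return {Fraction(k, n): comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def mean_and_variance(dist):
    """Mean and variance of a discrete distribution on the real line."""
    m = sum(y * w for y, w in dist.items())
    v = sum((y - m) ** 2 * w for y, w in dist.items())
    return m, v

p = Fraction(1, 3)          # hypothetical success probability
n = 10

tau_1, sigma_1 = mean_and_variance(mean_distribution(p, 1))
tau_n, var_n = mean_and_variance(mean_distribution(p, n))
print(tau_1 == tau_n, sigma_1 == n * var_n)
```

Both comparisons hold exactly in rational arithmetic: the means coincide and the variance of $Q_{1}$ is $n$ times that of $Q_{n}$.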

Now, let $u=(\theta,a)\in T_{\theta}\Theta$ be any tangent vector to $\Theta$ at $\theta$ , and define $f:\mathbb{R}^{d}\to\mathbb{R}$ by $f(y)=(\Sigma_{\theta}^{1/2}a)\cdot y$ for any $y\in\mathbb{R}^{d}$ . Here, $\Sigma_{\theta}^{1/2}$ is defined in the standard way via a diagonalisation of the symmetric, positive-definite matrix $\Sigma_{\theta}$ . As before, let $\tilde{u}$ and $\tilde{u}_{n}$ , respectively, be the tangents to $\mathcal{M}$ and $\mathcal{N}_{n}$ that correspond to $u$ under the parameterisations (5) and (8).

Claim 1: $h(\tilde{u})=H(\Phi,f\Phi)$ . By (8), (10) and the fact that $\tau_{\theta}$ is the gradient of $\log Z$ at $\theta$ [12, Thm. 2.2.1], $\tilde{u}_{n}=(Q_{n},A_{n})$ with $Q_{n}=q_{\theta}^{n}\nu_{n}$ and

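$$A_{n}=n\sum_{i=1}^{d}a_{i}\big(\iota_{i}-(\tau_{\theta})_{i}\big)Q_{n}=n\,a\cdot(\iota-\tau_{\theta})\,Q_{n},\tag{19}$$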

where $\iota_{i}(y)=y_{i}$ and $\iota(y)=y$ for any $y\in\mathbb{R}^{d}$ .

Let $L$ be the affine transformation on $\mathbb{R}^{d}$ given by $L(y)=\sqrt{n}\Sigma_{\theta}^{-1/2}(y-\tau_{\theta})$ , and note that $\Sigma_{\theta}^{-1/2}$ exists because $\Sigma_{\theta}$ is positive-definite. By (17) and (18), this choice of $L$ ensures that $L_{*}Q_{n}$ is standardised, i.e., that $L_{*}Q_{n}$ has mean $0$ and variance-covariance matrix equal to the $d\times d$ identity matrix. Note that $L$ depends on $n$ , so we could instead write this as $L_{n}$ , but for notational simplicity we will drop the subscript. Then by (15) and (19),

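$$L_{*}A_{n}=\big(n\,a\cdot(\iota-\tau_{\theta})\circ L^{-1}\big)\,L_{*}Q_{n}=\sqrt{n}\,f\,L_{*}Q_{n},\tag{20}$$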

where $f$ is as in the statement of the claim.

So recalling the notation $L_{**}\tilde{u}_{n}=L_{**}(Q_{n},A_{n})=(L_{*}Q_{n},L_{*}A_{n})$ , we have

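$$h(\tilde{u})=\frac{1}{\sqrt{n}}h_{n}(\tilde{u}_{n})=\frac{1}{\sqrt{n}}H(\tilde{u}_{n})=\frac{1}{\sqrt{n}}H(L_{**}\tilde{u}_{n})=\frac{1}{\sqrt{n}}H(L_{*}Q_{n},\sqrt{n}\,fL_{*}Q_{n})=H(L_{*}Q_{n},fL_{*}Q_{n}).\tag{21}$$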

By (16), the central limit theorem (e.g. see [13, Cor. 8.1.10]) and the fact that $L_{*}Q_{n}$ is standardised, $L_{*}Q_{n}\Rightarrow\Phi$ . Therefore,

$$h(\tilde{u})=\lim_{n\to\infty}H(L_{*}Q_{n},fL_{*}Q_{n})=H(\Phi,f\Phi),\tag{22}$$

so the claim is proved.
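The convergence $L_{*}Q_{n}\Rightarrow\Phi$ used in Claim 1 can be observed numerically in dimension one. This is an illustrative sketch, assuming a Bernoulli family with $T(x)=x$ (so $\tau_{\theta}=p$ and $\Sigma_{\theta}=p(1-p)$); it measures the Kolmogorov distance between the standardised distribution of the mean and the standard normal. The function names and the value $p=0.3$ are hypothetical.

```python
import math

def standardised_atoms(p, n):
    """Atoms and weights of L_* Q_n for Bernoulli(p), with L(y) = sqrt(n) * (y - tau) / sqrt(sigma)."""
    tau, sigma = p, p * (1 - p)
    atoms = [math.sqrt(n) * (k / n - tau) / math.sqrt(sigma) for k in range(n + 1)]
    weights = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    return atoms, weights

def normal_cdf(z):
    """Distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def kolmogorov_distance_to_normal(p, n):
    """sup_z |F_n(z) - Phi(z)|, checked just before and at each atom of L_* Q_n."""
    atoms, weights = standardised_atoms(p, n)
    cdf, worst = 0.0, 0.0
    for z, w in zip(atoms, weights):
        worst = max(worst, abs(cdf - normal_cdf(z)))   # left limit at the atom
        cdf += w
        worst = max(worst, abs(cdf - normal_cdf(z)))   # value at the atom
    return worst

distances = [kolmogorov_distance_to_normal(0.3, n) for n in (10, 100, 1000)]
print(distances)
```

The distances shrink as $n$ grows, while each $L_{*}Q_{n}$ has exactly zero mean and unit variance, which is the situation covered by the weak continuity condition on $H$.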

Now, let $v=(\phi,b)\in T\Theta$ be any tangent vector to $\Theta$ , not necessarily with the same base-point as $u$ , and let $\tilde{v}\in T\mathcal{M}$ be the corresponding tangent vector to $\mathcal{M}$ .

Claim 2: $a^{T}\Sigma_{\theta}a=b^{T}\Sigma_{\phi}b$ implies $h(\tilde{u})=h(\tilde{v})$ . To prove this, assume that $a^{T}\Sigma_{\theta}a=b^{T}\Sigma_{\phi}b$ , i.e. that $\Sigma_{\theta}^{1/2}a$ and $\Sigma_{\phi}^{1/2}b$ have the same Euclidean norm. Then there exists a $d\times d$ orthogonal matrix $M$ so that

$$M\Sigma_{\theta}^{1/2}a=\Sigma_{\phi}^{1/2}b.\tag{23}$$

Also, $M_{*}\Phi=\Phi$ because $M$ is orthogonal, so

$$M_{**}(\Phi,f\Phi)=(M_{*}\Phi,M_{*}(f\Phi))=(\Phi,e\Phi)$$

by (15), where $e:\mathbb{R}^{d}\to\mathbb{R}$ is given by

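$$e(y)=f(M^{-1}y)=(\Sigma_{\theta}^{1/2}a)\cdot(M^{T}y)=(M\Sigma_{\theta}^{1/2}a)\cdot y=(\Sigma_{\phi}^{1/2}b)\cdot y$$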

for any $y\in\mathbb{R}^{d}$ , by (23) and $M^{-1}=M^{T}$ (since $M$ is orthogonal). So

$$h(\tilde{u})=H(\Phi,f\Phi)=H(M_{**}(\Phi,f\Phi))=H(\Phi,e\Phi)=h(\tilde{v}),$$

which proves Claim 2.
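The existence of the orthogonal matrix $M$ in Claim 2 is constructive. One standard choice, shown here as a sketch and not necessarily the construction the author has in mind, is the Householder reflection that swaps two vectors of equal Euclidean norm. The vectors `x` and `y` below are hypothetical stand-ins for $\Sigma_{\theta}^{1/2}a$ and $\Sigma_{\phi}^{1/2}b$.

```python
def householder(x, y):
    """Orthogonal matrix M (nested lists) with M x = y, assuming x and y have equal Euclidean norm."""
    d = len(x)
    w = [xi - yi for xi, yi in zip(x, y)]
    ww = sum(wi * wi for wi in w)
    if ww == 0.0:
        # x == y, so the identity matrix already works.
        return [[float(i == j) for j in range(d)] for i in range(d)]
    # Reflection in the hyperplane orthogonal to w = x - y.
    return [[float(i == j) - 2 * w[i] * w[j] / ww for j in range(d)] for i in range(d)]

def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

x = [3.0, 4.0, 0.0]   # hypothetical Sigma_theta^{1/2} a
y = [0.0, 0.0, 5.0]   # hypothetical Sigma_phi^{1/2} b, with the same norm as x
M = householder(x, y)
print(matvec(M, x))
```

A reflection is symmetric and equal to its own inverse, so $M^{-1}=M^{T}=M$ here; any other orthogonal matrix sending `x` to `y` would serve equally well in the argument.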

Claim 3: There is some $c>0$ so that $h(\tilde{v})=c\;h^{F}(\tilde{v})$ for all tangent vectors $\tilde{v}\in T\mathcal{M}$ . It is well-known [12, Thms. 2.2.1 and 2.2.5] that the Fisher information metric on the natural parameter space is the variance-covariance matrix of the corresponding sufficient statistic, so $g^{F}(\tilde{u},\tilde{u})=a^{T}\Sigma_{\theta}a$ . Alternatively, this follows easily from setting $n=1$ in (19) and combining this with (3) and the invariance of $g^{F}$ under sufficient statistics [1, Thm. 2.1], since these give

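$$g^{F}(\tilde{u},\tilde{u})=g_{1}^{F}(\tilde{u}_{1},\tilde{u}_{1})=\int\big(a\cdot(y-\tau_{\theta})\big)^{2}\,dQ_{1}(y)=a^{T}\Sigma_{\theta}a,\tag{26}$$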

where $\tilde{u}_{1}\in T\mathcal{N}_{1}$ is the tangent vector to $\mathcal{N}_{1}$ corresponding to $u\in T\Theta$ . So Claim 2 is equivalent to

$$h^{F}(\tilde{u})=h^{F}(\tilde{v})\quad\Longrightarrow\quad h(\tilde{u})=h(\tilde{v})\tag{27}$$

for all tangent vectors $\tilde{u},\tilde{v}\in T\mathcal{M}$ , even if they have different base-points.

Now, fix $\tilde{u}$ to be some non-zero vector with $h^{F}(\tilde{u})=1$ , and let $c=h(\tilde{u})$ . Note that $c>0$ because $g$ is an inner product on each tangent space so the norm of any non-zero tangent vector is strictly positive. Then for any non-zero $\tilde{v}$ , $h^{F}(\tilde{v}/h^{F}(\tilde{v}))=h^{F}(\tilde{v})/h^{F}(\tilde{v})=1$ by the bilinearity of $g^{F}$ . So $h^{F}(\tilde{u})=h^{F}(\tilde{v}/h^{F}(\tilde{v}))$ and hence, by (27), $h(\tilde{u})=h(\tilde{v}/h^{F}(\tilde{v}))$ . Therefore $c=h(\tilde{u})=h(\tilde{v}/h^{F}(\tilde{v}))=h(\tilde{v})/h^{F}(\tilde{v})$ by the bilinearity of $g$ , so rearranging this equation proves the claim for all non-zero tangent vectors $\tilde{v}\in T\mathcal{M}$ . But the claim holds trivially for any zero tangent vector $\tilde{v}$ , since $0=h(\tilde{v})=h^{F}(\tilde{v})$ by the bilinearity of $g$ and $g^{F}$ , so the claim is proved.
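The identity $g^{F}(\tilde{u},\tilde{u})=a^{T}\Sigma_{\theta}a$ used in Claim 3 can be checked numerically on a small example. The sketch below assumes an exponential family on the finite set $\{0,1,2,3\}$ with densities proportional to $\exp(\theta\cdot T(x))$ with respect to counting measure, and compares the variance-covariance matrix of $T$ with a finite-difference Hessian of $\log Z$. The sufficient statistic `T`, the parameter `theta` and the vector `a` are hypothetical.

```python
import math

X = [0, 1, 2, 3]

def T(x):
    """Hypothetical sufficient statistic with d = 2."""
    return (float(x), float(x * x))

def log_Z(theta):
    """Log of the normalising constant Z(theta) = sum_x exp(theta . T(x))."""
    return math.log(sum(math.exp(theta[0] * T(x)[0] + theta[1] * T(x)[1]) for x in X))

def covariance_of_T(theta):
    """Variance-covariance matrix Sigma_theta of T under p_theta."""
    lz = log_Z(theta)
    probs = [math.exp(theta[0] * T(x)[0] + theta[1] * T(x)[1] - lz) for x in X]
    mean = [sum(p * T(x)[i] for p, x in zip(probs, X)) for i in range(2)]
    return [[sum(p * (T(x)[i] - mean[i]) * (T(x)[j] - mean[j]) for p, x in zip(probs, X))
             for j in range(2)] for i in range(2)]

def hessian_log_Z(theta, h=1e-4):
    """Central finite-difference approximation to the Hessian of log Z."""
    def shifted(i, si, j, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return log_Z(t)
    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(2)] for i in range(2)]

def quad(G, a):
    """Quadratic form a^T G a."""
    return sum(a[i] * G[i][j] * a[j] for i in range(2) for j in range(2))

theta = (0.2, -0.3)
a = (1.0, -0.5)
sigma = covariance_of_T(theta)
hess = hessian_log_Z(theta)
print(quad(sigma, a), quad(hess, a))
```

The two quadratic forms agree up to finite-difference error, consistent with the Fisher information in natural parameters being both the Hessian of $\log Z$ and the variance-covariance matrix of the sufficient statistic.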

The theorem now follows from Claim 3 and by (9), (11) and the analogous equations for the Fisher information metrics $g^{F}$ , $g^{nF}$ and $g_{n}^{F}$ , which hold by [1, eq. 4.2 and Thm. 2.1]. ∎

6 Extensions to higher-order symmetric tensors

The proof of Theorem 1 extends with almost no changes to characterise symmetric, order- $k$ tensors $\hat{g}$ and $\hat{g}_{n}$ on $\mathcal{M}$ and $\mathcal{N}_{n}$ , respectively, that satisfy conditions closely analogous to assumptions (A1)–(A3) of Section 4. Given such tensors $\hat{g}_{n}$ , define $\hat{h}_{n}(\tilde{u}_{n})=\sqrt[k]{\hat{g}_{n}(\tilde{u}_{n},\dots,\tilde{u}_{n})}$ , where there are $k$ copies of $\tilde{u}_{n}$ in the right-hand side of this equation. Assume that

[TABLE]

which is a generalisation of (9) from $k=2$ to general $k$ . Then as in the proof of Theorem 1, $\hat{h}_{n}(\tilde{u}_{n})=\sqrt{n}\hat{h}_{1}(\tilde{u}_{1})$ and $\hat{h}_{n}(\alpha\tilde{u}_{n})=\alpha\hat{h}_{n}(\tilde{u}_{n})$ for any $\alpha\geq 0$ (by (28) and the multi-linearity of $\hat{g}_{n}$ ). So with $\hat{h}$ in place of $h$ , the proof of Theorem 1 implies that $\hat{h}(\tilde{u})=c\;h^{F}(\tilde{u})$ for some $c\in\mathbb{R}$ , where $h^{F}$ is the norm of the Fisher information metric. Raising this equation to the power of $k$ gives

$$\hat{g}(\tilde{u},\dots,\tilde{u})=c^{k}\,g^{F}(\tilde{u},\tilde{u})^{k/2}.\tag{29}$$

If $k$ is odd then the left-hand side is an odd function of $\tilde{u}$ (i.e. it changes sign when $\tilde{u}$ is replaced by $-\tilde{u}$ ) while the right-hand side is an even function, which is a contradiction unless both sides vanish, so $c=0$ . If $k$ is even, then since $\hat{g}$ is determined by (29) (by the polarisation formula for symmetric tensors), $\hat{g}$ must be a constant times the symmetric part of $(g^{F})^{k/2}$ . For example, when $k=4$ then there is some $c^{\prime}\in\mathbb{R}$ so that

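$$\hat{g}(\tilde{u},\tilde{v},\tilde{w},\tilde{m})=c^{\prime}\big[g^{F}(\tilde{u},\tilde{v})\,g^{F}(\tilde{w},\tilde{m})+g^{F}(\tilde{u},\tilde{w})\,g^{F}(\tilde{v},\tilde{m})+g^{F}(\tilde{u},\tilde{m})\,g^{F}(\tilde{v},\tilde{w})\big]$$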

for any $\tilde{u},\tilde{v},\tilde{w},\tilde{m}\in T\mathcal{M}$ .
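The $k=4$ case can be made concrete with a short computation. The sketch below assumes that the symmetric part of $(g^{F})^{2}$ is, up to a constant, the sum of $g^{F}\otimes g^{F}$ over the three ways of pairing four vectors; the matrix `G` stands in for $g^{F}$ at one point and all numerical values are hypothetical.

```python
from itertools import permutations

def g(G, u, v):
    """Bilinear form u^T G v for a symmetric matrix G given as nested lists."""
    return sum(u[i] * G[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def sym_square(G, u, v, w, m):
    """Sum of g (x) g over the three pairings of (u, v, w, m)."""
    return (g(G, u, v) * g(G, w, m)
            + g(G, u, w) * g(G, v, m)
            + g(G, u, m) * g(G, v, w))

G = [[2, 1], [1, 3]]                          # hypothetical symmetric positive-definite matrix
vectors = ([1, 0], [0, 1], [1, 1], [2, -1])   # hypothetical tangent vectors

# The tensor takes the same value on every ordering of its four arguments.
values = {sym_square(G, *perm) for perm in permutations(vectors)}

# On the diagonal it reduces to a multiple of g(u, u)^2.
u = [1, -2]
diagonal = sym_square(G, u, u, u, u)
print(len(values), diagonal, 3 * g(G, u, u) ** 2)
```

The diagonal value is $3\,g(u,u)^{2}$, so this tensor is determined by the fourth power of the norm, as the polarisation argument requires.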

Remark 7.

It might also be possible to adapt the proof of Theorem 1 to characterise the higher-order Amari-Chentsov tensors, which are symmetric, order- $k$ tensors that coincide with the Fisher information metric when $k=2$ and in general are given by an equation similar to (3), e.g. see [2, eq. 2.4] for the $k=3$ case. Claim 1 in the proof of Theorem 1 does not seem to hold for these tensors in general. However, if we replace the $k/2$ in (28) by other powers and strengthen the weak continuity condition on $H$ then it might be possible to replace Claim 1 by $\hat{h}(\tilde{u})=H(K\Phi,fK\Phi)$ , where $K$ is an Edgeworth polynomial (see [4] or [12, §4.5]). Then a symmetry argument, similar to the one in the proof of Theorem 1, should give the desired characterisation.

7 Discussion

Our version of Chentsov’s theorem characterises the Fisher information metric as the unique Riemannian metric (up to rescaling) on an exponential family $\mathcal{M}$ which is invariant under IID extensions and canonical sufficient statistics. We proved this by considering metrics $g$ on $\mathcal{M}$ , $g^{n}$ on the $n$ -fold IID extension $\mathcal{M}^{n}$ of $\mathcal{M}$ , and $g_{n}$ on the natural exponential family $\mathcal{N}_{n}$ corresponding to $\mathcal{M}^{n}$ . Then, under the above invariance conditions, $g$ can be calculated in terms of $g_{n}$ , for any $n$ . But for large $n$ , the central limit theorem and a property (16) of exponential families imply that $\mathcal{N}_{n}$ consists of distributions which are all approximately normally distributed, so each distribution in $\mathcal{N}_{n}$ is determined to a good approximation by its mean and variance-covariance matrix. Further, each tangent vector to $\mathcal{N}_{n}$ is essentially a linear function $f$ times a distribution in $\mathcal{N}_{n}$ . Combining these facts shows that (the norm corresponding to) $g$ is approximately equal to a simple function of $f$ and the mean and variance-covariance matrix of the relevant distribution in $\mathcal{N}_{n}$ . Our regularity condition implies that this approximation becomes exact in the limit as $n\to\infty$ . Then our main result follows from an identity (26) relating the variance-covariance matrix to the Fisher information metric on an exponential family.

In general, Chentsov’s theorem characterizes the Fisher information metrics of statistical models as the only Riemannian metrics (up to rescaling) that are invariant under certain, statistically important transformations. Previous studies have taken these transformations to be either all sufficient statistics or a large, regular subset of these. By contrast, we take these statistically important transformations to be the IID extensions and canonical sufficient statistics. This class of transformations is arguably more natural than the class of all sufficient statistics, it is more appropriate for exponential families and it is a relatively small class so our invariance assumptions are weaker than those of previous studies. Our regularity assumptions also appear to be weaker than previous studies, ultimately due to the fact that our approach only requires us to study a collection of finite-dimensional models, rather than an infinite-dimensional model.

We have given a new characterisation of the Fisher information metric on an exponential family and we have shown that this result is an intuitive consequence of the central limit theorem. The main limitation of this paper is that our main result is only proved for exponential families. However, exponential families are an important class of statistical models, being well studied and widely used in applications. Also, our proof treats discrete and continuous models in a uniform way, so there is some hope that our approach can be adapted to give a proof of Chentsov’s theorem for general statistical models. Lastly, our focus on exponential families complements the focus of Bauer et al. [7] on diffeomorphism-invariant metrics, since (curved) exponential families are essentially the only statistical models which have smooth sufficient statistics that are not diffeomorphisms, by the Pitman–Koopman–Darmois theorem [5].

Appendix A The invariant and parameterisation-dependent definitions of the Fisher information metric coincide

This section proves (in the notation of Section 2) that the invariant definition (3) of the Fisher information metric reduces to the usual parameterisation-dependent definition given by (31), below.

Given any tangent vectors $u=(\theta,a)$ and $v=(\theta,b)$ in $T_{\theta}\Theta$ , let $\tilde{u}=(P,A)$ and $\tilde{v}=(P,B)$ be the corresponding tangent vectors in $T_{P}\mathcal{M}$ , where $P=p_{\theta}\mu$ . Then by (2),

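$$P=p_{\theta}\mu\quad\text{and}\quad A=\sum_{i=1}^{d}a_{i}\frac{\partial p_{\theta}}{\partial\theta_{i}}\mu,$$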

so $dA/dP=\sum_{i=1}^{d}a_{i}(\partial/\partial\theta_{i})\log p_{\theta}$ , and similarly for $dB/dP$ . Substituting these into (3) gives

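$$g^{F}(\tilde{u},\tilde{v})=\int\frac{dA}{dP}\frac{dB}{dP}\,dP=\sum_{i,j=1}^{d}a_{i}b_{j}\int\frac{\partial\log p_{\theta}}{\partial\theta_{i}}\frac{\partial\log p_{\theta}}{\partial\theta_{j}}\,p_{\theta}\,d\mu=a^{T}\bar{g}^{F}_{\theta}b,$$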

where $\bar{g}^{F}_{\theta}$ is the $d\times d$ matrix with $(i,j)^{th}$ entry

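$$(\bar{g}^{F}_{\theta})_{ij}=\int\frac{\partial\log p_{\theta}}{\partial\theta_{i}}\frac{\partial\log p_{\theta}}{\partial\theta_{j}}\,p_{\theta}\,d\mu\tag{31}$$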

for any $i,j=1,\dots,d$ . Therefore the invariant definition (3) reduces to the usual, parameterisation-dependent definition (31) for the Fisher information metric [1, eq. 2.6].
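To make the parameterisation-dependent definition concrete, the following sketch evaluates the expected product of scores for a three-point categorical model with $p_{\theta}=(\theta_{1},\theta_{2},1-\theta_{1}-\theta_{2})$ with respect to counting measure. The model, the hand-computed scores and the parameter value are assumptions chosen for illustration; the result is compared with the known closed form $\delta_{ij}/\theta_{i}+1/\theta_{3}$, where $\theta_{3}=1-\theta_{1}-\theta_{2}$.

```python
from fractions import Fraction

def probs(theta):
    """Probabilities p_theta(x) for x = 0, 1, 2."""
    t1, t2 = theta
    return [t1, t2, 1 - t1 - t2]

def scores(theta):
    """scores[x][i] = d/d theta_i of log p_theta(x), computed by hand for this model."""
    t1, t2 = theta
    t3 = 1 - t1 - t2
    return [[1 / t1, 0], [0, 1 / t2], [-1 / t3, -1 / t3]]

def fisher_matrix(theta):
    """Fisher information matrix as the expected product of scores."""
    p, s = probs(theta), scores(theta)
    return [[sum(p[x] * s[x][i] * s[x][j] for x in range(3)) for j in range(2)]
            for i in range(2)]

theta = (Fraction(1, 2), Fraction(1, 3))   # hypothetical parameter value
G = fisher_matrix(theta)

# Closed form for the categorical model: delta_ij / theta_i + 1 / theta_3.
t3 = 1 - theta[0] - theta[1]
closed_form = [[(1 / theta[i] if i == j else 0) + 1 / t3 for j in range(2)] for i in range(2)]
print(G == closed_form)
```

The bilinear form $a^{T}Gb$ built from this matrix is then the value of $g^{F}$ on the tangent vectors with coordinates $a$ and $b$.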

Remark 8.

The metric $\bar{g}^{F}$ on $\Theta$ is just the pull-back of the metric $g^{F}$ on $\mathcal{M}$ via the parameterisation map $\Theta\to\mathcal{M}$ .

Bibliography

[1] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, Providence, 2000.
[2] N. Ay, J. Jost, H. Vân Lê, and L. Schwachhöfer. Information geometry and sufficient statistics. Probab. Theory Relat. Fields, 162:327–364, 2015.
[3] O. Barndorff-Nielsen. Information and Exponential Families. John Wiley & Sons, New York, 1978.
[4] O. Barndorff-Nielsen and D. R. Cox. Edgeworth and saddle-point approximations with statistical applications. Journal of the Royal Statistical Society Series B (Methodological), 41(3):279–312, 1979.
[5] O. Barndorff-Nielsen and K. Pedersen. Sufficient data reduction and exponential families. Math. Scand., 22:197–202, 1968.
[6] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, October 1998.
[7] M. Bauer, M. Bruveris, and P. W. Michor. Uniqueness of the Fisher-Rao metric on the space of smooth densities. Bull. London Math. Soc., 48:499–506, 2016.
[8] V. I. Bogachev. Measure Theory, Volume I. Springer, Berlin, 2007.
