
TL;DR
This paper extends Chentsov's theorem to exponential families, showing the Fisher information metric is uniquely invariant under key statistical transformations, using a unified, less technical proof based on the central limit theorem.
Contribution
It proves a version of Chentsov's theorem for exponential families, characterizing the Fisher information metric as the unique invariant Riemannian metric under specific transformations.
Findings
Fisher information metric is uniquely invariant in exponential families
Unified proof applicable to both discrete and continuous cases
Simplifies previous approaches using the central limit theorem
Chentsov’s theorem for exponential families
James G. Dowty
Abstract
Chentsov’s theorem characterizes the Fisher information metric on statistical models as essentially the only Riemannian metric that is invariant under sufficient statistics. This implies that each statistical model is naturally equipped with a geometry, so Chentsov’s theorem explains why many statistical properties can be described in geometric terms. However, despite being one of the foundational theorems of statistics, Chentsov’s theorem has only been proved previously in very restricted settings or under relatively strong regularity and invariance assumptions. We therefore prove a version of this theorem for the important case of exponential families. In particular, we characterise the Fisher information metric as the only Riemannian metric (up to rescaling) on an exponential family and its derived families that is invariant under independent and identically distributed extensions and canonical sufficient statistics. Our approach is based on the central limit theorem, so it gives a unified proof for both discrete and continuous exponential families, and it is less technical than previous approaches.
1 Introduction
Chentsov’s theorem is a foundational theorem in statistics that characterizes the Fisher information metric on statistical models as the only Riemannian metric (up to rescaling) that is invariant under certain, statistically important transformations [10, 16, 9, 2, 7]. This effectively means that the Fisher information metric is the only natural metric on a statistical model, so many statistical properties of these models should be describable in terms of this metric. Known examples of this correspondence between statistical and geometric properties include: the Cramér-Rao lower bound for the variance of an unbiased estimator in terms of the inverse of the Fisher information metric [1, Thm. 2.2]; orthogonality as a criterion for first-order efficiency of estimators [1, Thm. 4.3]; the central role of statistical curvature in the information loss of an efficient estimator [12, §3.3] and in second-order efficiency [12, §3.4]; and the spontaneous emergence of the Fisher information volume [15] in the minimum description length (MDL) approach to statistical model selection [6].
The original version of Chentsov’s theorem [10, 16, 9] only applied in the restricted setting of statistical models with finite data spaces. This version of the theorem says that the Fisher information metric is the only metric (up to a multiplicative constant) that is defined on all models with finite data spaces and is invariant under all sufficient statistics. Recall that a statistical model $\mathcal{P}$ is a (sufficiently regular) set of probability measures on the same measurable space $\mathcal{X}$, which we call the data space of $\mathcal{P}$, and that a sufficient statistic for $\mathcal{P}$ is a function on $\mathcal{X}$ for which the conditional distribution of any measure $P \in \mathcal{P}$, given the sufficient statistic, is the same for all $P \in \mathcal{P}$. Sufficient statistics induce corresponding maps on statistical models (the measure-theoretic push-forward maps) and the invariance assumption above is that all of these maps are isometries (i.e., distance-preserving maps).
Since the assumption of finite data spaces is very restrictive, Ay et al. [2] proved a version of Chentsov’s theorem that applies to models whose data spaces are smooth manifolds. Their version says that the Fisher information metric is the only metric (up to rescaling) that is defined on all statistical models with a given data space and is invariant under all sufficient statistics, including discontinuous ones. This version of Chentsov’s theorem applies to many interesting statistical models but it makes very strong assumptions about both the breadth of the models on which the metrics are defined and the invariance properties of these metrics. Therefore Bauer et al. [7] proved a version of Chentsov’s theorem which says the Fisher information metric is the only metric (up to rescaling) that firstly is defined on the space of all smooth, positive densities on a compact manifold $M$ of dimension 2 or higher and secondly is invariant under all diffeomorphisms from $M$ to itself (where diffeomorphisms are smooth maps with smooth inverses, so they are a special type of sufficient statistic). The proof of Bauer et al. [7] was based on results from the theory of generalized functions, especially the Schwartz kernel theorem [11, §6.1], and it made far weaker invariance assumptions than that of Ay et al. [2]. The assumption that $M$ is a compact manifold without boundary excludes many cases of interest to statisticians, though Bauer et al. [7] say this assumption can be weakened.
Despite their beauty and generality, the results of Ay et al. [2] and Bauer et al. [7] leave open the possibility that there might exist a natural metric other than the Fisher information metric on an individual statistical model $\mathcal{P}$. This could occur, for example, if there is a natural metric on $\mathcal{P}$ that does not (invariantly) extend to a metric on the infinite-dimensional models of [2] and [7] that contain $\mathcal{P}$ and many unrelated models. Also, exponential families have a distinguished, finite-dimensional set of sufficient statistics, called the canonical sufficient statistics, which are related to their natural affine structures ([1, Thm. 2.4] and [3, Lemma 8.1]). Therefore, the invariance assumptions of [2] and [7] are arguably too strong for exponential families, and it would be more natural to consider invariance under canonical sufficient statistics rather than all sufficient statistics.
In this paper, we prove a refined version of Chentsov’s theorem in the important case of exponential families. Instead of considering metrics defined on an infinite-dimensional statistical model, as in [2] and [7], we consider metrics defined only on a given exponential family and some of its derived families, namely its independent and identically distributed (IID) extensions and their corresponding natural exponential families. Instead of assuming these metrics are invariant under all sufficient statistics or all diffeomorphisms, we assume invariance under canonical sufficient statistics and IID extensions. This assumption of invariance under IID extensions has no analogue in previous work, but IID extensions are natural and important transformations between statistical models (perhaps more so than sufficient statistics), so this invariance assumption is arguably more natural than invariance under sufficient statistics. Also, this extra invariance assumption is offset by the fact that we restrict our sufficient statistics to the canonical ones. Then, under a mild regularity condition, we prove that metrics with these invariance properties are multiples of the Fisher information metric (see Theorem 1 in Section 5). This result therefore gives a new characterisation of the Fisher information metric as the only metric on an exponential family and its derived families that is invariant under canonical sufficient statistics and IID extensions.
Our approach has a number of advantages: as discussed above, we only assume that the metric is defined on an individual model and its related models, and our invariance assumptions respect the natural affine structures of exponential families; we only consider metrics on a collection of finite-dimensional models (similar to the original version of Chentsov’s theorem [10, 16, 9]), which allows us to avoid the technicalities encountered in [2] and [7] because of the infinite-dimensionality of their statistical models; our proof is unified for discrete and continuous distributions, unlike the proofs of [10, 16, 9] and [7], so there is some hope of extending our proof to general statistical models; our proof shows that Chentsov’s theorem is a corollary of the central limit theorem, which makes this result more understandable and intuitive; and our results complement those of [7], since (curved) exponential families are essentially the only statistical models with smooth sufficient statistics that are not diffeomorphisms, by the Pitman–Koopman–Darmois theorem [5].
The rest of this paper is set out as follows. In Section 2 we define the Fisher information metric and some relevant notions from differential geometry, as they apply in our main case of interest. In Section 3 we briefly recall the definition of an exponential family and some of its derived families. We then give precise descriptions of our assumptions in Section 4, before using these assumptions and the central limit theorem to prove our characterisation of the Fisher information metric in Section 5. We then describe an extension of our proof to higher-order symmetric tensors in Section 6, before finishing with a discussion of our results in Section 7. Section 7 also begins with a non-technical summary of our proof.
2 The Fisher information metric
This section briefly recalls the definitions of tangent vectors and the Fisher information metric of a statistical model. A general reference for the notions from Riemannian geometry described here is [12, Appendix C].
In all later sections of this paper, we will take $\mathcal{P}$ to be a regular exponential family with natural parameter space $\Theta$, but in this section we let $\mathcal{P}$ be a more general statistical model and let $\Theta$ be any smooth parameter space for $\mathcal{P}$. More precisely, suppose $\Theta$ is an open subset of $\mathbb{R}^d$ and that $\mu$ is a measure on $\mathbb{R}^m$ with support $\mathcal{X}$. Then our statistical model is $\mathcal{P} = \{ p_\theta \mu \mid \theta \in \Theta \}$, where each $p_\theta : \mathcal{X} \to \mathbb{R}$ is a $\mu$-integrable, strictly positive function that is normalized, meaning $\int_{\mathcal{X}} p_\theta \, d\mu = 1$. Note that $\mathcal{P}$ is a set of probability measures on $\mathcal{X}$. We assume that the parameterisation of $\mathcal{P}$ by $\Theta$ is smooth, in the sense that $\theta \mapsto p_\theta(x)$ is a smooth (i.e., infinitely differentiable) function for $\mu$-almost all $x \in \mathcal{X}$. We also assume that the parameterisation is non-singular, meaning that the parameterisation map $\Theta \to \mathcal{P}$ given by $\theta \mapsto p_\theta \mu$ is injective and that it maps non-zero tangent vectors to non-zero tangent vectors, in a sense which will become clear below.
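Because $\Theta$ is an open subset of $\mathbb{R}^d$, any tangent vector to $\Theta$ is a pair $v = (\theta, a)$ for some $\theta \in \Theta$ and some $a \in \mathbb{R}^d$, where $\theta$ is called the base-point of $v$. The set of all such tangent vectors, which is denoted $T\Theta$ and is called the tangent bundle of $\Theta$, is therefore $\Theta \times \mathbb{R}^d$. The tangent bundle is not a vector space in general, but the set of all tangent vectors with the same base-point is. The vector space $T_\theta\Theta$ consisting of all vectors with base-point $\theta$ is called the tangent space to $\Theta$ at $\theta$. Addition and scalar multiplication in this vector space are given by
$$v + w = (\theta, a + b) \quad \text{and} \quad \lambda v = (\theta, \lambda a) \tag{1}$$
for any $v, w \in T_\theta\Theta$ and any $\lambda \in \mathbb{R}$, where $v = (\theta, a)$ and $w = (\theta, b)$. Note that addition and scalar multiplication in $T_\theta\Theta$ effectively ignore the shared base-point $\theta$.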
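Similarly, we can view each tangent vector to the statistical model $\mathcal{P}$ as a pair $(P, \dot{P})$, where the base-point $P$ is an element of the model and $\dot{P}$ is essentially the score in a particular direction [14, §3.3]. More precisely, for each tangent vector $v = (\theta, a)$ to $\Theta$, there is a corresponding tangent vector $\hat{v} = (P, \dot{P})$ to $\mathcal{P}$ given by
$$P = p_\theta \, \mu \quad \text{and} \quad \dot{P} = \Big( \sum_{i=1}^{d} a_i \frac{\partial p_\theta}{\partial \theta_i} \Big) \mu. \tag{2}$$
(The function taking $v$ to $\hat{v}$ is the differential of the parameterisation [12, Def. C.3.4].) Let the tangent bundle $T\mathcal{P}$ of $\mathcal{P}$ be the set of all such tangent vectors, i.e., let $T\mathcal{P} = \{ \hat{v} \mid v \in T\Theta \}$. Also, let the tangent space $T_P\mathcal{P}$ to $\mathcal{P}$ at $P$ be the vector space consisting of all tangent vectors with base-point $P$. Even though we have used a particular parameterisation of $\mathcal{P}$ to define $T_P\mathcal{P}$, this tangent space is natural, in the sense that $T_P\mathcal{P}$ is the same for all smooth parameterisations.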
The Fisher information metric $g_F$ on $\mathcal{P}$ is given by
$$g_F(u, v) = \int_{\mathcal{X}} \frac{d\dot{P}}{dP} \, \frac{d\dot{Q}}{dP} \, dP \tag{3}$$
for any tangent vectors $u = (P, \dot{P})$ and $v = (P, \dot{Q})$ in the tangent space $T_P\mathcal{P}$ [7, §3], where $d\dot{P}/dP$ and $d\dot{Q}/dP$ are Radon-Nikodym derivatives [8, §3.2]. It is straightforward (see Appendix A) to show that definition (3) for the Fisher information metric reduces to the usual, parameterisation-dependent definition [1, eq. 2.6]. However, the formulation (3) will be more useful to us than the usual definition. Also, because (3) is phrased only in terms of natural constructions, this formula makes it clear that $g_F$ does not depend on arbitrary choices, such as the choice of parameterisation.
A Riemannian metric on a set is just a function that puts an inner product on each of the set’s tangent spaces (if the set is suitably regular and the inner products vary smoothly with the base-point). For example, a Riemannian metric on $\Theta$ can be thought of as a smooth, matrix-valued function $G$ on $\Theta$ whose value at $\theta$ is a $d \times d$, symmetric, positive definite matrix $G(\theta)$, since this defines an inner product on each $T_\theta\Theta$ with the inner product of any $v, w \in T_\theta\Theta$ being $a^{\mathsf{T}} G(\theta) \, b$, where $v = (\theta, a)$ and $w = (\theta, b)$.
In our main case of interest, where $\mathcal{P}$ is an exponential family, the integral in (3) always converges [12, Thm. 2.2.5]. Then it is not hard to see that (3) defines an inner product on each tangent space to $\mathcal{P}$ (and this varies smoothly with the base-point), so the Fisher information metric is a Riemannian metric on $\mathcal{P}$.
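As a concrete check of formula (3) on a finite data space, the following minimal sketch evaluates the integral as a finite sum for the Bernoulli family and compares it with the familiar closed form $ab/(\theta(1-\theta))$. The Bernoulli example, its mean-value parameterisation and all function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (assumed example): formula (3) on the data space {0, 1}
# with counting measure, where the integral becomes a finite sum.

def fisher_metric(p, dp_v, dp_w):
    """Return sum_x (dp_v/p)(dp_w/p) p over a finite data space.

    p     -- probabilities of the base-point measure (all strictly positive)
    dp_v  -- densities of the signed measure of the first tangent vector
    dp_w  -- densities of the signed measure of the second tangent vector
    """
    return sum(dv * dw / px for px, dv, dw in zip(p, dp_v, dp_w))


def bernoulli(theta):
    """Probabilities of x = 0 and x = 1 under the mean parameter theta."""
    return [1.0 - theta, theta]


def bernoulli_tangent(a):
    """Derivative of bernoulli(theta) in the parameter direction a."""
    return [-a, a]


theta = 0.3
g_vw = fisher_metric(bernoulli(theta), bernoulli_tangent(1.0), bernoulli_tangent(2.0))
closed_form = 1.0 * 2.0 / (theta * (1.0 - theta))
print(g_vw, closed_form)
```

The two printed values should agree up to floating-point rounding, and swapping the two tangent vectors leaves the result unchanged, as expected of an inner product.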
3 Exponential families and their derived families
Partly to establish our notation, this section briefly recalls the definitions of an exponential family, its IID extensions and their corresponding natural exponential families.
3.1 Exponential families
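Let $\mu$ be a measure on $\mathbb{R}^m$ and let $t : \mathcal{X} \to \mathbb{R}^d$ be a measurable function, where $\mathcal{X}$ is the support of $\mu$. Let
$$\Theta = \Big\{ \theta \in \mathbb{R}^d \;\Big|\; \int_{\mathcal{X}} \exp(\theta \cdot t(x)) \, d\mu(x) < \infty \Big\},$$
where the dot ($\cdot$) denotes the Euclidean inner product on $\mathbb{R}^d$. For each $\theta \in \Theta$, define $p_\theta : \mathcal{X} \to \mathbb{R}$ by
$$p_\theta(x) = \frac{\exp(\theta \cdot t(x))}{Z(\theta)} \tag{4}$$
for any $x \in \mathcal{X}$, where $Z$ is the partition function $Z(\theta) = \int_{\mathcal{X}} \exp(\theta \cdot t(x)) \, d\mu(x)$. Assume that $\Theta$ is a non-empty, open subset of $\mathbb{R}^d$ and that $t$ is full rank, in the sense that the image of $t$ is not contained in any $(d-1)$-dimensional hyperplane in $\mathbb{R}^d$. Then $\mathcal{P} = \{ p_\theta \mu \mid \theta \in \Theta \}$ is a regular exponential family of order $d$ with dominating measure $\mu$ and canonical sufficient statistic $t$, and all regular exponential families are of this form [3, §8.1]. Note that each element of $\mathcal{P}$ is a probability measure on $\mathcal{X}$.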
3.2 IID extensions
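The $n$-fold IID extension $\mathcal{P}^n$ of $\mathcal{P}$ is the set of all measures of the form $P^n$ for some $P \in \mathcal{P}$, where $P^n = P \times \dots \times P$ (with $n$ copies of $P$) is the product measure on $\mathcal{X}^n$ [8, §3.3]. In terms of the parameterisation (4), $\mathcal{P}^n$ is the set of all measures of the form $p^n_\theta \mu^n$ for some $\theta \in \Theta$, where $p^n_\theta : \mathcal{X}^n \to \mathbb{R}$ is given by $p^n_\theta(x_1, \dots, x_n) = p_\theta(x_1) \cdots p_\theta(x_n)$ and $\mu^n$ is the product measure on $\mathcal{X}^n$ [3, Example 8.12(ii)]. So by (4),
$$p^n_\theta(x_1, \dots, x_n) = \frac{\exp\big(n\theta \cdot t^n(x_1, \dots, x_n)\big)}{Z(\theta)^n} \tag{5}$$
where $t^n : \mathcal{X}^n \to \mathbb{R}^d$ is given by $t^n(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} t(x_i)$ for any $(x_1, \dots, x_n) \in \mathcal{X}^n$. Therefore $\mathcal{P}^n$ is an exponential family with dominating measure $\mu^n$ and sufficient statistic $t^n$ (and natural parameter $n\theta$, see [12, Thm. 2.2.6]). Note that $\mathcal{P}^1 = \mathcal{P}$, $p^1_\theta = p_\theta$ and $t^1 = t$.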
3.3 Natural exponential families
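Recall that if $\mathcal{X}$ and $\mathcal{Y}$ are measurable spaces, $f : \mathcal{X} \to \mathcal{Y}$ is a measurable function and $\nu$ is a measure on $\mathcal{X}$ then the push-forward of $\nu$ via $f$ is the measure $f_*\nu$ on $\mathcal{Y}$ given by
$$(f_*\nu)(B) = \nu\big(f^{-1}(B)\big) \tag{6}$$
for any measurable set $B$ in $\mathcal{Y}$ [8, §3.6]. This immediately implies that if $X$ is an $\mathcal{X}$-valued random variable with distribution $\nu$ then $f(X)$ is a $\mathcal{Y}$-valued random variable with distribution $f_*\nu$, which in symbols we write as
$$X \sim \nu \implies f(X) \sim f_*\nu. \tag{7}$$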
Then the natural exponential family corresponding to and is the set of measures on . By [3, Examples 8.12(ii) and 8.12(iii)], , where is a measure on which does not depend on and is given by
[TABLE]
for any . The formula (8) shows that the superscript in is actually an exponent, so we will write for (and then the notation is unambiguous).
Note that even though and are families of measures on different spaces (namely, and , respectively), they are all parameterised by so they are all -dimensional families of measures.
4 Invariance and regularity conditions
Let , and be as in Section 3 and suppose now that these spaces have been equipped with Riemannian metrics , and , respectively. In this section, we will give precise conditions that formalize the notion of these metrics being invariant under IID extensions and canonical sufficient statistics, as well as giving a mild regularity condition. These conditions will then be used in Section 5 to prove our main theorem. See Section 4.4 for a number of remarks about these assumptions.
Assumptions.
We make the following assumptions, which are described precisely in the subsections below:
- A1: The metrics on the exponential family and on its $n$-fold IID extension are invariant under IID extensions (up to a factor of $n$).
- A2: The metrics on the $n$-fold IID extension and on the corresponding natural exponential family are invariant under canonical sufficient statistics.
- A3: The norms corresponding to the metrics can all be calculated by a function that satisfies a weak continuity condition.
4.1 A1: Invariance under IID extensions
Let be the function which maps each to the product measure (see Section 3.2). Then our first assumption is that this map is an isometry (i.e., distance-preserving map) up to a factor of .
More precisely, let be any tangent vector to , as in Section 2. Then similarly to (2), corresponds under the smooth parameterisation (5) to a tangent vector to , where , and . Let be the set of all such tangent vectors to . Then our first assumption is that
[TABLE]
for all tangent vectors with the same base-point. Here, and are the tangent vectors to and (respectively) corresponding to , as for above. Note that (9) just says that under the identification of with via .
The Fisher information metric is invariant under IID extensions in the sense of (9) by [1, eq. 4.2], so assumptions (A1)–(A3) cannot characterize the Fisher information metric unless the factor of is included in (9) (though see Remark 4).
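The statement that the Fisher information metric satisfies (9) can be checked numerically on a small example. The sketch below is a minimal illustration under stated assumptions: it uses the Bernoulli family in its mean-value parameterisation (not the natural one), enumerates the data space $\{0,1\}^n$ of the $n$-fold IID extension, and compares the squared Fisher norm of the corresponding tangent vector with $n$ times the squared norm on the base family. All function names are hypothetical.

```python
from itertools import product
from math import prod


def pmf(theta, x):
    """Bernoulli probability of x in {0, 1} under the mean parameter theta."""
    return theta if x == 1 else 1.0 - theta


def dpmf(x):
    """Derivative of pmf(theta, x) with respect to theta."""
    return 1.0 if x == 1 else -1.0


def fisher_info_iid(theta, n):
    """Squared Fisher norm of the d/dtheta tangent vector on the n-fold IID extension."""
    total = 0.0
    for xs in product((0, 1), repeat=n):
        p = prod(pmf(theta, x) for x in xs)
        # Product rule: differentiate one factor at a time.
        dp = sum(
            dpmf(xs[i]) * prod(pmf(theta, x) for j, x in enumerate(xs) if j != i)
            for i in range(n)
        )
        total += dp * dp / p
    return total


theta = 0.3
base = fisher_info_iid(theta, 1)
ratios = [fisher_info_iid(theta, n) / base for n in range(1, 6)]
print(ratios)  # each ratio should be close to n
```

The enumeration costs $2^n$ terms, so this is only meant for small $n$.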
4.2 A2: Invariance under canonical sufficient statistics
Let be the canonical sufficient statistic from Section 3.2 and let be the corresponding (measure-theoretic) push-forward map of , see Section 3.3. Then our second assumption is that this map is an isometry (and that all other canonical sufficient statistics are isometries, in a sense which will be made precise in Section 4.3).
More precisely, let be any tangent vector to , as in Section 2. Then similarly to (2), corresponds under the smooth parameterisation (8) to a tangent vector to , where
[TABLE]
Let be the set of all such tangent vectors. Then our second assumption is that
[TABLE]
for all tangent vectors with the same base-point. Here, and are the tangent vectors to and (respectively) corresponding to , as for above. Note that (11) just says that under the identification of with via .
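For the Fisher information metric, the invariance in (11) can be seen directly in a small case. The following sketch assumes Bernoulli data in the mean-value parameterisation: it pushes both the product measure on $\{0,1\}^n$ and its tangent vector forward under the sample mean, and checks that the squared Fisher norm is unchanged. The example and every name in it are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product
from math import prod


def pmf(theta, x):
    """Bernoulli probability of x in {0, 1} under the mean parameter theta."""
    return theta if x == 1 else 1.0 - theta


def dpmf(x):
    """Derivative of pmf(theta, x) with respect to theta."""
    return 1.0 if x == 1 else -1.0


def joint_tangent(theta, n):
    """Yield (xs, p, dp): a point of {0,1}^n, its probability, its theta-derivative."""
    for xs in product((0, 1), repeat=n):
        p = prod(pmf(theta, x) for x in xs)
        dp = sum(
            dpmf(xs[i]) * prod(pmf(theta, x) for j, x in enumerate(xs) if j != i)
            for i in range(n)
        )
        yield xs, p, dp


def squared_norm(pairs):
    """Fisher squared norm: sum of dp^2 / p over a finite data space."""
    return sum(dp * dp / p for p, dp in pairs)


def push_forward(theta, n):
    """Push the measure and its tangent forward under the sample mean."""
    p_out, dp_out = defaultdict(float), defaultdict(float)
    for xs, p, dp in joint_tangent(theta, n):
        k = sum(xs)  # the sample mean is k / n; the integer k is an exact key
        p_out[k] += p
        dp_out[k] += dp
    return [(p_out[k], dp_out[k]) for k in sorted(p_out)]


theta, n = 0.3, 4
before = squared_norm((p, dp) for _, p, dp in joint_tangent(theta, n))
after = squared_norm(push_forward(theta, n))
print(before, after)
```

Both values should be close to $n/(\theta(1-\theta))$, the Fisher information of a binomial distribution with $n$ trials.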
4.3 A3: Calculability of norms by a function that satisfies a weak continuity condition
Let be the norm corresponding to , so for any . Note that determines by the polarisation formula,
[TABLE]
for any with the same base-point (which follows from the bilinearity of ), so any question about can be phrased in terms of . However, it will be more convenient to work with than , because is a function defined on , whereas is only defined on certain pairs of tangent vectors (those with the same base-point). Similarly, let be the norm corresponding to , so for any .
Let be the set of all pairs , where is a probability measure on and is a signed measure on , and note that for every . Then our regularity condition (A3) is, firstly, that there is a subset of and a function so that, for each , (i.e. is defined on each ) and
[TABLE]
for every . In other words, we assume that there is some function whose restriction to each is the norm . For instance, we could take and then define by the requirement that (12) holds, which gives a well-defined whenever the functions agree on any overlaps between the spaces .
Further, we assume that has the following weak continuity property. Firstly, we require that is defined on all pairs of the form , where is the probability measure for the standard normal distribution on and is a linear function (with ). Secondly, we require that
[TABLE]
for any sequence of probability measures on for which is constant in , and each is standardized (i.e., has zero mean and identity variance-covariance matrix), where is the value of the function at and means converges to in the sense of the weak convergence of measures [13, Def. 1.2.1]. This condition is an extremely weak form of continuity; see Remark 1.
Lastly, as a consequence of our assumption (A2) that the metrics should be invariant under all canonical sufficient statistics, we assume that is affine invariant (see Remark 6). Here, an invertible affine transformation of is a map of the form for some invertible matrix and some . The push-forward of any signed measure on is defined in a similar way to the push-forward of an (unsigned) measure, see (6). We define the push-forward of any to be . (In this notation, is the measure-theoretic push-forward, which is a map from the space of signed measures on to itself, and is the differential of this map if is interpreted as a tangent vector.) Then our condition that is affine invariant means that and
[TABLE]
for every and every invertible affine transformation of .
For future reference, we note that if is an invertible affine transformation, is a probability measure and is a -integrable, real-valued function then
[TABLE]
by the change of variables formula [8, Thm. 3.6.1].
4.4 Remarks on the assumptions
Remark 1.
Assumptions (A1) and (A2) say that the metrics on , and are invariant under a countable set of transformations and, in a certain sense, under the finite-dimensional group of affine transformations of . The third assumption (A3) is an extremely weak form of continuity. Firstly, this condition says that the norms agree on any overlaps between the spaces , so that these functions can be pieced together into a single function . Secondly, this condition says that if is linear and is a sequence for which all have the same norms then this shared norm must be . By comparison, full continuity of would require that for every sequence in that converges to (with respect to some notion of convergence). So our third assumption is the condition for the continuity of in the very special case where , is constant in , for every and is a linear function.
Remark 2.
Recent versions of Chentsov’s theorem [2, 7] consider metrics on infinite-dimensional statistical models that are invariant under infinite-dimensional sets of transformations. This infinite dimensionality introduces technical complications and it makes strong assumptions about both the space on which the metric is defined and its symmetries. By contrast, our approach allows us to only consider metrics on a collection of finite-dimensional models, as in the original version of Chentsov’s theorem [10, 16, 9]. This allows our characterisation of the Fisher information metric to be relatively free from technicalities and it allows us to make relatively weak invariance and regularity assumptions.
Remark 3.
It is not hard to see that the Fisher information metric satisfies assumptions (A1)–(A3). For it is well known that the Fisher information metric is invariant under both IID extensions (in the sense of (9)) and sufficient statistics [1, eq. 4.2 and Thm. 2.1]. Also, given any probability measure on , let , and let be the union of these spaces as ranges over the set of all probability measures on . Then by (3), the Fisher information norm of any is just the -norm of . So if is a linear function on , say for some , and is any standardized probability measure on then
[TABLE]
where is the Euclidean norm of . So for any sequence of standardized probability measures (whether weakly convergent to or not), , so satisfies the weak continuity condition (13). Lastly, this function is affine invariant (14) by the change of variables formula (15).
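The computation in Remark 3 says that the squared $L^2$-norm of a linear function $x \mapsto a \cdot x$ is $|a|^2$ under any standardized probability measure. A minimal numerical sketch, assuming two hand-picked discrete standardized measures on $\mathbb{R}^2$ (both hypothetical examples, as are the function names):

```python
from math import sqrt


def squared_l2_norm(points, weights, a):
    """Integral of (a . x)^2 against the discrete measure sum_i weights[i] * delta(points[i])."""
    return sum(
        w * sum(ai * xi for ai, xi in zip(a, x)) ** 2
        for x, w in zip(points, weights)
    )


# Two different standardized measures on R^2 (zero mean, identity covariance).
corners = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
r = sqrt(2.0)
axes = [(r, 0.0), (-r, 0.0), (0.0, r), (0.0, -r)]
weights = [0.25] * 4

a = (3.0, 4.0)
norm_corners = squared_l2_norm(corners, weights, a)
norm_axes = squared_l2_norm(axes, weights, a)
print(norm_corners, norm_axes)  # both should be close to |a|^2 = 25
```

The two measures are quite different, yet the norm depends only on $a$, which is why the weak continuity condition (13) holds trivially for the Fisher information norm.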
Remark 4.
In some ways the factor of in (9) is not essential, since we could instead formulate our assumptions and theorems in terms of the metrics and , in which case (9) would be equivalent to the equation that describes exact invariance under the map , rather than invariance up to a factor of (though as in (12) might not exist without the factor of ). However, it is natural to include the factor of in our formulation of IID invariance, firstly because the Fisher information metric is IID invariant in the sense of (9) [1, eq. 4.2], so assumptions (A1)–(A3) would not characterise the Fisher information metric without this factor, and secondly because the factor of arises from a natural construction from differential geometry (see Remark 5).
Remark 5.
Given an arbitrary Riemannian metric on , a natural construction from differential geometry gives a metric on the -fold IID extension of equal to the metric satisfying (9), as follows. The Cartesian product of with itself times is the space whose points are -tuples of measures on . Given such an -tuple, there is a corresponding product measure on , and conversely we can recover each from by marginalizing, so we can identify with the product measure on . This product measure is the joint distribution of independent random variables whose marginal distributions are , respectively. So if satisfies then is the joint distribution of IID random variables . Therefore we can identify the diagonal
[TABLE]
of with the -fold IID extension of . But a Riemannian metric on induces a Riemannian metric on the Cartesian product , and then inherits a metric from its super-manifold . Under the above identification between and , this metric is the metric on that satisfies (9).
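The factor of $n$ in this construction can be made explicit in one line. In the sketch below, the symbols $g$ (a metric on the model), $g^{\times n}$ (the induced product metric on the Cartesian product) and $(v, \dots, v)$ (a tangent vector to the diagonal) are notation assumed here for illustration.

```latex
% Assumed notation: g is a metric on the model, g^{\times n} the product metric
% on the n-fold Cartesian product, and (v, ..., v) a tangent vector to the diagonal.
\begin{align*}
g^{\times n}\big((v_1,\dots,v_n),(w_1,\dots,w_n)\big) &= \sum_{i=1}^{n} g(v_i, w_i), \\
g^{\times n}\big((v,\dots,v),(w,\dots,w)\big) &= \sum_{i=1}^{n} g(v, w) = n\, g(v, w).
\end{align*}
```

So restricting the product metric to the diagonal multiplies inner products by $n$, which matches the form of invariance described in (9).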
Remark 6.
The canonical sufficient statistics for an exponential family are only unique up to affine transformations [3, Lemma 8.1], meaning that if is an invertible affine transformation of and is a canonical sufficient statistic then is also a canonical sufficient statistic (and every canonical sufficient statistic is of this form). Replacing by effectively replaces each tangent vector by , so (11), (12) and the analogous equations for imply for every . So since is arbitrary, is affine invariant.
5 The main theorem
We can now prove our version of Chentsov’s theorem. This theorem characterises the Fisher information metric as the only metric (up to rescaling) on an exponential family that is invariant under IID extensions and canonical sufficient statistics.
Let , and be the Fisher information metrics on , and , respectively.
Theorem 1.
Suppose that assumptions (A1)–(A3) of Section 4 hold. Then there is some so that , and for every integer .
Proof.
Let any integer and any be given, and let and be the corresponding distributions in and . By Theorem 2.2.6 of [12] and the comments preceding it, if are independent random variables all distributed according to then their mean is distributed as , which we write as
[TABLE]
Alternatively, it is not hard to prove (16), since if are IID and then are IID and , by (7) and since and (by definition), where . This proves (16) because and have the same joint distribution so their means have the same distribution, by another application of (7).
By (16), the mean for is the same as that for , i.e.
[TABLE]
and the variance-covariance matrix for is $1/n$ times that for , i.e.
[TABLE]
Now, let be any tangent vector to at , and define by for any . Here, is defined in the standard way via a diagonalisation of the symmetric, positive-definite matrix . As before, let and , respectively, be the tangents to and that correspond to under the parameterisations (5) and (8).
Claim 1: . By (8), (10) and the fact that is the gradient of at [12, Thm. 2.2.1], with and
[TABLE]
where and for any .
Let be the affine transformation on given by , and note that exists because is positive-definite. By (17) and (18), this choice of ensures that is standardised, i.e., that has mean zero and variance-covariance matrix equal to the identity matrix. Note that depends on , so we could instead write this as , but for notational simplicity we will drop the subscript. Then by (15) and (19),
[TABLE]
where is as in the statement of the claim.
So recalling the notation , we have
[TABLE]
By (16), the central limit theorem (e.g. see [13, Cor. 8.1.10]) and the fact that is standardised, . Therefore,
[TABLE]
so the claim is proved.
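Claim 1 rests on the standardised sample mean converging weakly to the standard normal distribution. The sketch below illustrates that convergence under the assumption of Bernoulli data, for which the distribution of the mean is binomial and can be computed exactly. It measures the Kolmogorov distance to the normal distribution function, which is a stronger notion than weak convergence when the limit is continuous. The choice of example and the function names are assumptions made for illustration.

```python
from math import comb, erf, sqrt


def normal_cdf(z):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def kolmogorov_distance(theta, n):
    """Sup-distance between the CDF of the standardised mean of n Bernoulli(theta)
    variables and the standard normal CDF."""
    sd = sqrt(theta * (1.0 - theta) / n)  # standard deviation of the sample mean
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        z = (k / n - theta) / sd
        phi = normal_cdf(z)
        worst = max(worst, abs(cdf - phi))  # left limit at the jump
        cdf += comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
        worst = max(worst, abs(cdf - phi))  # value at the jump
    return worst


distances = {n: kolmogorov_distance(0.3, n) for n in (10, 100, 400)}
print(distances)
```

The distances shrink as $n$ grows, roughly like $1/\sqrt{n}$, consistent with the Berry-Esseen bound.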
Now, let be any tangent vector to , not necessarily with the same base-point as , and let be the corresponding tangent vector to .
Claim 2: implies . To prove this, assume that , i.e. that and have the same Euclidean norm. Then there exists a $d \times d$ orthogonal matrix so that
[TABLE]
Also, because is orthogonal, so
[TABLE]
by (15), where is given by
[TABLE]
for any , by (23) and (since is orthogonal). So
[TABLE]
which proves Claim 2.
Claim 3: There is some so that for all tangent vectors . It is well-known [12, Thms. 2.2.1 and 2.2.5] that the Fisher information metric on the natural parameter space is the variance-covariance matrix of the corresponding sufficient statistic, so . Alternatively, this follows easily from setting in (19) and combining this with (3) and the invariance of under sufficient statistics [1, Thm. 2.1], since these give
[TABLE]
where is the tangent vector to corresponding to . So Claim 2 is equivalent to
[TABLE]
for all tangent vectors , even if they have different base-points.
Now, fix to be some non-zero vector with , and let . Note that because is an inner product on each tangent space so the norm of any non-zero tangent vector is strictly positive. Then for any non-zero , by the bilinearity of . So and hence, by (27), . Therefore by the bilinearity of , so rearranging this equation proves the claim for all non-zero tangent vectors . But the claim holds trivially for any zero tangent vector , since by the bilinearity of and , so the claim is proved.
The theorem now follows from Claim 3 and by (9), (11) and the analogous equations for the Fisher information metrics , and , which hold by [1, eq. 4.2 and Thm. 2.1]. ∎
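The identity used in Claim 3, that the Fisher information in the natural parameter equals the variance of the canonical sufficient statistic (and hence the second derivative of the log partition function), can be checked on a one-dimensional example. The sketch below assumes the Poisson family written in natural form, with dominating measure $\mu(\{x\}) = 1/x!$ on the non-negative integers and sufficient statistic $x$, so that $\log Z(\theta) = e^\theta$. The truncation point and all names are assumptions for illustration.

```python
from math import exp, lgamma, log

X_MAX = 100  # truncation point for the sums (assumed large enough for theta = 0.5)


def log_partition(theta):
    """log Z(theta) for mu({x}) = 1/x! on x = 0, 1, 2, ... with statistic x."""
    return log(sum(exp(theta * x - lgamma(x + 1)) for x in range(X_MAX + 1)))


def variance_of_statistic(theta):
    """Variance of the statistic x under the density exp(theta * x) / Z(theta)."""
    log_z = log_partition(theta)
    probs = [exp(theta * x - lgamma(x + 1) - log_z) for x in range(X_MAX + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    return sum((x - mean) ** 2 * p for x, p in enumerate(probs))


def second_derivative(f, theta, h=1e-4):
    """Central finite-difference approximation to f''(theta)."""
    return (f(theta + h) - 2.0 * f(theta) + f(theta - h)) / (h * h)


theta = 0.5
var_t = variance_of_statistic(theta)
hess = second_derivative(log_partition, theta)
print(var_t, hess, exp(theta))
```

All three printed values should agree, the finite-difference value only up to its discretisation and rounding error.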
6 Extensions to higher-order symmetric tensors
The proof of Theorem 1 extends with almost no changes to characterise symmetric, order- tensors and on and , respectively, that satisfy conditions closely analogous to assumptions (A1)–(A3) of Section 4. Given such tensors , define , where there are copies of in the right-hand side of this equation. Assume that
[TABLE]
which is a generalisation of (9) from to general . Then as in the proof of Theorem 1, and for any (by (28) and the multi-linearity of ). So with in place of , the proof of Theorem 1 implies that for some , where is the norm of the Fisher information metric. Raising this equation to the power of gives
[TABLE]
If is odd then the left-hand side is an odd function of (i.e. it changes sign when is replaced by ) while the right-hand side is an even function, which is a contradiction unless both sides vanish, so . If is even, then since is determined by (29) (by the polarisation formula for symmetric tensors), must be a constant times the symmetric part of . For example, when then there is some so that
[TABLE]
for any .
Remark 7.
It might also be possible to adapt the proof of Theorem 1 to characterise the higher-order Amari-Chentsov tensors, which are symmetric, order- tensors that coincide with the Fisher information metric when and in general are given by an equation similar to (3), e.g. see [2, eq. 2.4] for the case. Claim 1 in the proof of Theorem 1 does not seem to hold for these tensors in general. However, if we replace the in (28) by other powers and strengthen the weak continuity condition on then it might be possible to replace Claim 1 by , where is an Edgeworth polynomial (see [4] or [12, §4.5]). Then a symmetry argument, similar to the one in the proof of Theorem 1, should give the desired characterisation.
7 Discussion
Our version of Chentsov’s theorem characterises the Fisher information metric as the unique Riemannian metric (up to rescaling) on an exponential family which is invariant under IID extensions and canonical sufficient statistics. We proved this by considering metrics on , on the -fold IID extension of , and on the natural exponential family corresponding to . Then, under the above invariance conditions, can be calculated in terms of , for any . But for large , the central limit theorem and a property (16) of exponential families imply that consists of distributions which are all approximately normally distributed, so each distribution in is determined to a good approximation by its mean and variance-covariance matrix. Further, each tangent vector to is essentially a linear function times a distribution in . Combining these facts shows that (the norm corresponding to) is approximately equal to a simple function of and the mean and variance-covariance matrix of the relevant distribution in . Our regularity condition implies that this approximation becomes exact in the limit as . Then our main result follows from an identity (26) relating the variance-covariance matrix to the Fisher information metric on an exponential family.
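Two ingredients of this argument can be checked numerically in a toy case: the identity between the Fisher information and the variance of the canonical sufficient statistic, and the behaviour of the Fisher information under IID extensions and canonical sufficient statistics. The following is a minimal sketch, assuming a Bernoulli family written in its natural (log-odds) parameter. The example and all function names are illustrative choices and are not taken from the paper.

```python
import itertools
import math


def mean(theta):
    """Mean of a Bernoulli variable with natural (log-odds) parameter theta."""
    return 1.0 / (1.0 + math.exp(-theta))


def variance(theta):
    """Variance of the canonical sufficient statistic x for one observation."""
    mu = mean(theta)
    return mu * (1.0 - mu)


def bernoulli_pmf(x, theta):
    """Density exp(theta * x - log(1 + e^theta)) for x in {0, 1}."""
    return math.exp(theta * x - math.log1p(math.exp(theta)))


def fisher_iid(theta, n):
    """Fisher information of the n-fold IID extension.

    Computed as E[score^2] by enumerating every sequence in {0, 1}^n.
    """
    mu = mean(theta)
    total = 0.0
    for xs in itertools.product((0, 1), repeat=n):
        prob = math.prod(bernoulli_pmf(x, theta) for x in xs)
        score = sum(xs) - n * mu  # derivative of the log-likelihood in theta
        total += prob * score ** 2
    return total


def fisher_sum(theta, n):
    """Fisher information of the family of the sufficient statistic.

    The sum of n IID Bernoulli variables is Binomial(n, mu), which is again
    an exponential family in the same natural parameter theta.
    """
    mu = mean(theta)
    total = 0.0
    for s in range(n + 1):
        prob = math.comb(n, s) * mu ** s * (1.0 - mu) ** (n - s)
        score = s - n * mu
        total += prob * score ** 2
    return total
```

In this toy case `fisher_iid(theta, n)`, `fisher_sum(theta, n)` and `n * variance(theta)` agree up to floating-point error. This is consistent with the Fisher information being unchanged when passing to the canonical sufficient statistic, and scaling linearly in n under IID extensions.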
In general, Chentsov’s theorem characterizes the Fisher information metrics of statistical models as the only Riemannian metrics (up to rescaling) that are invariant under certain statistically important transformations. Previous studies have taken these transformations to be either all sufficient statistics or a large, regular subset of these. By contrast, we take these statistically important transformations to be the IID extensions and canonical sufficient statistics. This class of transformations is arguably more natural than the class of all sufficient statistics, it is more appropriate for exponential families, and it is a relatively small class, so our invariance assumptions are weaker than those of previous studies. Our regularity assumptions also appear to be weaker than those of previous studies, ultimately because our approach only requires us to study a collection of finite-dimensional models, rather than an infinite-dimensional model.
We have given a new characterisation of the Fisher information metric on an exponential family and we have shown that this result is an intuitive consequence of the central limit theorem. The main limitation of this paper is that our main result is only proved for exponential families. However, exponential families are an important class of statistical models, being well studied and widely used in applications. Also, our proof treats discrete and continuous models in a uniform way, so there is some hope that our approach can be adapted to give a proof of Chentsov’s theorem for general statistical models. Lastly, our focus on exponential families complements the focus of Bauer et al. [7] on diffeomorphism-invariant metrics, since (curved) exponential families are essentially the only statistical models which have smooth sufficient statistics that are not diffeomorphisms, by the Pitman–Koopman–Darmois theorem [5].
Appendix A The invariant and parameterisation-dependent definitions of the Fisher information metric coincide
This section proves (in the notation of Section 2) that the invariant definition (3) of the Fisher information metric reduces to the usual parameterisation-dependent definition given by (31), below.
Given any tangent vectors and in , let and be the corresponding tangent vectors in , where . Then by (2),
[TABLE]
so , and similarly for . Substituting these into (3) gives
[TABLE]
where is the matrix with entry
[TABLE]
for any . Therefore the invariant definition (3) reduces to the usual, parameterisation-dependent definition (31) for the Fisher information metric [1, eq. 2.6].
Remark 8.
The metric on is just the pull-back of the metric on via the parameterisation map .
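One consequence of this pull-back description is that changing the parameterisation changes the coordinate expression of the Fisher information only through the Jacobian of the change of coordinates. The following is a minimal numerical sketch of this behaviour, assuming a Bernoulli family described either by its log-odds or by its mean. The example and the function names are hypothetical illustrations, not taken from the paper.

```python
import math


def fisher_log_odds(theta):
    """Fisher information in the log-odds parameter theta, as E[score^2]."""
    mu = 1.0 / (1.0 + math.exp(-theta))
    # In this parameterisation the score is x - mu.
    return sum(p * (x - mu) ** 2 for x, p in ((0, 1.0 - mu), (1, mu)))


def fisher_mean(mu):
    """Fisher information in the mean parameter mu, as E[score^2]."""
    # log p(x) = x log(mu) + (1 - x) log(1 - mu), so the score is
    # x / mu - (1 - x) / (1 - mu).
    return sum(
        p * (x / mu - (1 - x) / (1.0 - mu)) ** 2
        for x, p in ((0, 1.0 - mu), (1, mu))
    )


def dtheta_dmu(mu):
    """Jacobian of the change of coordinates theta = log(mu / (1 - mu))."""
    return 1.0 / (mu * (1.0 - mu))


def pulled_back_fisher(mu):
    """Fisher information in mu, obtained by pulling back from theta."""
    theta = math.log(mu / (1.0 - mu))
    return dtheta_dmu(mu) ** 2 * fisher_log_odds(theta)
```

For any `mu` strictly between 0 and 1, `fisher_mean(mu)` and `pulled_back_fisher(mu)` agree up to floating-point error, so the two coordinate expressions describe the same metric on the underlying family.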
References
- [1] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, Providence, 2000.
- [2] N. Ay, J. Jost, H. Vân Lê, and L. Schwachhöfer. Information geometry and sufficient statistics. Probab. Theory Relat. Fields, 162:327–364, 2015.
- [3] O. Barndorff-Nielsen. Information and Exponential Families. John Wiley & Sons, New York, 1978.
- [4] O. Barndorff-Nielsen and D. R. Cox. Edgeworth and saddle-point approximations with statistical applications. Journal of the Royal Statistical Society Series B (Methodological), 41(3):279–312, 1979.
- [5] O. Barndorff-Nielsen and K. Pedersen. Sufficient data reduction and exponential families. Math. Scand., 22:197–202, 1968.
- [6] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, October 1998.
- [7] M. Bauer, M. Bruveris, and P. W. Michor. Uniqueness of the Fisher-Rao metric on the space of smooth densities. Bull. London Math. Soc., 48:499–506, 2016.
- [8] V. I. Bogachev. Measure Theory, Volume I. Springer, Berlin, 2007.
