Information-geometrical characterization of statistical models which are statistically equivalent to probability simplexes
Hiroshi Nagaoka

Abstract
The probability simplex is the set of all probability distributions on a finite set and is the most fundamental object in the finite probability theory. In this paper we give a characterization of statistical models on finite sets which are statistically equivalent to probability simplexes in terms of $\alpha$-families including exponential families and mixture families. The subject has a close relation to some fundamental aspects of information geometry such as $\alpha$-connections and autoparallelity.
Graduate School of Informatics and Engineering
The University of Electro-Communications
Chofu, Tokyo 182-8585, Japan
Email: [email protected]
I An introductory example
Let and let be the set of probability distributions on of the form
[TABLE]
The statistical model has the following three properties. Firstly, it is a mixture family since
[TABLE]
Secondly, it is an exponential family since
[TABLE]
where , and . Lastly, is statistically equivalent to the 1-dimensional open probability simplex in the sense that there exist a channel from to and a channel from to such that is the set of output distributions of for input distributions in and that is invertible by . The matrix representations of these channels are given by
[TABLE]
Note that the invertibility holds.
Our aim is to show the equivalence between the first two properties and the last one.
II Statement of the main result
We begin with giving some basic definitions which are necessary to state our problem.
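For an arbitrary finite set $\Omega$, let $\bar{\mathcal{P}}(\Omega)$ and $\mathcal{P}(\Omega)$ be the sets of probability distributions and of strictly positive probability distributions on $\Omega$;
$$\bar{\mathcal{P}}(\Omega) := \Bigl\{ p : \Omega \to [0,1] \Bigm| \sum_{\omega\in\Omega} p(\omega) = 1 \Bigr\}, \qquad \mathcal{P}(\Omega) := \bigl\{ p \in \bar{\mathcal{P}}(\Omega) \bigm| p(\omega) > 0 \ \ \forall \omega\in\Omega \bigr\}.$$
In particular, let for an arbitrary positive integer $n$
$$\bar{\mathcal{P}}_n := \bar{\mathcal{P}}(\{0, 1, \ldots, n\}), \qquad \mathcal{P}_n := \mathcal{P}(\{0, 1, \ldots, n\}),$$
which we call the $n$-dimensional (closed and open) probability simplexes.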
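A mapping $\Phi : \bar{\mathcal{P}}(\Omega_1) \to \bar{\mathcal{P}}(\Omega_2)$, where $\Omega_1$ and $\Omega_2$ are finite sets, is called a Markov map when there exists a channel $W(y \mid x)$ from $\Omega_1$ to $\Omega_2$ such that, for any $p \in \bar{\mathcal{P}}(\Omega_1)$,
$$\Phi(p) = \sum_{x\in\Omega_1} W(\,\cdot \mid x)\, p(x),$$
i.e., $\Phi(p)$ is the output distribution of the channel $W$ corresponding to the input distribution $p$. Note that a Markov map is affine; $\Phi(a p + (1-a) q) = a\,\Phi(p) + (1-a)\,\Phi(q)$ for $\forall p, q \in \bar{\mathcal{P}}(\Omega_1)$ and $0 \le a \le 1$.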
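The affinity of a Markov map can be checked numerically. The following is a minimal sketch, assuming a hypothetical channel from a 2-point set to a 3-point set; the matrix `W`, the distributions `p` and `q`, and the helper name `markov_map` are illustrative choices, not taken from the paper.

```python
from fractions import Fraction as F

def markov_map(W, p):
    """Output distribution of the channel W for the input distribution p.

    W[y][x] is the probability of output y given input x, so each column sums to 1.
    """
    return [sum(W[y][x] * p[x] for x in range(len(p))) for y in range(len(W))]

# Hypothetical channel from {0, 1} to {0, 1, 2} (illustration only).
W = [[F(1, 2), F(0)],
     [F(1, 2), F(1, 3)],
     [F(0),    F(2, 3)]]

p = [F(1, 4), F(3, 4)]
q = [F(2, 3), F(1, 3)]
a = F(2, 5)

# Map of the mixture versus mixture of the maps.
mix = [a * pi + (1 - a) * qi for pi, qi in zip(p, q)]
lhs = markov_map(W, mix)
rhs = [a * u + (1 - a) * v for u, v in zip(markov_map(W, p), markov_map(W, q))]
print(lhs == rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than with a floating-point tolerance.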
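Let $M$ and $N$ be smooth submanifolds (statistical models) of $\mathcal{P}(\Omega_1)$ and $\mathcal{P}(\Omega_2)$, respectively. When there exists a pair of Markov maps $\Phi : \bar{\mathcal{P}}(\Omega_1) \to \bar{\mathcal{P}}(\Omega_2)$ and $\Psi : \bar{\mathcal{P}}(\Omega_2) \to \bar{\mathcal{P}}(\Omega_1)$ such that their restrictions $\Phi|_M$ and $\Psi|_N$ are bijections between $M$ and $N$ and are the inverse mappings of each other, we say that $M$ and $N$ are Markov equivalent or statistically equivalent and write $M \simeq N$.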
The aim of this paper is to give a characterization of statistical models which are statistically equivalent to probability simplexes. The main result is as follows.
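Theorem 1 For an arbitrary smooth submanifold $M$ of $\mathcal{P}(\Omega)$, the following conditions are mutually equivalent.
- (i) $M \simeq \mathcal{P}_n$, where $n = \dim M$.
- (ii) $M$ is an exponential family and is a mixture family.
- (iii) $\exists \alpha \neq \exists \beta$, $M$ is an $\alpha$-family and is a $\beta$-family.
- (iv) $\forall \alpha$, $M$ is an $\alpha$-family.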
Explanation of exponential family, mixture family and $\alpha$-family for arbitrary $\alpha \in \mathbb{R}$ as well as the proof of the theorem will be presented in subsequent sections. Here we only give a few remarks on condition (i). Firstly, (i) is equivalent to the condition that $\exists n$, $M \simeq \mathcal{P}_n$, since if $M \simeq \mathcal{P}_n$ then $M$ and $\mathcal{P}_n$ must be diffeomorphic, so that $n = \dim M$. Secondly, (i) is equivalent to the condition $\bar{M} \simeq \bar{\mathcal{P}}_n$, where $\bar{M}$ denotes the topological closure of $M$, and $\bar{M} \simeq \bar{\mathcal{P}}_n$ means that $\bar{M}$ is the set of output distributions of an invertible (error-free) channel.
III Some facts about condition (i)
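From the definition of the relation $\simeq$, condition (i) implies that there exist Markov maps $\Phi : \bar{\mathcal{P}}_n \to \bar{\mathcal{P}}(\Omega)$ and $\Psi : \bar{\mathcal{P}}(\Omega) \to \bar{\mathcal{P}}_n$ satisfying $\Psi \circ \Phi = \mathrm{id}$ (the identity on $\bar{\mathcal{P}}_n$). Let $q_i \in \bar{\mathcal{P}}(\Omega)$, $i \in \{0, \ldots, n\}$, be defined by
$$q_i := \Phi(\delta_i), \tag{1}$$
where $\delta_i$ is the delta distribution on $\{0, \ldots, n\}$ concentrated on $i$. Then it is easy to see, as is shown in Lemma 9.5 and its “Supplement” of [1] where our $\Phi$ is called a congruent embedding (of $\bar{\mathcal{P}}_n$ into $\bar{\mathcal{P}}(\Omega)$), that the supports $A_i := \mathrm{supp}(q_i)$ constitute a partition of $\Omega$ in the sense that
$$A_i \cap A_j = \emptyset \ \ (i \neq j), \qquad \bigcup_{i=0}^{n} A_i = \Omega, \tag{2}$$
and the left inverse $\Psi$ of $\Phi$ is represented as
$$\Psi(p) = \bigl( p(A_0), \ldots, p(A_n) \bigr), \tag{3}$$
where $p(A_i) := \sum_{\omega \in A_i} p(\omega)$. In addition, condition (i) implies $M = \Phi(\mathcal{P}_n)$, so that from (1) we have
$$M = \Bigl\{ \sum_{i=0}^{n} \lambda_i q_i \Bigm| (\lambda_0, \ldots, \lambda_n) \in \mathcal{P}_n \Bigr\}. \tag{4}$$
Conversely, if a statistical model $M$ is represented in the form (4) by a collection of distributions $\{q_i\}$ on $\Omega$ whose supports constitute a partition of $\Omega$, then we see that $M$ satisfies condition (i) by defining $\Phi$ and $\Psi$ by (1) and (3). Thus a necessary and sufficient condition for (i) is obtained, which will be used in later arguments to prove the theorem.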
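The converse direction can be illustrated with a small numerical sketch. It assumes a hypothetical 5-point sample space split into two supports and two hand-picked distributions `q[0]`, `q[1]`; the names `embed` and `coarse_grain` are illustrative stand-ins for the embedding and its left inverse, not notation from the paper.

```python
from fractions import Fraction as F

# Hypothetical example: sample space {0,...,4}, supports {0,1} and {2,3,4}.
supports = [[0, 1], [2, 3, 4]]
q = [
    [F(1, 3), F(2, 3), F(0), F(0), F(0)],     # distribution supported on {0, 1}
    [F(0), F(0), F(1, 2), F(1, 4), F(1, 4)],  # distribution supported on {2, 3, 4}
]

def embed(lam):
    """Send a point lam of the simplex to the mixture sum_i lam[i] * q[i]."""
    size = len(q[0])
    return [sum(l * qi[w] for l, qi in zip(lam, q)) for w in range(size)]

def coarse_grain(p):
    """Send a distribution p to the vector of probabilities of the supports."""
    return [sum(p[w] for w in A) for A in supports]

lam = [F(3, 10), F(7, 10)]
p = embed(lam)
recovered = coarse_grain(p)
print(recovered == lam)
```

Both maps are induced by channels: `embed` draws a point of the selected support according to `q[i]`, and `coarse_grain` deterministically reports which support a point lies in.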
IV $\alpha$-family, e-family and m-family
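Following the way developed in [5] (see also [3, 4]), we give the definition of $\alpha$-family, which includes that of exponential family and mixture family as special cases.
For an arbitrary $\alpha \in \mathbb{R}$, define a function $L^{(\alpha)} : \mathbb{R}_{++} \to \mathbb{R}$, where $\mathbb{R}_{++} := (0, \infty)$, by
$$L^{(\alpha)}(x) := \begin{cases} x^{\frac{1-\alpha}{2}} & (\alpha \neq 1), \\ \log x & (\alpha = 1). \end{cases} \tag{5}$$
(Footnote 1) $L^{(\alpha)}$ can be replaced with $a L^{(\alpha)} + b$ by arbitrary constants $a \neq 0$ and $b$, possibly depending on $\alpha$. In [3, 4, 5], these constants are properly chosen so that the $\pm\alpha$-duality and the limit $\alpha \to 1$ of $L^{(\alpha)}$ can be treated in a convenient way.
The function is naturally extended to a mapping $L^{(\alpha)}$ ($\mathbb{R}_{++}^{\Omega} \to \mathbb{R}^{\Omega}$) by
$$\bigl( L^{(\alpha)}(f) \bigr)(\omega) := L^{(\alpha)}\bigl( f(\omega) \bigr). \tag{6}$$
For a submanifold $M$ of $\mathcal{P}(\Omega)$, its denormalization $\tilde{M}$ is defined by
$$\tilde{M} := \{ \tau p \mid p \in M, \ \tau > 0 \}, \tag{7}$$
where $\tau p$ denotes the function $\omega \mapsto \tau p(\omega)$. The denormalization is an extended manifold obtained by relaxing the normalization constraint $\sum_{\omega} p(\omega) = 1$. Obviously, $\tilde{M}$ is a submanifold of $\tilde{\mathcal{P}}(\Omega) = \mathbb{R}_{++}^{\Omega}$, and $\tilde{\mathcal{P}}(\Omega)$ is an open subset of $\mathbb{R}^{\Omega}$. When the image
$$L^{(\alpha)}(\tilde{M}) = \{ L^{(\alpha)}(\tau) \mid \tau \in \tilde{M} \}$$
forms an open subset of an affine subspace, say $Z$, of $\mathbb{R}^{\Omega}$, $M$ is called an $\alpha$-family. In this paper, it is assumed for simplicity that $M$ is maximal in the sense that
$$L^{(\alpha)}(\tilde{M}) = Z \cap L^{(\alpha)}(\mathbb{R}_{++}^{\Omega}). \tag{8}$$
Since it follows from the definition (5) of $L^{(\alpha)}$ that
$$L^{(\alpha)}(\mathbb{R}_{++}^{\Omega}) = \begin{cases} \mathbb{R}_{++}^{\Omega} & (\alpha \neq 1), \\ \mathbb{R}^{\Omega} & (\alpha = 1), \end{cases}$$
(8) is written as
$$L^{(\alpha)}(\tilde{M}) = \begin{cases} Z \cap \mathbb{R}_{++}^{\Omega} & (\alpha \neq 1), \\ Z & (\alpha = 1). \end{cases} \tag{9}$$
Note that, as is pointed out in section 2.6 of [4], an affine subspace $Z$ satisfying (9) must be a linear subspace when $\alpha \neq 1$. Note also that $\mathcal{P}(\Omega)$ is an $\alpha$-family for $\forall \alpha \in \mathbb{R}$, corresponding to the case when $Z = \mathbb{R}^{\Omega}$.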
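When $\alpha = 1$, the notion of $\alpha$-family is equivalent to that of exponential family, whose general form is such that
$$M = \{ p_\theta \mid \theta \in \mathbb{R}^n \}, \qquad p_\theta(\omega) = \exp\Bigl\{ C(\omega) + \sum_{i=1}^{n} \theta^i F_i(\omega) - \psi(\theta) \Bigr\},$$
where $C, F_1, \ldots, F_n$ are functions on $\Omega$ and $\psi$ is a function on $\mathbb{R}^n$ defined by
$$\psi(\theta) := \log \sum_{\omega \in \Omega} \exp\Bigl\{ C(\omega) + \sum_{i=1}^{n} \theta^i F_i(\omega) \Bigr\}.$$
When $\alpha = -1$, on the other hand, the notion of $\alpha$-family is equivalent to that of mixture family, whose general form is such that
$$M = \{ p_\theta \mid \theta \in \Theta \}, \qquad p_\theta(\omega) = C(\omega) + \sum_{i=1}^{n} \theta^i F_i(\omega), \qquad \Theta := \{ \theta \in \mathbb{R}^n \mid p_\theta(\omega) > 0 \ \ \forall \omega \in \Omega \},$$
where $C, F_1, \ldots, F_n$ are functions on $\Omega$ satisfying
$\sum_{\omega} C(\omega) = 1$ and $\sum_{\omega} F_i(\omega) = 0$.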
When , the general form of -family is
[TABLE]
See §2.6 of [4] for further details.
V Proof of (i) $\Rightarrow$ (iv)
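Assume (i), which implies that there exists a collection $\{q_i\}_{i=0}^{n}$ of probability distributions whose supports $A_i := \mathrm{supp}(q_i)$ constitute a partition of $\Omega$ and that $M$ is represented as (4). Then the denormalization $\tilde{M}$ is represented as
$$\tilde{M} = \Bigl\{ \sum_{i=0}^{n} \tau_i q_i \Bigm| (\tau_0, \ldots, \tau_n) \in \mathbb{R}_{++}^{n+1} \Bigr\}.$$
Let $\alpha$ be an arbitrary real number such that $\alpha \neq 1$. Since $L^{(\alpha)}(x) = x^{\frac{1-\alpha}{2}}$ in this case, it follows from the disjointness of the supports of $\{q_i\}$ that
$$L^{(\alpha)}\Bigl( \sum_{i=0}^{n} \tau_i q_i \Bigr) = \sum_{i=0}^{n} \tau_i^{\frac{1-\alpha}{2}} f_i$$
for any $(\tau_0, \ldots, \tau_n) \in \mathbb{R}_{++}^{n+1}$, where $f_i \in \mathbb{R}^{\Omega}$ is defined by $f_i(\omega) := q_i(\omega)^{\frac{1-\alpha}{2}}$ for $\omega \in A_i$ and $f_i(\omega) := 0$ otherwise. From this we have
$$L^{(\alpha)}(\tilde{M}) = Z \cap \mathbb{R}_{++}^{\Omega},$$
where $Z$ is the $(n+1)$-dimensional linear subspace of $\mathbb{R}^{\Omega}$ spanned by $f_i$, $i \in \{0, \ldots, n\}$. This proves that $M$ is an $\alpha$-family for any $\alpha \neq 1$.
Let $\alpha = 1$. For any $\tau = \sum_i \tau_i q_i \in \tilde{M}$, we have
$$\log \tau(\omega) = \log \tau_i + \log q_i(\omega) \quad \text{for } \omega \in A_i, \ i \in \{0, \ldots, n\}.$$
Let $1_{A_i}$ denote the element of $\mathbb{R}^{\Omega}$ such that $1_{A_i}(\omega) = 1$ if $\omega \in A_i$ and $1_{A_i}(\omega) = 0$ otherwise. Letting $C \in \mathbb{R}^{\Omega}$ be defined by $C(\omega) := \log q_i(\omega)$ for $\omega \in A_i$, we have
$$\log \tilde{M} = \Bigl\{ C + \sum_{i=0}^{n} \theta_i 1_{A_i} \Bigm| (\theta_0, \ldots, \theta_n) \in \mathbb{R}^{n+1} \Bigr\},$$
which is an affine subspace of $\mathbb{R}^{\Omega}$. This proves that $M$ is a $1$-family (an exponential family).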
The implication (i) $\Rightarrow$ (iv) has thus been proved.
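The argument of this section can be sanity-checked numerically. The sketch below assumes a hypothetical model on a 5-point sample space built from two distributions with disjoint supports, and takes the representation function to be `x ** ((1 - alpha) / 2)` for `alpha != 1` and `log x` for `alpha == 1`; all names (`supports`, `q`, `L`, `check_alpha`, `check_log`) are illustrative.

```python
import math

# Hypothetical model: two distributions with disjoint supports on {0,...,4}.
supports = [[0, 1], [2, 3, 4]]
q = [[1 / 3, 2 / 3, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.25, 0.25]]
SIZE = 5

def L(alpha, x):
    """Assumed alpha-representation of a positive number x."""
    return math.log(x) if alpha == 1 else x ** ((1 - alpha) / 2)

def denormalized(tau):
    """Positive combination sum_i tau[i] * q[i] (no normalization of tau)."""
    return [sum(t * qi[w] for t, qi in zip(tau, q)) for w in range(SIZE)]

def basis(alpha, i):
    """L(alpha, q[i]) on the i-th support and 0 elsewhere (for alpha != 1)."""
    return [L(alpha, q[i][w]) if w in supports[i] else 0.0 for w in range(SIZE)]

def check_alpha(alpha, tau):
    """For alpha != 1: the image equals sum_i L(alpha, tau[i]) * basis(alpha, i)."""
    image = [L(alpha, v) for v in denormalized(tau)]
    coeff = [L(alpha, t) for t in tau]
    span = [sum(c * basis(alpha, i)[w] for i, c in enumerate(coeff))
            for w in range(SIZE)]
    return all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(image, span))

def check_log(tau):
    """For alpha == 1: log of the combination minus log q[i] equals log tau[i] on each support."""
    image = [math.log(v) for v in denormalized(tau)]
    return all(
        math.isclose(image[w] - math.log(q[i][w]), math.log(tau[i]), abs_tol=1e-12)
        for i, A in enumerate(supports) for w in A
    )

tau = [0.7, 2.5]
print(all(check_alpha(a, tau) for a in (-1.0, 0.0, 3.0)), check_log(tau))
```

The first check confirms that the image lies in the linear span of the two basis functions for several values of `alpha`, and the second that the logarithmic image differs from a fixed function by a function constant on each support.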
VI Equivalence of (ii), (iii) and (iv)
The implications (iv) $\Rightarrow$ (ii) $\Rightarrow$ (iii) are obvious. To see (iii) $\Rightarrow$ (iv), some results of information geometry are invoked.
Remark 1: The notion of affine connections appears only in this section. Since the implication (ii) $\Rightarrow$ (i) will be proved in the next section without using affine connections (at least explicitly), we do not need them in proving the equivalence of the conditions of Theorem 1 except for (iii).
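We first introduce some concepts from general differential geometry. Let $S$ be a smooth manifold and denote by $\mathfrak{X}(S)$ the set of smooth vector fields on $S$. Here, by a vector field on $S$ we mean a mapping, say $X$, such that $S \ni p \mapsto X_p \in T_p(S)$, where $T_p(S)$ denotes the tangent space of $S$ at $p$. An affine connection $\nabla$ on $S$ is represented by a mapping $\nabla : \mathfrak{X}(S) \times \mathfrak{X}(S) \to \mathfrak{X}(S)$, $(X, Y) \mapsto \nabla_X Y$, which is called a covariant derivative, satisfying certain conditions. Let $M$ be a smooth submanifold of $S$. Then $\nabla$ is naturally defined on $\mathfrak{X}(M) \times \mathfrak{X}(M)$, so that $\nabla_X Y$ is defined for any vector fields $X, Y$ on $M$. However, the value $\nabla_X Y$ in this case is a mapping $M \ni p \mapsto (\nabla_X Y)_p \in T_p(S)$ in general and is not a vector field on $M$ (i.e., $(\nabla_X Y)_p \in T_p(M)$ for all $p \in M$) unless
$$\forall X, Y \in \mathfrak{X}(M), \quad \nabla_X Y \in \mathfrak{X}(M). \tag{15}$$
When (15) holds for $M$, $M$ is said to be autoparallel w.r.t. $\nabla$ or $\nabla$-autoparallel in $S$.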
Let $\nabla$, $\nabla'$ and $\nabla''$ be affine connections on $S$ for which there exists a real number $a$ satisfying
$$\nabla'' = a \nabla + (1-a) \nabla'. \tag{16}$$
(Footnote 2) For arbitrary affine connections $\nabla$ and $\nabla'$, their affine combination $a \nabla + (1-a) \nabla'$ always becomes an affine connection.
If a submanifold $M$ is $\nabla$-autoparallel and $\nabla'$-autoparallel, then it is also $\nabla''$-autoparallel. This implication, which will be invoked later, is obvious from (16) and the autoparallelity condition (15).
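As was independently introduced by Čencov [1] and Amari [2], a one-parameter family of affine connections, which are called the $\alpha$-connections ($\alpha \in \mathbb{R}$), is defined on a manifold of probability distributions. After Amari’s notation, the $\alpha$-connection $\nabla^{(\alpha)}$ is written in the form of affine combination
$$\nabla^{(\alpha)} = \frac{1+\alpha}{2} \nabla^{(1)} + \frac{1-\alpha}{2} \nabla^{(-1)}, \tag{17}$$
which implies that
$$\nabla^{(\gamma)} = \frac{\gamma-\beta}{\alpha-\beta} \nabla^{(\alpha)} + \frac{\alpha-\gamma}{\alpha-\beta} \nabla^{(\beta)} \tag{18}$$
for any $\alpha, \beta, \gamma \in \mathbb{R}$ such that $\alpha \neq \beta$.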
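When a submanifold $M$ of $\mathcal{P}(\Omega)$ is autoparallel w.r.t. the $\alpha$-connection in $\mathcal{P}(\Omega)$, we say that $M$ is $\alpha$-autoparallel in $\mathcal{P}(\Omega)$. Since (18) is of the form (16), it follows that if $M$ is $\alpha$-autoparallel and $\beta$-autoparallel in $\mathcal{P}(\Omega)$ for some $\alpha \neq \beta$, then it is $\gamma$-autoparallel in $\mathcal{P}(\Omega)$ for all $\gamma \in \mathbb{R}$. On the other hand, it was shown in [5] (see also section 2.6 of [4]) that, for any submanifold $M$ of $\mathcal{P}(\Omega)$ and for any real number $\alpha$, $M$ is an $\alpha$-family if and only if $M$ is $\alpha$-autoparallel in $\mathcal{P}(\Omega)$. Combination of these two results proves (iii) $\Rightarrow$ (iv).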
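As a worked check of why two distinct members of the family determine all the others, the affine-combination identity can be verified directly. The derivation below is a sketch that assumes Amari's parametrization $\nabla^{(\alpha)} = \frac{1+\alpha}{2}\nabla^{(1)} + \frac{1-\alpha}{2}\nabla^{(-1)}$ and writes $\alpha, \beta, \gamma$ for three parameter values with $\alpha \neq \beta$; these symbol names are chosen here for illustration.

```latex
\begin{align*}
\frac{\gamma-\beta}{\alpha-\beta}\,\nabla^{(\alpha)}
  + \frac{\alpha-\gamma}{\alpha-\beta}\,\nabla^{(\beta)}
&= \frac{(\gamma-\beta)(1+\alpha) + (\alpha-\gamma)(1+\beta)}{2(\alpha-\beta)}\,\nabla^{(1)}
 + \frac{(\gamma-\beta)(1-\alpha) + (\alpha-\gamma)(1-\beta)}{2(\alpha-\beta)}\,\nabla^{(-1)} \\
&= \frac{(1+\gamma)(\alpha-\beta)}{2(\alpha-\beta)}\,\nabla^{(1)}
 + \frac{(1-\gamma)(\alpha-\beta)}{2(\alpha-\beta)}\,\nabla^{(-1)} \\
&= \frac{1+\gamma}{2}\,\nabla^{(1)} + \frac{1-\gamma}{2}\,\nabla^{(-1)}
 \;=\; \nabla^{(\gamma)} .
\end{align*}
```

The two coefficients on the left sum to one, so the left-hand side is an affine combination of two connections, which is the only property the autoparallelity argument uses.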
Remark 2: Since the e-connection and the m-connection are dual w.r.t. the Fisher information metric [3, 4, 5], condition (ii) is a special case of the double autoparallelity introduced by Ohara; see [6, 7] and the references cited there. It is pointed out in [7] that the $\alpha$-autoparallelity for all $\alpha$ follows from that for $\alpha = \pm 1$.
VII Proof of (ii) $\Rightarrow$ (i)
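Assume (ii), which means that there exist two affine subspaces $Z_e$ and $Z_m$ of $\mathbb{R}^{\Omega}$ such that
$$\log \tilde{M} = Z_e, \tag{19}$$
$$\tilde{M} = Z_m \cap \mathbb{R}_{++}^{\Omega}, \tag{20}$$
where $\log \tilde{M} := \{ \log \tau \mid \tau \in \tilde{M} \}$ and $(\log \tau)(\omega) := \log \tau(\omega)$. Let $V_e$ and $V_m$ be the linear spaces of translation vectors of $Z_e$ and $Z_m$, respectively, so that we have $Z_e = f + V_e$ for any $f \in Z_e$ and $Z_m = g + V_m$ for any $g \in Z_m$.
(Footnote 3) Actually, $Z_m$ is a linear space as mentioned in section IV, and therefore $V_m = Z_m$.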
Lemma 1
$V_e$ is closed w.r.t. multiplication of functions; i.e., $f, g \in V_e \Rightarrow fg \in V_e$, where the product $fg$ is defined by $(fg)(\omega) := f(\omega) g(\omega)$.
Proof.
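The map
$$\log : \tilde{M} \to Z_e, \quad \tau \mapsto \log \tau$$
is a diffeomorphism from $\tilde{M}$, which is an open subset of $Z_m$, onto $Z_e$. The differential map of $\log$ at a point $\tau \in \tilde{M}$ is defined by
$$(d \log)_\tau : \ \frac{d}{dt} \tau(t) \Big|_{t=0} \ \mapsto \ \frac{d}{dt} \log \tau(t) \Big|_{t=0}$$
for any smooth curve $\tau(t)$ in $\tilde{M}$ with $\tau(0) = \tau$ and is represented as
$$(d \log)_\tau : V_m \to V_e, \quad g \mapsto \frac{g}{\tau}.$$
This gives a linear isomorphism from $V_m$ onto $V_e$. Therefore, for any two points $\tau, \sigma \in \tilde{M}$, we can define
$$(d \log)_\sigma \circ (d \log)_\tau^{-1} : V_e \to V_e, \quad f \mapsto \frac{\tau}{\sigma} f.$$
This means that, for any $\tau, \sigma \in \tilde{M}$ and any $f \in V_e$, we have $\frac{\tau}{\sigma} f \in V_e$. For arbitrary $\sigma \in \tilde{M}$ and $f \in V_e$, let us define a map $F : \tilde{M} \to V_e$ by
$$F(\tau) := \frac{\tau}{\sigma} f.$$
Then its differential at a point $\tau \in \tilde{M}$ is given by
$$(dF)_\tau : V_m \to V_e, \quad g \mapsto \frac{g}{\sigma} f.$$
Composing this map with the inverse of
$$(d \log)_\sigma : V_m \to V_e, \quad g \mapsto \frac{g}{\sigma},$$
we have
$$(dF)_\tau \circ (d \log)_\sigma^{-1} : V_e \to V_e, \quad h \mapsto h f.$$
This proves that $h, f \in V_e \Rightarrow hf \in V_e$. ∎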
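Lemma 2
$V_e$ contains the constant functions on $\Omega$.
Proof.
From the definition (7) of $\tilde{M}$, for any $\tau \in \tilde{M}$ and any positive constant $c$, we have $c \tau \in \tilde{M}$. This implies that both $\log \tau$ and $\log (c\tau) = \log \tau + \log c$ belong to $Z_e$, and hence the translation by the constant function $\log c$ belongs to $V_e$. ∎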
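These two lemmas state that $V_e$ is a subalgebra of the commutative algebra $\mathbb{R}^{\Omega}$ with the unit element $1$ (: the constant function $\omega \mapsto 1$) of $\mathbb{R}^{\Omega}$ contained in $V_e$. From a well known result on such subalgebras (see footnote 4), it is concluded that there exists a partition $\{A_i\}_{i=0}^{n}$ of $\Omega$ such that
$$V_e = \Bigl\{ \sum_{i=0}^{n} c_i 1_{A_i} \Bigm| (c_0, \ldots, c_n) \in \mathbb{R}^{n+1} \Bigr\}. \tag{21}$$
(Footnote 4) Although various mathematical extensions of this result including infinite-dimensional and/or noncommutative versions are known, the author of the present paper could find no appropriate reference describing the result for the finite-dimensional commutative case with an elementary proof. So, we give a proof in the appendix for the readers’ sake.
Let an element $f_0$ of $Z_e$ be arbitrarily fixed. Then we have
$$Z_e = f_0 + V_e. \tag{22}$$
From (19), (21) and (22) and the disjointness of $\{A_i\}$, we have
$$\tilde{M} = \Bigl\{ \sum_{i=0}^{n} \tau_i q_i \Bigm| (\tau_0, \ldots, \tau_n) \in \mathbb{R}_{++}^{n+1} \Bigr\},$$
where
$$q_i(\omega) := \frac{e^{f_0(\omega)} 1_{A_i}(\omega)}{\sum_{\omega' \in A_i} e^{f_0(\omega')}}.$$
Then $\{q_i\}$ are probability distributions on $\Omega$ whose supports are $\{A_i\}$, and
$$M = \Bigl\{ \sum_{i=0}^{n} \lambda_i q_i \Bigm| (\lambda_0, \ldots, \lambda_n) \in \mathcal{P}_n \Bigr\}.$$
Since this is the same form as (4), condition (i) has been derived.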
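The last step, passing from an exponential-type description with block-constant directions to a positive combination of distributions with disjoint supports, can be sketched numerically. The partition `blocks`, the fixed function `f0` and the parameter values below are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical partition of a 5-point sample space and a hypothetical fixed function f0.
blocks = [[0, 1], [2, 3, 4]]
f0 = [0.3, -1.2, 0.0, 0.8, -0.5]
SIZE = len(f0)

def normalizers():
    """Sum of exp(f0) over each block."""
    return [sum(math.exp(f0[w]) for w in A) for A in blocks]

def q(i):
    """exp(f0) restricted to the i-th block and normalized to a probability distribution."""
    z = normalizers()[i]
    return [math.exp(f0[w]) / z if w in blocks[i] else 0.0 for w in range(SIZE)]

def exp_family_point(theta):
    """Pointwise exp(f0 + theta[i]) on the i-th block."""
    out = []
    for w in range(SIZE):
        i = next(k for k, A in enumerate(blocks) if w in A)
        out.append(math.exp(f0[w] + theta[i]))
    return out

def mixture_point(tau):
    """Positive combination sum_i tau[i] * q(i)."""
    return [sum(t * q(i)[w] for i, t in enumerate(tau)) for w in range(SIZE)]

theta = [0.4, -0.7]
# The coefficient of q(i) is exp(theta[i]) times the normalizer of the i-th block.
tau = [math.exp(t) * z for t, z in zip(theta, normalizers())]
lhs = exp_family_point(theta)
rhs = mixture_point(tau)
print(all(math.isclose(a, b) for a, b in zip(lhs, rhs)))
```

Because `theta` ranges over all real vectors, the coefficients `tau` range over all positive vectors, which is why the two descriptions give the same set.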
VIII Conclusion
We have shown Theorem 1 which gives an information-geometrical characterization of statistical models on finite sample spaces which are statistically equivalent to open probability simplexes $\mathcal{P}_n$. The statistical equivalence (also called the Markov equivalence) to probability simplexes played a crucial role in Čencov’s pioneering work [1] on information geometry, where the notions of Fisher information metric and the $\alpha$-connections were characterized in terms of the statistical equivalence. The present work sheds new light on the relation between the statistical equivalence and information geometry.
Acknowledgment
This work was partially supported by JSPS KAKENHI Grant Number 16K00012.
Appendix
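Proposition Let $\Omega$ be a finite set and $V$ be a subalgebra of $\mathbb{R}^{\Omega}$ containing the constant functions. Then there exists a partition $\{A_i\}$ of $\Omega$ such that
$$V = \Bigl\{ \sum_{i} c_i 1_{A_i} \Bigm| c_i \in \mathbb{R} \Bigr\}. \tag{23}$$
Proof.
Let
$$\mathcal{B} := \{ f^{-1}(a) \mid f \in V, \ a \in \mathbb{R} \}, \tag{24}$$
which is the totality of the level sets of functions in $V$. We first show that, for any $A \subset \Omega$,
$$A \in \mathcal{B} \iff 1_A \in V. \tag{25}$$
Since $\Leftarrow$ is obvious, it suffices to show $\Rightarrow$. Assume $A \in \mathcal{B}$, so that $A = f^{-1}(a)$ for some $f \in V$ and $a \in \mathbb{R}$. When $A$ is the empty set $\emptyset$, we have $1_A = 0 \in V$. So we assume $A \neq \emptyset$, which means that $a \in f(\Omega)$. Let the elements of $f(\Omega)$ be $a_0, a_1, \ldots, a_k$, where $a_0 = a$ and $a_i \neq a_j$ if $i \neq j$, and let $A_i := f^{-1}(a_i)$. Then we have $f = \sum_{i=0}^{k} a_i 1_{A_i}$ with $A_0 = A$. Let $P$ be a polynomial satisfying $P(a_0) = 1$ and $P(a_i) = 0$ for any $i \geq 1$. Explicitly, $P$ is expressed as
$$P(x) = \prod_{i=1}^{k} \frac{x - a_i}{a_0 - a_i}.$$
It follows that
$$P(f) = \sum_{i=0}^{k} P(a_i) 1_{A_i} = 1_{A_0} = 1_A.$$
In addition, $P(f)$ belongs to $V$ since $V$ is a subalgebra of $\mathbb{R}^{\Omega}$ with $1 \in V$. Hence we have $1_A \in V$.
Using (25), we see that
$$\Omega \in \mathcal{B}, \tag{26}$$
$$A \in \mathcal{B} \ \Rightarrow \ A^c \in \mathcal{B}, \tag{27}$$
$$A, B \in \mathcal{B} \ \Rightarrow \ A \cap B \in \mathcal{B}, \tag{28}$$
as
$$1_{\Omega} = 1, \qquad 1_{A^c} = 1 - 1_A, \qquad 1_{A \cap B} = 1_A 1_B.$$
Properties (26)-(28) imply that $\mathcal{B}$ is an additive class of sets ($\sigma$-algebra) on the finite entire set $\Omega$. Therefore, $\mathcal{B}$ is generated by a partition $\{A_i\}$ of $\Omega$ in the sense that every element of $\mathcal{B}$ is the union of some (or no) elements of $\{A_i\}$. Recalling the definition (24) of $\mathcal{B}$, we conclude (23).
∎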
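The two computational ingredients of this proof, the interpolating polynomial that turns a function into the indicator of one of its level sets and the partition into sets on which every function of the algebra is constant, can be sketched as follows. The functions `f` and `g` on a 5-point set are hypothetical generators, and the helper names are illustrative.

```python
from fractions import Fraction as F

def level_set_indicator(f, a):
    """Evaluate P(f) pointwise, where P is the Lagrange polynomial with
    P(a) = 1 and P(b) = 0 for every other value b taken by f."""
    others = sorted(set(f) - {a})
    result = []
    for value in f:
        prod = F(1)
        for b in others:
            prod *= F(value - b) / F(a - b)
        result.append(prod)
    return result

def partition_from_generators(generators):
    """Group the points on which every generator takes the same value."""
    blocks = {}
    for w in range(len(generators[0])):
        key = tuple(g[w] for g in generators)
        blocks.setdefault(key, []).append(w)
    return sorted(blocks.values())

# Hypothetical generators on the sample space {0,...,4}, given as lists of values.
f = [F(2), F(5), F(2), F(7), F(5)]
g = [F(1), F(1), F(1), F(0), F(1)]

ind = level_set_indicator(f, F(5))
blocks = partition_from_generators([f, g])
print(ind == [0, 1, 0, 0, 1], blocks)
```

Since `P(f)` is a polynomial in `f`, it stays inside any subalgebra that contains `f` and the constants, which is the step the proof relies on.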
- [1] N. N. Čencov (Chentsov), Statistical Decision Rules and Optimal Inference, AMS, 1982 (original Russian edition: Nauka, Moscow, 1972).
- [2] S. Amari, “Differential geometry of curved exponential families—curvature and information loss”, The Annals of Statistics, 10, 357–385, 1982.
- [3] S. Amari, Differential-Geometrical Methods in Statistics, Springer, Lecture Notes in Statistics 28, 1985.
- [4] S. Amari and H. Nagaoka, Methods of information geometry, AMS & OUP, 2000.
- [5] H. Nagaoka and S. Amari, “Differential geometry of smooth families of probability distributions”, Technical Report METR 82-7, Dept. of Math. Eng. and Instr. Phys, Univ. of Tokyo, 1982. (http://www.keisu.t.u-tokyo.ac.jp/research/techrep/data/1982/METR 82-07.pdf)
- [6] A. Ohara, “Information geometric analysis of an interior point method for semidefinite programming”, Geometry in Present Day Science (eds. O. E. Barndorff-Nielsen and E. B. V. Jensen), pp. 49–74, World Scientific, 1999.
- [7] A. Ohara, “Geodesics for dual connections and means on symmetric cones”, Integr. equ. oper. theory, 50, 537–548, 2004.
