Gibbs posterior convergence and the thermodynamic formalism
Kevin McGoff, Sayan Mukherjee, Andrew Nobel

Kevin McGoff (UNC Charlotte), Sayan Mukherjee (Duke University), and Andrew Nobel (UNC Chapel Hill)
Abstract.
In this paper we consider a Bayesian framework for making inferences about dynamical systems from ergodic observations. The proposed Bayesian procedure is based on the Gibbs posterior, a decision theoretic generalization of standard Bayesian inference. We place a prior over a model class consisting of a parametrized family of Gibbs measures on a mixing shift of finite type. This model class generalizes (hidden) Markov chain models by allowing for long range dependencies, including Markov chains of arbitrarily large orders. We characterize the asymptotic behavior of the Gibbs posterior distribution on the parameter space as the number of observations tends to infinity. In particular, we define a limiting variational problem over the space of joinings of the model system with the observed system, and we show that the Gibbs posterior distributions concentrate around the solution set of this variational problem. In the case of properly specified models our convergence results may be used to establish posterior consistency. This work establishes tight connections between Gibbs posterior inference and the thermodynamic formalism, which may inspire new proof techniques in the study of Bayesian posterior consistency for dependent processes.
Kevin McGoff would like to acknowledge funding from NSF grant DMS 1613261.
Sayan Mukherjee would like to acknowledge funding from NSF DEB-1840223, NIH R01 DK116187-01, HFSP RGP0051/2017, NSF DMS 17-13012, and NSF DMS 16-13261.
Andrew Nobel would like to acknowledge funding from NSF DMS-1613261, NSF DMS-1613072, NIH R01 HG009125-01.
1. Introduction
In this work we establish asymptotic results concerning Bayesian inference for certain dynamical systems. We consider a fairly general framework in which both the observations and the fitted models arise from dynamical systems. Our analysis brings together two distinct strands of research, both of which were originally inspired by connections with statistical physics: the thermodynamic formalism in dynamical systems, and the Gibbs posterior principle for Bayesian inference in statistics. Our work highlights the substantial connections between these two areas and shows that together they produce a natural framework for Bayesian inference about dynamical systems. Our general results guarantee the concentration of Gibbs posterior distributions around certain sets of parameters that are characterized by a variational principle. As applications of these general results, we also establish posterior consistency results for some classes of dynamical models, generalizing previous posterior consistency results for Markov and hidden Markov models in Bayesian nonparametrics.
1.1. Observed system
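Our inference framework consists of two main components. The first component is an observed dynamical system, defined as follows. Let $\mathcal{Y}$ be a complete separable metric space. Here and throughout this work we assume that all such spaces are endowed with their Borel $\sigma$-algebras, and we suppress this choice in our notation. Let $T : \mathcal{Y} \to \mathcal{Y}$ be a Borel measurable map. We let $\mathcal{M}(\mathcal{Y})$ denote the set of Borel probability measures on $\mathcal{Y}$, endowed with the weak∗ topology on measures. For $\nu \in \mathcal{M}(\mathcal{Y})$, we say that $\nu$ is invariant under $T$ if $\nu(T^{-1}A) = \nu(A)$ for all Borel sets $A \subset \mathcal{Y}$. The set of $T$-invariant measures in $\mathcal{M}(\mathcal{Y})$ is denoted by $\mathcal{M}(\mathcal{Y}, T)$. Furthermore, we say that $\nu \in \mathcal{M}(\mathcal{Y}, T)$ is ergodic if $\nu(A) \in \{0, 1\}$ for all Borel sets $A$ satisfying $T^{-1}A = A$. Our standard assumption is that the observed system has the form $(\mathcal{Y}, T, \nu)$, where $\nu \in \mathcal{M}(\mathcal{Y}, T)$ is ergodic.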
1.2. Model families
The second component of our inference framework is a collection of models. In order to model dynamics in the standard statistical setting, one typically considers (hidden) Markov models or more complex state space models. In our analysis we would like to be able to handle model processes with long range dependencies, and so we consider a general class of models known as Gibbs measures. This class of models strictly generalizes the class of finite state Markov models with arbitrarily large order.
Before giving a precise definition of a Gibbs measure, we must first introduce the underlying state space for such models, which is called a mixing shift of finite type (SFT). A shift of finite type is a dynamical system that is the topological analogue of a finite state aperiodic and irreducible Markov chain. SFTs have been widely studied in the dynamical systems literature, both for their own sake [32] and as model systems for some smooth systems such as Axiom A diffeomorphisms [7]. Furthermore, SFTs have substantial connections to statistical physics and other fields such as coding and information theory [32, 41].
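Here we give a proper definition for a mixing SFT. Let $\mathcal{A}$ be a finite set, known as an alphabet, and let $\Sigma = \mathcal{A}^{\mathbb{Z}}$ be the set of bi-infinite sequences $x = (x_i)_{i \in \mathbb{Z}}$ with values in $\mathcal{A}$. Define the left-shift map $\sigma : \Sigma \to \Sigma$ by $\sigma(x)_i = x_{i+1}$. A set $\mathcal{X} \subset \Sigma$ is called an SFT if there exist $n \geq 1$ and a collection of words $\mathcal{W} \subset \mathcal{A}^n$ such that $\mathcal{X}$ is exactly the set of sequences in $\Sigma$ that contain no words from $\mathcal{W}$:
$$\mathcal{X} = \bigl\{ x \in \Sigma : x_i x_{i+1} \cdots x_{i+n-1} \notin \mathcal{W} \text{ for all } i \in \mathbb{Z} \bigr\}.$$
Here $\mathcal{W}$ is called a set of forbidden words for $\mathcal{X}$. Note that by choosing $\mathcal{W} = \emptyset$, one obtains the full sequence space $\Sigma$, which is known as the full shift (on the alphabet $\mathcal{A}$). Also, we endow $\mathcal{A}$ with the discrete topology and $\Sigma$ with the product topology, which makes any such $\mathcal{X}$ closed and compact. We define the map $S : \mathcal{X} \to \mathcal{X}$ to be the restriction of the left shift $\sigma$ to $\mathcal{X}$. Let $L_m(\mathcal{X})$ denote the set of words of length $m$ (i.e., elements of $\mathcal{A}^m$) that appear in at least one point of $\mathcal{X}$, and let $L(\mathcal{X}) = \bigcup_{m \geq 0} L_m(\mathcal{X})$. An SFT $\mathcal{X}$ is said to be mixing if for any two words $u, v \in L(\mathcal{X})$, there exists $N \geq 0$ such that for all $m \geq N$, there exists a word $w \in L_m(\mathcal{X})$ such that $uwv \in L(\mathcal{X})$. The following equivalent definition is perhaps more intuitive to readers familiar with Markov chains. Let $A$ be the square matrix indexed by $L_n(\mathcal{X})$ defined for words $u, v \in L_n(\mathcal{X})$ by the rule
$$A_{uv} = \begin{cases} 1, & \text{if } u_2 \cdots u_n = v_1 \cdots v_{n-1}, \\ 0, & \text{otherwise.} \end{cases}$$
Then $\mathcal{X}$ is mixing if and only if there exists $N \geq 1$ such that $A^N$ contains all positive entries. Our standard assumption on $\mathcal{X}$ is that it is a mixing SFT.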
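The matrix criterion for mixing can be checked mechanically. The following Python sketch is illustrative only. It assumes the common presentation in which all forbidden words have length two, so that the 0/1 matrix can be indexed by single symbols, and the helper names `transition_matrix` and `is_mixing` are hypothetical rather than taken from any library. It tests whether some power of the matrix has all positive entries, using Wielandt's bound to limit the number of powers examined.

```python
def transition_matrix(alphabet, forbidden_pairs):
    """Return the 0/1 matrix A with A[i][j] = 1 iff the word (a_i, a_j) is allowed."""
    return [[0 if (a, b) in forbidden_pairs else 1 for b in alphabet]
            for a in alphabet]


def is_mixing(A):
    """Return True iff some power of the 0/1 matrix A has all entries positive."""
    n = len(A)
    # Wielandt's bound: if A is primitive, then A^k > 0 already for k = (n-1)^2 + 1.
    max_power = (n - 1) ** 2 + 1
    power = [row[:] for row in A]  # zero pattern of A^k, starting at k = 1
    for _ in range(max_power):
        if all(entry > 0 for row in power for entry in row):
            return True
        # Multiply by A, keeping only the zero pattern so entries stay small.
        power = [[1 if any(power[i][k] and A[k][j] for k in range(n)) else 0
                  for j in range(n)] for i in range(n)]
    return False


# Golden mean shift: alphabet {0, 1}, the word 11 is forbidden.
golden_mean = transition_matrix([0, 1], {(1, 1)})
# Period-two shift: both 00 and 11 are forbidden, so points alternate 0101...
alternating = transition_matrix([0, 1], {(0, 0), (1, 1)})

print(is_mixing(golden_mean))   # True
print(is_mixing(alternating))   # False
```

The second example is irreducible but periodic, which is why it fails the test: mixing corresponds to a chain that is both irreducible and aperiodic.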
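To model stochastic behavior on the topological system $(\mathcal{X}, S)$, we consider a family of $S$-invariant probability measures on $\mathcal{X}$, called Gibbs measures. To introduce Gibbs measures, one begins with a function $f : \mathcal{X} \to \mathbb{R}$, which is called a potential function (or just a potential). A Borel probability measure $\mu$ on $\mathcal{X}$ is said to be a Gibbs measure corresponding to the potential function $f$ if there exist constants $\mathcal{P} \in \mathbb{R}$ and $K > 0$ such that for all $x \in \mathcal{X}$ and $m \geq 1$,
$$K^{-1} \leq \frac{\mu\bigl(x[0, m-1]\bigr)}{\exp\Bigl(-\mathcal{P} m + \sum_{k=0}^{m-1} f(S^k x)\Bigr)} \leq K, \tag{1}$$
where $x[i, j]$ is the cylinder set of points $y$ in $\mathcal{X}$ such that $y_k = x_k$ for all $k = i, \dots, j$. The property in (1) is called the Gibbs property. By a celebrated result of Bowen [7], under mild regularity conditions on $f$, there is a unique Gibbs measure $\mu_f$ with potential function $f$, and furthermore the measure $\mu_f$ is ergodic. The constant $\mathcal{P} = \mathcal{P}(f)$ is called the pressure of $f$.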
The Gibbs measure is a generalization of the canonical ensemble in statistical physics to infinite systems. Potential functions have natural connections with Hamiltonians in the study of lattice systems in statistical physics. In considering inference, we will think of loss functions as potential functions. We remark (again) that the class of Gibbs measures strictly generalizes the class of Markov chains, allowing for arbitrarily long dependencies. Indeed, any Markov chain of order on the alphabet can be realized as a Gibbs measure by an appropriate choice of a potential function that depends on only coordinates. On the other hand, when the potential function depends on infinitely many coordinates, the corresponding Gibbs measure is not Markov of any order. In this way, our model families may include Markov chains with unbounded orders, which highlights the degree of dependence allowed by our framework.
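The claim that a Markov chain can be realized as a Gibbs measure can be checked numerically in a small case. The sketch below is an illustration under stated assumptions: the two-state transition matrix `P` and its stationary vector `pi` are made-up values, the potential is taken to be the logarithm of the transition probability from the current symbol to the next one, and the pressure is zero for this choice. The code computes, over all words, the ratio of the cylinder probability to the exponentiated sum of the potential along the orbit. For a Markov chain this ratio reduces to a stationary probability divided by a single transition probability, so it stays within fixed bounds for every word length.

```python
import math
from itertools import product

P = [[0.9, 0.1], [0.4, 0.6]]  # illustrative transition matrix (assumption)
pi = [0.8, 0.2]               # its stationary distribution: pi P = pi


def cylinder_measure(word):
    """Probability that the stationary chain starts with the given word."""
    prob = pi[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= P[a][b]
    return prob


def potential_sum(word_ext):
    """Sum of f over the first m shifts, where f(x) = log P[x_0][x_1].

    The m-term sum reads m + 1 symbols, so word_ext has length m + 1.
    """
    return sum(math.log(P[a][b]) for a, b in zip(word_ext, word_ext[1:]))


def gibbs_ratio_range(m):
    """Min and max of mu(x[0, m-1]) / exp(sum_{k<m} f(S^k x)) over all words."""
    ratios = [cylinder_measure(w[:m]) / math.exp(potential_sum(w))
              for w in product(range(2), repeat=m + 1)]
    return min(ratios), max(ratios)


for m in (1, 3, 6, 9):
    low, high = gibbs_ratio_range(m)
    print(f"m = {m}: ratio in [{low:.4f}, {high:.4f}]")
```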
Lastly, let us mention the regularity condition that we require our model families to satisfy. For points in , we let denote the infimum of all such that . Then we define a metric on by setting . For , we let denote the set of continuous functions from to with Hölder exponent , that is, the set of functions for which there exists a constant such that for all ,
[TABLE]
Furthermore, we endow with the topology induced by the norm , where
[TABLE]
Now we define the regularity condition necessary for our model families.
Definition 1.
Let be a compact metric space. A parametrized family of potential functions will be called a regular family if there exists such that and the map is continuous in the topology induced by the norm .
If a family is a regular family, then the map is continuous in the weak∗ topology on measures, and the constants and that appear in (1) depend continuously on (see [3]). Furthermore, since is compact, we get a uniform Gibbs property: there exists a uniform constant and a continuous function such that for all , , and ,
[TABLE]
We assume throughout that is a regular family of potential functions, and our model class consists of the corresponding parametrized family of Gibbs measures .
1.3. Inference
The inference paradigm we consider is known as Gibbs posterior inference, which is a generalization of the standard Bayesian inference framework. The basic idea behind the Gibbs posterior [6, 25] is to replace the likelihood with an exponentiated loss or utility function in the standard Bayesian procedure for updating beliefs about an unknown parameter of interest $\theta$. Whereas the standard Bayes posterior takes the form
$$\text{posterior}(\theta \mid \text{data}) \propto \text{likelihood}(\text{data} \mid \theta) \times \text{prior}(\theta),$$
the Gibbs posterior has the form
$$\text{posterior}(\theta \mid \text{data}) \propto \exp\bigl(-\ell(\theta, \text{data})\bigr) \times \text{prior}(\theta),$$
where $\ell(\theta, \text{data})$ is the loss associated with $\theta$ based on the observed data. When the loss function is the negative log-likelihood then the two paradigms are identical. The original motivation for the Gibbs posterior was to specify a coherent procedure for Bayesian inference when the parameter of interest is connected to observations via a loss function, rather than the classical setting where the likelihood or true sampling distribution is known; see [6] for more arguments in favor of the Gibbs posterior and discussion about how the Gibbs posterior framework addresses model misspecification and robustness to nuisance parameters. Note that in the general Gibbs posterior framework without a likelihood, there is no generative model assumed for the observations.
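The relationship between the two update rules can be seen on a finite parameter grid. The following Python sketch is a toy illustration, not part of the paper's framework: the Bernoulli model, the three grid values, the uniform prior, and the data are all assumptions chosen for brevity. It shows that the Gibbs update with the negative log-likelihood as loss reproduces the Bayes posterior, while a different loss (here a squared loss) yields a different, but still well-defined, posterior.

```python
import math


def normalize(weights):
    """Rescale nonnegative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def bayes_posterior(prior, likelihoods):
    """Posterior proportional to likelihood times prior."""
    return normalize([p * lik for p, lik in zip(prior, likelihoods)])


def gibbs_posterior(prior, losses):
    """Posterior proportional to exp(-loss) times prior."""
    return normalize([p * math.exp(-loss) for p, loss in zip(prior, losses)])


thetas = [0.2, 0.5, 0.8]      # hypothetical grid of success probabilities
prior = [1.0 / 3] * 3         # uniform prior on the grid
data = [1, 0, 1, 1]           # made-up binary observations

likelihoods = [math.prod(t if y else 1.0 - t for y in data) for t in thetas]
neg_log_lik = [-math.log(lik) for lik in likelihoods]
squared_loss = [sum((y - t) ** 2 for y in data) for t in thetas]

bayes = bayes_posterior(prior, likelihoods)
gibbs_nll = gibbs_posterior(prior, neg_log_lik)
gibbs_sq = gibbs_posterior(prior, squared_loss)

print("Bayes:               ", bayes)
print("Gibbs (neg. log-lik):", gibbs_nll)
print("Gibbs (squared loss):", gibbs_sq)
```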
We consider models indexed by a compact metric space (with metric denoted ), which will serve as a parameter space. The elements of will be used to parametrize both the dependence structure of the Gibbs measures in our model class (e.g., transition probabilities) and the relationship between states and observations (e.g., emission probabilities). Recall that as part of our standard assumptions, we assume that our model class is a family of Gibbs measures on corresponding to a regular family of potential functions (as in Definition 1).
Also recall that the observed system has state space with invariant measure . Define a metric on by the rule
[TABLE]
Here and throughout this work, we assume that we have a loss function satisfying the following conditions:
- (i) is continuous;
- (ii) there exists a measurable function such that for all , , and ;
- (iii) for each there exists a measurable function such that for each ,
[TABLE]
and .
Condition (ii) is an integrability condition on the loss, while condition (iii) is a requirement on the modulus of continuity of the loss. In Section 1.5 we provide examples of loss functions satisfying these conditions. Note that including the parameter in the loss function may be considered non-standard in statistics. However, this formulation will simplify notation throughout the paper, and in Section 1.5 we establish that this setting is equivalent to the standard one. Also note that the dependence of the loss on and on the uncountable space allows us to model continuous observations and emission probabilities.
With the loss function and parameter fixed, we define the loss of the finite sequence with respect to a finite sequence of observations to be the sum of the per-state losses:
[TABLE]
When and are initial segments of trajectories of and , respectively, we write instead of .
Let us now give the definition of Gibbs posterior distributions on . Here we consider the subjective case, in which one begins with a prior distribution on . Let be a fully supported Borel probability measure on , which will serve as our prior distribution. First we extend to form a prior distribution on as follows. Given the family of Gibbs measures on , consider the induced prior distribution on defined for any Borel set by
[TABLE]
According to the Gibbs posterior paradigm [6, 25], if we make observations , then our updated beliefs should be represented by the Gibbs posterior distribution. This distribution is the Borel probability measure on defined for Borel sets by
[TABLE]
where is the normalizing constant (partition function), given by
[TABLE]
Then the Gibbs posterior distribution on is simply the -marginal of , which is defined for Borel sets by
[TABLE]
As we are considering a Bayesian framework, all inference about the parameters based on the observations is derived from the posterior . We focus here on inference regarding the parameters in , since inference regarding the initial condition in is known to be impossible for many dynamical systems, including shifts of finite type [28, 29]. Let us summarize our framework.
- We begin with a fully supported prior on a compact set that smoothly parametrizes a family of Gibbs measures on .
- From and , we create an extended prior on .
- We obtain observations in from a stationary ergodic process .
- From , the observations, and the loss function we obtain the Gibbs posterior on .
- Finally, we marginalize to get the posterior on .
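The steps summarized above can be sketched end to end in a toy setting. The Python code below is a minimal illustration under several assumptions that are not from the paper: the parameter space is a three-point grid, each model is a symmetric two-state Markov chain (a simple Gibbs measure on the full shift) that keeps its state with probability `theta`, the loss is a scaled squared difference between the hidden state and the observation, and the observations are made up. Because the cumulative loss depends only on the first few coordinates of the model trajectory, the integral over the model state space reduces to a finite sum, which is computed here with a forward recursion.

```python
import math


def transition(theta):
    """Two-state chain that keeps its state with probability theta."""
    return [[theta, 1.0 - theta], [1.0 - theta, theta]]


def loss(state, obs, scale=4.0):
    """Per-step squared loss between a state in {0, 1} and a real observation."""
    return scale * (state - obs) ** 2


def integrated_weight(theta, ys):
    """Integral of exp(-cumulative loss) against the chain's stationary law."""
    P = transition(theta)
    # The stationary distribution of a symmetric two-state chain is uniform.
    alpha = [0.5 * math.exp(-loss(a, ys[0])) for a in (0, 1)]
    for y in ys[1:]:
        alpha = [sum(alpha[a] * P[a][b] for a in (0, 1)) * math.exp(-loss(b, y))
                 for b in (0, 1)]
    return sum(alpha)


def gibbs_posterior(thetas, prior, ys):
    """Parameter marginal of the Gibbs posterior on a finite grid."""
    weights = [p * integrated_weight(t, ys) for t, p in zip(thetas, prior)]
    z_n = sum(weights)  # normalizing constant (partition function)
    return [w / z_n for w in weights]


thetas = [0.5, 0.7, 0.9]          # hypothetical parameter grid
prior = [1.0 / 3] * 3             # fully supported prior on the grid
ys = [0.0] * 10 + [1.0] * 10      # made-up observations with a single switch

posterior = gibbs_posterior(thetas, prior, ys)
for t, p in zip(thetas, posterior):
    print(f"theta = {t}: posterior mass {p:.4f}")
```

The forward recursion is a design choice for efficiency; summing over all words of the observed length gives the same value but at exponential cost.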
1.4. Main results
Our analysis begins with an examination of the exponential growth rate of the (random) partition function for large . In particular, we establish a variational principle for the almost sure limit of as tends to infinity.
Theorem 1.
Under the standard assumptions stated above there exists a lower semicontinuous function such that for -almost every ,
[TABLE]
Remark 1.
The compactness of and lower semicontinuity of ensure that the infimum in Theorem 1 is attained. The conclusion of the theorem is similar to a large deviations principle (see for example [15]), with playing the role of the rate function. For this reason, we refer to as the rate function in this setting. A detailed discussion of appears in Section 3, where we show that can be expressed as the sum of an expected loss term and a divergence term.
The variational expression that appears in Theorem 1 suggests that we focus on the (non-empty, compact) set of parameters that minimize this expression. Let
[TABLE]
In our second main result, we establish that the Gibbs posterior distribution must concentrate around this set.
Theorem 2.
For any open neighborhood of , for -almost every , we have
[TABLE]
In light of this result, it is possible to answer questions about Gibbs posterior consistency by analyzing the variational problem defining . We illustrate this approach to posterior consistency in several applications (see Section 2).
Remark 2 (Optimality of ).
One may wonder whether actually concentrates around a strict subset of . Proposition 8 addresses this question on the exponential scale. It states that if is open and intersects , then the posterior probability of cannot be exponentially small as tends to infinity, i.e., for -almost every , the quantity tends to zero as tends to infinity.
Remark 3 (Ground states and MAP).
From a thermodynamic perspective, it is natural to introduce an inverse temperature parameter and consider the new loss function . In this setting, one would like to understand what happens as tends to infinity. In Section 3.7, we identify the limit of both and as tends to infinity in terms of variational problems considered in previous work [36].
An inverse temperature parameter has also been used in practice to perform maximum a posteriori (MAP) estimation. MAP estimation is a common alternative to fully Bayesian inference that is used in both statistics and machine learning. It involves finding the parameter that is the posterior mode. The motivation for MAP estimation is often computational efficiency and the lack of a need for uncertainty quantification. The idea of adding an inverse temperature parameter ($\beta$) to a Gibbs distribution for MAP estimation was introduced for Bayesian models in a seminal paper by Geman and Geman [18], who also gave an annealing schedule to increase the inverse temperature with a provable guarantee for finding the posterior mode.
Remark 4 (Connections to penalization).
The formulation of Bayesian updating as a variational problem with an entropic penalty has been previously explored [6, 51], and these ideas are related to Jaynes’ maximum entropy formulation of Bayesian inference [23]. In both [6] and [51], posterior inference was formulated as follows: given a loss function and a prior , the posterior distribution is
[TABLE]
where is the relative entropy between and . The function being minimized above has close connections to the function in Theorem 1; see Definition 3 below.
Remark 5 (Convergence of full Gibbs posteriors).
Our main results establish the concentration of the -marginal posterior distributions around the limit set . In contrast, the -marginal of the full posterior distribution need not concentrate around any particular subset of (according to the negative results of [28, 29]). Nonetheless, Proposition 9 gives a characterization of any Cesàro limit of the full posteriors.
Remark 6 (Importance of the Gibbs property).
The Gibbs property (1) of the measures makes them particularly suitable as model distributions for the purposes of Gibbs posterior inference. In general, invariant measures for dynamical systems do not admit such exponential estimates, and it is precisely these estimates that allow us to carry out our asymptotic analysis.
Remark 7 (Continuous model systems).
Results similar to those here may be established for certain families of differential dynamical systems on manifolds. In particular, using our results and the well-known connections between SFTs and Axiom A systems (see [7]), it is possible to establish analogous conclusions for Axiom A diffeomorphisms with Gibbs measures.
1.5. Examples of inference settings and associated loss functions
Here we describe some possible inference settings that fit into our framework. Note that these settings give examples of loss functions that satisfy conditions (i)-(iii).
Example 1.
(Continuous, deterministic observations) Suppose that the state space of the observed system is a subset of the real line, so that the observations are real-valued and deterministic. In this case, we might wish to fit the observations to a family of models generated by a family of Gibbs measures on a fixed mixing SFT , with associated prior , and a continuously parametrized family of continuous observation functions. Given and , the initial part of the real-valued sequence can be fit to the observations. Models of this sort are called dynamical models, and they have been studied in the context of empirical risk minimization in [37]. If the measure has finite second moment, and is the squared loss , then conditions (i)-(iii) on the loss are satisfied.
Example 2.
(Discrete observations) Let and be finite sets. We suppose that we make -valued observations, that is, , and we wish to model these observations with a family of Gibbs measures on a mixing SFT with . Further, let be an observation function, so that a point in gives rise to the -valued sequence . Let be the discrete loss, . Then the conditions (i)-(iii) on the loss are satisfied.
Example 3.
(Family of conditional likelihoods) Suppose that is a family of conditional densities on with respect to a common Borel measure on . Here is the conditional likelihood of a single observation given the parameter and system state . Under appropriate continuity and integrability conditions on the family of likelihoods, the negative log-likelihood function, , satisfies conditions (i)-(iii). In this situation, the Gibbs posterior is the same as the standard Bayes posterior. Furthermore, the dependence of the loss on the parameter allows one to parametrize the conditional observation densities, as in the parametrization of emission densities in the study of hidden Markov models. Note that in the Gibbs posterior framework, the true observation system may be fully misspecified; it need not be related to any of the generative processes implied by the family of Gibbs measures and conditional likelihoods.
2. Applications
In this section we present two applications of our main results on Gibbs posterior consistency to standard posterior consistency for two models. In the first, we establish Bayesian posterior consistency for direct observations of Gibbs processes. Interestingly, the proof of this result may be reduced to a classical result of Bowen on uniqueness of equilibrium states. In the second application, we establish Bayesian posterior consistency for hidden Gibbs processes. This result generalizes previous results on posterior consistency for hidden Markov models by allowing substantially more dependence in the hidden processes, including families of Markov chains with unbounded orders.
2.1. Direct observations of Gibbs processes
Let be a mixing SFT, and let be a regular family of Hölder potential functions (as in Definition 1). We let be the unique Gibbs measure associated with . Additionally, let be a fully supported prior distribution on .
Here we consider the properly specified situation with direct observations. That is, we suppose that there exists such that our observations are the coordinates of the ergodic system , observed without noise. In other words, the observation at time is the th coordinate of the stationary ergodic process with distribution . In this case, the likelihood of observing under is simply given by , where . Let be the standard Bayesian posterior distribution on given the observations , i.e., for Borel sets ,
[TABLE]
Let denote the identifiability class of , which is naturally defined as the set of all such that . The following theorem states that concentrates around , establishing posterior consistency in this setting.
Theorem 3.
Let be an open neighborhood of . Then with probability one (with respect to ), we have
[TABLE]
The proof of Theorem 3 is based on the principal results above. In particular, we apply Theorem 2 to establish that the posterior concentrates around a set , which is characterized as the solution set of a variational problem. We then use the variational characterization of and additional problem-specific arguments to show that is equal to the identifiability class . In this example, these problem-specific arguments rely on one of the foundational results of Bowen [7] in the thermodynamic formalism, namely the uniqueness of equilibrium states for Hölder continuous potentials on a mixing SFT.
2.2. Hidden Gibbs processes
In this section we consider posterior consistency for more general observation processes. Let be a mixing SFT, a regular family of Hölder potential functions (as in Definition 1), and the corresponding family of Gibbs measures. Let be any fully supported prior distribution on .
The novel feature of the present setting is that we allow for general observations of the underlying family. Suppose that is a -finite Borel measure on a complete separable metric space , and that is a jointly continuous function such that for all and ,
[TABLE]
We regard as a family of conditional likelihoods for given and . We assume that the function given by satisfies the integrability and regularity conditions (i)-(iii) from Section 1.3. Furthermore, we require condition (L2) from [34], which stipulates that there exists and a Borel measurable function such that for each , the function is -Hölder continuous with constant , and for each ,
[TABLE]
This condition may be viewed as a condition on the regularity of the conditional density functions; it is used in [34] to control the likelihood function in the large deviations regime.
With these conditions in place, we assume that the conditional likelihood of observing given is
[TABLE]
and that the likelihood of observing given is
[TABLE]
In other words, for each , we have an observed sequence generated as follows: select according to and for each let have density with respect to . Denote by the process measure for the process , which has likelihood .
Now let be the standard Bayesian posterior distribution on given observations based on the prior and the likelihood : for Borel sets ,
[TABLE]
We again consider the properly specified case, in which there exists a parameter such that the observed process is drawn from . In order to address posterior consistency, we define the identifiability class of , denoted , to be the set of such that ; in other words, a parameter is in if its associated process has the same distribution as the process generated by . The following result establishes posterior consistency in this setting.
Theorem 4.
Let be an open neighborhood of . Then
[TABLE]
The proof of Theorem 4 is based on the principal results above. In particular, we use these results to establish convergence of the posterior distribution, and then we use problem-specific arguments to prove that the limit set is equal to the identifiability set . In this case, the problem-specific arguments rely on previously studied connections between large deviations for Gibbs measures and identifiability of observed systems [34].
3. Joinings, divergence, and the rate function
In this section we discuss the rate function , whose existence is asserted by Theorem 1. In order to provide a thorough discussion, we first recall some background material from ergodic theory, including joinings and fiber entropy.
3.1. Joinings
Joinings were introduced by Furstenberg [16], and they have played an important role in the development of ergodic theory (see [11, 20]). Suppose and are two probability measure-preserving Borel systems with and . The product transformation is defined by . A joining of these two systems is a Borel probability measure on with marginal distributions and that is invariant under the product transformation . Thus, a joining is a coupling of the measures and that is also invariant under the joint action of the transformations and ; the former condition concerns the invariant measures of the two systems, while the latter concerns their dynamics. Let denote the set of all joinings of and . Note that this set is non-empty, since the product measure is always a joining. When the transformation is fixed but we have not associated any invariant measure with it, we set
[TABLE]
which is the family of joinings of with all systems of the form , with .
3.2. Entropy
Our statements and proofs also require us to introduce some notions from the entropy theory of dynamical systems. Let be a compact metric space, continuous, and . For any finite measurable partition of , we define
[TABLE]
where by convention. For , let , and for any partitions , define their join to be the mutual refinement
[TABLE]
For , let . By standard subadditivity arguments, the following limit exists:
[TABLE]
The measure-theoretic or Kolmogorov-Sinai entropy of with respect to is given by , where the supremum is taken over all finite measurable partitions of . We note for future reference that for any , the value remains the same if the supremum is instead taken over all finite measurable partitions with diameter less than . When the transformation is clear from context, we may omit the subscript.
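The limit defining the entropy of a partition can be observed numerically in a simple case. The Python sketch below is an illustration under assumptions not made in the text: the system is the shift on two symbols, the invariant measure is a stationary two-state Markov chain with made-up transition matrix `P`, and the partition is the one determined by the time-zero symbol, so that its n-fold join consists of the cylinder sets of length n. The normalized partition entropies are compared with the standard closed-form entropy rate of a stationary Markov chain.

```python
import math
from itertools import product

P = [[0.9, 0.1], [0.4, 0.6]]  # illustrative transition matrix (assumption)
pi = [0.8, 0.2]               # its stationary distribution: pi P = pi


def cylinder_measure(word):
    """Measure of the cylinder set determined by the given word."""
    prob = pi[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= P[a][b]
    return prob


def partition_entropy(n):
    """Entropy of the n-fold join of the time-zero partition (length-n cylinders)."""
    total = 0.0
    for word in product(range(2), repeat=n):
        p = cylinder_measure(word)
        if p > 0:  # convention: 0 log 0 = 0
            total -= p * math.log(p)
    return total


def entropy_rate():
    """Closed form for a stationary Markov chain: -sum_i pi_i sum_j P_ij log P_ij."""
    return -sum(pi[i] * P[i][j] * math.log(P[i][j])
                for i in range(2) for j in range(2))


for n in (1, 2, 4, 8, 12):
    print(n, partition_entropy(n) / n)
print("limit", entropy_rate())
```

For a Markov chain the partition entropy grows exactly linearly after the first step, so the normalized values decrease toward the entropy rate at speed of order 1/n.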
3.3. The variational principle for pressure
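Let $(\mathcal{X}, S)$ be a mixing SFT, and let $f : \mathcal{X} \to \mathbb{R}$ be a Hölder continuous potential. The variational principle [7] for the pressure states that
$$\mathcal{P}(f) = \sup\Bigl\{ h(\mu) + \int f \, d\mu : \mu \in \mathcal{M}(\mathcal{X}, S) \Bigr\},$$
and furthermore, the supremum is achieved by the measure $\mu$ if and only if $\mu$ is the Gibbs measure associated with $f$.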
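The variational principle can be illustrated numerically when the potential depends on only two coordinates. The Python sketch below rests on assumptions that go beyond the text: the system is the full shift on two symbols, the potential values `phi` are made up, the pressure is computed as the logarithm of the spectral radius of the matrix with entries exp(phi), and the maximizing measure is built by the standard eigenvector construction of a Markov chain. Only Markov measures are compared, so the code checks the statement on a subfamily of invariant measures rather than proving it.

```python
import math

phi = [[0.0, 1.0], [-0.5, 0.3]]  # illustrative potential values phi(x_0, x_1)
M = [[math.exp(v) for v in row] for row in phi]


def perron(matrix, steps=500):
    """Power iteration: spectral radius and positive eigenvector summing to one."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    # Since v sums to one and matrix * v = rho * v, the entries of matrix * v sum to rho.
    rho = sum(sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    return rho, v


def free_energy(P, pi):
    """Entropy plus the integral of phi for the stationary Markov measure (P, pi)."""
    return sum(pi[i] * P[i][j] * (phi[i][j] - math.log(P[i][j]))
               for i in range(2) for j in range(2))


rho, r = perron(M)                                 # right eigenvector
_, l = perron([list(col) for col in zip(*M)])      # left eigenvector
pressure = math.log(rho)

# Candidate maximizer: P_ij = M_ij r_j / (rho r_i), pi_i proportional to l_i r_i.
P_eq = [[M[i][j] * r[j] / (rho * r[i]) for j in range(2)] for i in range(2)]
norm = sum(l[i] * r[i] for i in range(2))
pi_eq = [l[i] * r[i] / norm for i in range(2)]

# A different Markov measure for comparison: independent fair coin flips.
P_other = [[0.5, 0.5], [0.5, 0.5]]
pi_other = [0.5, 0.5]

print("pressure:             ", pressure)
print("maximizer free energy:", free_energy(P_eq, pi_eq))
print("coin-flip free energy:", free_energy(P_other, pi_other))
```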
3.4. Disintegration of measure
The following result is a special case of standard results on disintegration of Borel measures (see [20]).
Theorem (Disintegration of measure).
Let and be standard Borel spaces, and be the natural projection. Let , and let be its image in . Then there is a Borel map , from to such that for every bounded Borel function ,
[TABLE]
Moreover, such a map is unique in the following sense: if is another such map, then for -almost every .
Note that if is a joining, then the family satisfies an important invariance property, which we state as Lemma 3 in Section 5.3.
3.5. Fiber entropy
Now we give a definition of fiber entropy, along with statements of some properties relevant to this work; for a thorough introduction, see [27]. Let be a compact metric space and be a separable complete metric space. Further, let be continuous and be Borel measurable. For any Borel probability measure on with -marginal , let be its disintegration over . Then for any finite measurable partition of , we define
[TABLE]
Now suppose . It is possible to show (see, e.g., [26]) that if and is its disintegration over , then for every finite measurable partition of the following limit exists:
[TABLE]
where . Furthermore, when is ergodic, it can be shown (again see [26]) that for almost every ,
[TABLE]
The fiber entropy of over is defined as , where the supremum is taken over all finite measurable partitions of . Note that the supremum may also be taken over partitions with diameter less than any . The fiber entropy quantifies the relative entropy of over .
3.6. Divergence terms
Consider a parameter and a joining . We would like to quantify the divergence of the joining to the product measure , as it will play a role in the rate function . (Note that the measure may be interpreted as a prior distribution on given , as the prior on is assumed to be independent of the observations.) However, the standard KL-divergence is insufficient for our purposes, since any two ergodic measures for a given system are known to be mutually singular, and hence their KL-divergence will be infinite. Instead, we make the following definitions, which are more suitable for dynamical systems.
Given two Borel probability measures and on a compact metric space and a finite measurable partition of , we write whenever implies that for . Let
[TABLE]
where for any by convention. Note that is the KL-divergence from to with respect to the partition , which is nonnegative.
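As a minimal sketch of the partition-level divergence just defined, the following Python function computes the KL-divergence between two probability measures that have been reduced to their masses on the cells of a finite partition. The name `partition_kl` and the representation of each measure as a list of cell masses are assumptions made for illustration. Returning infinity when absolute continuity fails on the partition is the usual convention and is assumed here.

```python
import math

def partition_kl(lam, mu):
    """KL-divergence from lam to mu, both given as lists of cell masses."""
    if len(lam) != len(mu):
        raise ValueError("both measures must be indexed by the same partition")
    total = 0.0
    for l, m in zip(lam, mu):
        if l == 0.0:
            continue  # convention: 0 * log(0 / x) = 0
        if m == 0.0:
            return math.inf  # lam charges a cell that mu does not
        total += l * math.log(l / m)
    return total

# Hypothetical masses on a three-cell partition.
lam = [0.5, 0.5, 0.0]
mu = [0.25, 0.25, 0.5]
print(partition_kl(lam, mu))  # equals log 2
print(partition_kl(mu, lam))  # infinite: mu charges the third cell, lam does not
```

The asymmetry of the two calls illustrates why the absolute continuity condition on the partition matters before the sum is formed.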
Now consider a Hölder continuous potential on a mixing SFT with associated Gibbs measure . Let be the partition of into cylinder sets of the form for some , and let be ergodic. In this situation, it is known [8] that
[TABLE]
where we recall that is the pressure of , the partition is defined to be , and is the entropy of with respect to . Next we generalize this result to handle the relative situation, which involves joinings and relative entropy.
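To make the cylinder-set limit concrete, here is a hedged sketch for the simplest special case: two stationary Markov chains on two symbols. A Markov measure is a Gibbs measure for the potential given by the log transition probability, whose pressure is zero, so in this case the limit reduces to the classical relative entropy rate. The matrices `Q` and `P` and all function names are hypothetical choices for illustration.

```python
import itertools
import math

def stationary(P):
    """Stationary distribution of a two-state transition matrix P."""
    a, b = P[0][1], P[1][0]
    return [b / (a + b), a / (a + b)]

def cylinder_prob(word, P, pi):
    """Markov measure of the cylinder set determined by `word`."""
    prob = pi[word[0]]
    for i, j in zip(word, word[1:]):
        prob *= P[i][j]
    return prob

def cylinder_kl(n, Q, P):
    """KL-divergence between two Markov measures on length-n cylinder sets."""
    rho, pi = stationary(Q), stationary(P)
    total = 0.0
    for word in itertools.product((0, 1), repeat=n):
        q = cylinder_prob(word, Q, rho)
        p = cylinder_prob(word, P, pi)
        total += q * math.log(q / p)
    return total

def divergence_rate(Q, P):
    """Relative entropy rate of the chain Q with respect to the chain P."""
    rho = stationary(Q)
    return sum(rho[i] * Q[i][j] * math.log(Q[i][j] / P[i][j])
               for i in range(2) for j in range(2))

Q = [[0.9, 0.1], [0.3, 0.7]]  # plays the role of the observed ergodic measure
P = [[0.6, 0.4], [0.5, 0.5]]  # plays the role of the Gibbs (model) measure
rate = divergence_rate(Q, P)
for n in (2, 6, 10):
    print(n, cylinder_kl(n, Q, P) / n, rate)
```

For Markov chains the normalized cylinder divergence differs from the rate by a term of order 1/n, so the printed values approach `rate` as n grows.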
Lemma 1.
Let be a Hölder continuous potential on a mixing SFT with associated Gibbs measure . Let be the partition of into cylinder sets of the form , and let be ergodic. Then for -almost every ,
[TABLE]
We defer the proof of Lemma 1 to Section 5.6. Based on this lemma, we make the following definition.
Definition 2.
Let be a mixing SFT, a Hölder continuous function, and the associated Gibbs measure. Further, let be an ergodic system. Then define the relative divergence rate of to to be
[TABLE]
In the present setting, is always finite, and one may check that it is also nonnegative (see Lemma 8).
3.7. The rate function
In this section we define and discuss the rate function whose existence is guaranteed by Theorem 1.
Definition 3.
For , let
[TABLE]
Note that the variational expression defining contains the sum of an expected loss term and a divergence term. It is known that Bayesian posterior distributions satisfy a similar variational principle in the finite sample setting (see [25, 52, 53]). Our results show that this interpretation passes to the limit as the number of samples tends to infinity.
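The finite sample variational principle mentioned above can be checked numerically on a finite parameter grid. The sketch below assumes a prior and a vector of cumulative losses, both hypothetical, and verifies that the Gibbs posterior (prior times the exponential of the negative loss, normalized) minimizes expected loss plus KL-divergence to the prior, with minimum value equal to the negative log normalizing constant.

```python
import math
import random

def gibbs_posterior(prior, losses):
    """Return the Gibbs posterior on a finite grid and its normalizing constant."""
    weights = [p * math.exp(-loss) for p, loss in zip(prior, losses)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(q, prior, losses):
    """Expected loss under q plus KL(q || prior), with 0 log 0 = 0."""
    return sum(qi * (loss + math.log(qi / pi))
               for qi, pi, loss in zip(q, prior, losses) if qi > 0.0)

prior = [0.2, 0.3, 0.5]   # hypothetical prior on three parameter values
losses = [1.0, 0.2, 2.5]  # hypothetical cumulative losses

posterior, z = gibbs_posterior(prior, losses)
optimum = objective(posterior, prior, losses)

# Compare against randomly drawn competitor distributions q.
rng = random.Random(0)
smallest_gap = math.inf
for _ in range(1000):
    raw = [rng.random() for _ in prior]
    q = [r / sum(raw) for r in raw]
    smallest_gap = min(smallest_gap, objective(q, prior, losses) - optimum)

print(optimum, -math.log(z), smallest_gap)
```

The gap between any competitor and the optimum is the KL-divergence from the competitor to the Gibbs posterior, which is why it is never negative.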
By Proposition 5, which appears in Section 6, we have that is lower semi-continuous. Since the loss function is continuous, the proof of Proposition 5 essentially follows from the upper semi-continuity of the fiber entropy on the space of joinings .
Remark 8.
Consider the introduction of an inverse temperature parameter , as discussed in Remark 3, and let be the associated loss function. If we let be the associated rate function, then we see from Definition 3 that
[TABLE]
Dividing by and letting tend to infinity to investigate the ground state behavior, it is clear that the associated variational expression is
[TABLE]
Interestingly, this variational expression has been studied recently as part of an asymptotic analysis of estimators based on empirical risk minimization for dynamical systems [36, 37]. Indeed, the solution set of this ground state variational problem exactly characterizes the set of possible limits of parameter estimates that asymptotically minimize average empirical risk.
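The ground state behavior can be illustrated in a toy setting with no dynamics: a finite parameter grid, a prior, and a fixed average risk for each parameter, all hypothetical. As the inverse temperature grows, the Gibbs posterior concentrates on the risk minimizer and the log partition function divided by the inverse temperature converges to the minimal risk. This is only a finite analogue of the remark; the entropy and joining terms of the rate function do not appear in it.

```python
import math

def gibbs_posterior(prior, risks, beta):
    """Gibbs posterior proportional to prior * exp(-beta * risk) on a finite grid."""
    r_min = min(risks)  # shift by the minimum risk for numerical stability
    weights = [p * math.exp(-beta * (r - r_min)) for p, r in zip(prior, risks)]
    z = sum(weights)
    return [w / z for w in weights]

def scaled_log_partition(prior, risks, beta):
    """Compute -(1/beta) * log sum_i prior_i * exp(-beta * risk_i)."""
    r_min = min(risks)
    z_shifted = sum(p * math.exp(-beta * (r - r_min)) for p, r in zip(prior, risks))
    return r_min - math.log(z_shifted) / beta

prior = [0.25, 0.25, 0.25, 0.25]
risks = [0.8, 0.3, 0.5, 1.2]  # hypothetical average risks; the minimizer is index 1

for beta in (1.0, 10.0, 100.0, 1000.0):
    post = gibbs_posterior(prior, risks, beta)
    print(beta, post[1], scaled_log_partition(prior, risks, beta))
```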
4. Connections to previous work
In the setting of i.i.d. samples, Doob [13] established Bayesian posterior consistency for almost every parameter value in the support of the prior using martingale methods. Later, Schwartz [43] gave necessary and sufficient conditions for posterior consistency at individual parameter values in the i.i.d. setting; these conditions require that the prior charge all KL-neighborhoods of the parameter and that there exist a sequence of tests giving exponential separation of the parameter from other parameters. The challenges and pitfalls of proving posterior consistency for nonparametric models were highlighted by Diaconis and Freedman in [12]. These negative results motivated much of the recent work in Bayesian nonparametrics, as well as the study of convergence in other metrics on the space of probability distributions (such as Hellinger) and the consideration of rates of convergence. For a detailed review of the Bayesian nonparametric literature we refer the reader to the recent book by Ghosal and van der Vaart [19].
Recent years have witnessed substantial interest in moving beyond the i.i.d. setting and considering statistical inference for dependent processes, including processes arising from dynamical systems. Statistical problems receiving recent attention in the context of dynamical systems include denoising (or filtering) [28, 29], consistency of maximum likelihood estimation [34], forecasting and density estimation [21, 44], empirical risk minimization [36, 37], and data assimilation and uncertainty quantification [30]. For a survey of this area, see [35]. Bayesian posterior consistency for dependent processes has also received attention in the literature. In particular, posterior consistency has been established for certain families of finite state hidden Markov chains [9, 14, 17, 45].
The idea of a variational formulation of Bayesian inference was developed by Zellner [51]. The link between statistical mechanics, information theory, and Bayesian inference was at the heart of the inference framework advocated by Edwin T. Jaynes [24], a perspective that influenced Zellner [51]. Formulating Bayesian inference as a variational problem for infinite dimensional problems has been explored in the control theory and inverse problems literatures [33, 38, 39]. In [39] a variational formulation of Bayesian inference was developed for the problem of channel coding using ideas from statistical mechanics. In [38] a variational characterization of Bayesian nonlinear estimation was shown to take the same form as Gibbs measures in statistical mechanics. In [33], the authors studied the inference problem of finding the most likely path given a Brownian dynamics model from molecular dynamics, which takes the form of a gradient flow in a potential, subject to small thermal fluctuations. In this problem setup, a variational solution was proposed for Bayesian inference.
The setting and results of [36] and [37] are worthy of some discussion, as they may be considered frequentist analogues of the present work. Indeed, the setting of this previous work involves observations from an unknown ergodic system, a model family consisting of topological dynamical systems, and a loss function connecting the models to the observations, as in the present work. Given this setting, the previous work analyzes the convergence of parameter estimates obtained by empirical risk minimization, whereas we study the convergence of parameter estimates based on Bayesian updates (in the form of the Gibbs posterior). One additional difference is that the previous results on empirical risk minimization are more general, in the sense that the model families need only consist of continuous maps on compact metric spaces; whereas, in our Bayesian setting, we specialize to the case of SFTs with Gibbs measures. This focus on Gibbs measures in the Bayesian setting arises precisely because Gibbs measures satisfy the necessary exponential estimates (the Gibbs property (1)) to make the asymptotic analysis work. It should be noted that a Bayesian framework provides estimates of uncertainty which empirical risk minimization does not.
The Gibbs posterior principle can be derived from general principles as a valid method to update belief distributions in the presence of a loss function [6]. In particular, this framework for updating beliefs remains valid when one does not have access to a true likelihood. This inference framework has also been shown to have advantages in some settings [25]. One of the motivations for the use of the Gibbs posterior in [25] was that exponentiating a robust loss function can better accommodate model misspecification, e.g., when the assumed likelihood is not the sample generating process. A logistic regression example is provided in [25] for which the usual posterior-based logistic regression produces suboptimal classification error even from among the misspecified logistic regression models, while the Gibbs posterior is optimal. Another argument for using a loss-based approach comes from the robust statistics literature [22]. A key idea in robust statistics is that one can define loss functions that are not sensitive to contamination of standard error or likelihood models. Thus, even if the model is misspecified, inference using the robust loss function is still reliable. The advantage of the Gibbs posterior framework is that one can specify coherent Bayesian updating using a robust loss function and not have to specify the data generating process.
The thermodynamic formalism in dynamical systems, originally pioneered by Sinai, Ruelle, and Bowen, involves adaptation of many ideas and methods from statistical physics to the setting of dynamical systems, and it has played a large role in the development of ergodic theory and dynamical systems over many years. For an introduction to the area and some connections to statistical physics, see the books by Bowen [7], Ruelle [41], or Walters [46]. Let us mention that connections to Markov chains and other stochastic processes have a long history in this area [5, 49, 50]. Additionally, relative equilibrium states were studied by Ledrappier and Walters [31], and recent results on uniqueness of relative equilibrium states [1, 2, 4, 40] may contain interesting ideas to apply towards Bayesian posterior consistency.
5. Technical preliminaries
This section contains several technical results that we use later in the proofs of the main theorems.
5.1. Pressure Lemma
We refer to the following elementary fact, which is an easy consequence of Jensen’s inequality, as the Pressure Lemma; see [46, Lemma 9.9].
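Lemma 2.
Let $a_1,\dots,a_k$ be real numbers. If $p_i \geq 0$ and $\sum_{i=1}^k p_i = 1$, then
\[ \sum_{i=1}^k p_i \bigl( a_i - \log p_i \bigr) \leq \log \Bigl( \sum_{i=1}^k e^{a_i} \Bigr), \]
with equality if and only if
\[ p_i = \frac{e^{a_i}}{\sum_{j=1}^k e^{a_j}}. \]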
5.2. The space of joinings and the ergodic decomposition
Our proofs rely on a general version of the ergodic decomposition for invariant probability measures. The following version, a restatement of [42, Theorem 2.5], is sufficient for our purposes.
Theorem (The Ergodic Decomposition).
Suppose that is a Borel measurable map of a Polish space and that . Then there exists a Borel probability measure on such that
- (1) $Q\bigl(\{\eta \mbox{ is invariant and ergodic for } R\}\bigr)=1$;
- (2) If , then for -almost every , and
[TABLE]
Whenever (2) holds, we write .
Additionally, we require the following results about the structure of from [36].
Theorem (Structure of the space of joinings).
Suppose is a continuous map of a compact metrizable space and is an ergodic measure-preserving system as in Section 1. Then is non-empty, compact, and convex. Furthermore, a joining is an extreme point of if and only if is ergodic for . Lastly, if and is its ergodic decomposition, then -almost every is in .
Let . By the above theorem, the ergodic decomposition of is a representation of as an integral combination of the extreme points of . A function is called harmonic if for each ,
[TABLE]
where is the ergodic decomposition of .
5.3. Disintegration results
Suppose is a continuous map of a compact metric space and is an ergodic system. It is well known in ergodic theory (see [20]) that for any joining , if is its disintegration over , then the family of measures satisfies an additional invariance property, which we state in the following lemma.
Lemma 3.
Let , and let be its disintegration over . Then for -almost every , and hence, for every and -almost every ,
[TABLE]
5.4. Limiting average loss
The following lemma will be applied to the limiting average loss. Recall that when is a continuous map of a compact metric space, the space of joinings is non-empty. For notation, if , then we let .
Lemma 4.
Suppose that is a Borel self-map of a complete metric space , and that is a Borel function for which there exists in such that for each . Then for any joining with disintegration over , for -almost every ,
[TABLE]
Proof.
For define . Then , since (using the hypotheses involving ). Now Lemma 3, together with the pointwise ergodic theorem, yields that for almost every ,
[TABLE]
∎
5.5. Fiber entropy
We require two additional properties of the fiber entropy in our setting. The first property is that fiber entropy is harmonic. This fact appears with proof as Lemma 3.2 (iii) in [31] in a setting under which is a continuous map of a compact space, but careful inspection shows that the proof does not depend on this hypothesis.
Lemma 5.
The map from to the non-negative extended reals satisfies the following property: if is the ergodic decomposition of , then
[TABLE]
Next, we note that the fiber entropy function is upper semi-continuous in our setting. The proof of Lemma 2.2 in [47] establishes upper semi-continuity of fiber entropy in a setting closely related to ours. By making only minor modifications of that proof, one may adapt it to our setting and prove the following lemma.
Lemma 6.
Let , , and be as in the introduction, and let act on the product space . Then the map from to is upper semi-continuous.
5.6. Divergence terms and average information
Define
[TABLE]
where by convention. With these definitions, we always have
[TABLE]
Recall that may be interpreted as the expected information of under the partition , where the expectation is with respect to . In contrast, may be interpreted as the expected information of under the partition , where the expectation is again taken with respect to . In what follows, if is a partition of a space and , we let denote the partition element containing . Here we restate and then prove Lemma 1.
Lemma 7.
Let be a Hölder continuous potential on a mixing SFT with associated Gibbs measure . Let be the partition of into cylinder sets of the form , and let be ergodic. Then for -almost every ,
[TABLE]
Proof.
Recall that by the Gibbs property for , for any and in , we have
[TABLE]
Taking logarithms yields the bound
[TABLE]
As this inequality is uniform in , we may integrate with respect to to obtain
[TABLE]
Dividing by and applying Lemma 4 gives
[TABLE]
It follows from (10) that . Since is ergodic, for -almost every , we have , where the equality is a result of the fact that is a generating partition for . Combining this fact with (11), we find that for -almost every ,
[TABLE]
as desired. ∎
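The Gibbs property used at the start of this proof can be checked by hand in the Markov case. The sketch below assumes the property takes the standard Bowen form, in which the measure of a length-n cylinder divided by the exponential of the Birkhoff sum of the potential minus n times the pressure lies between 1/K and K, uniformly in n. For a stationary Markov measure the potential is the log transition probability and the pressure is zero. The transition matrix `P` and the explicit constant `K` are hypothetical choices for illustration.

```python
import itertools
import math

P = [[0.6, 0.4], [0.5, 0.5]]     # hypothetical transition matrix, positive entries
a, b = P[0][1], P[1][0]
pi = [b / (a + b), a / (a + b)]  # stationary distribution of P

def cylinder_measure(word):
    """Markov measure of the cylinder set determined by `word`."""
    prob = pi[word[0]]
    for i, j in zip(word, word[1:]):
        prob *= P[i][j]
    return prob

def birkhoff_sum(point, n):
    """S_n f at a point, where f(x) = log P[x_0][x_1]; reads coordinates 0..n."""
    return sum(math.log(P[point[k]][point[k + 1]]) for k in range(n))

def gibbs_ratios(n):
    """All values of mu[x_0..x_{n-1}] / exp(S_n f(x)); the pressure of f is 0."""
    ratios = []
    for point in itertools.product((0, 1), repeat=n + 1):
        ratios.append(cylinder_measure(point[:n]) / math.exp(birkhoff_sum(point, n)))
    return ratios

# A constant that works for every n in this example.
K = max(max(pi) / min(min(row) for row in P),
        max(max(row) for row in P) / min(pi))

for n in (1, 4, 8):
    r = gibbs_ratios(n)
    print(n, min(r), max(r), 1 / K, K)
```

Here each ratio collapses to the stationary mass of the first symbol divided by one transition probability, which is why the bounds do not depend on n.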
Now we prove a lemma that guarantees that .
Lemma 8.
For each and ,
[TABLE]
Proof.
Let be the -marginal of . Then , where the inequality follows from elementary information theoretic facts concerning conditional entropy (see [10]) and the equality is a basic property of fiber entropy. Then by the variational principle for pressure (7),
[TABLE]
as desired. ∎
We now establish a lemma that is used in the proof of Theorem 1. This result allows us to approximate the expected information in the prior , where the expectation is with respect to an arbitrary measure, in terms of an average of a continuous function. These types of estimates are available precisely because our model class consists of Gibbs measures: indeed, they do not hold for arbitrary invariant measures for dynamical systems.
For any Borel probability measure on , let denote its time-average up to time :
[TABLE]
where is the identity.
Lemma 9.
Let be the constant in the uniform Gibbs property (2). For any there exists such that if the diameter of is less than and is the partition of into cylinder sets of the form , then for any Borel probability measure on , and any ,
[TABLE]
Proof.
Let . By the uniform continuity of and in and the uniform Gibbs property, there exists such that if the diameter of is less than and is the partition of into sets of the form , then for all , , and ,
[TABLE]
Taking logarithms and dividing by , we obtain the inequality
[TABLE]
which is uniform over . Now let be any Borel probability measure on . Then by integrating with respect to , we see that
[TABLE]
∎
6. Semicontinuity of the rate function and
Proposition 5.
The map defined in Definition 3 is lower semi-continuous, and hence the set is compact and non-empty.
Proof.
Let and let be given by , where is the identity on . Define by
[TABLE]
which is continuous and satisfies . Finally, define by
[TABLE]
Since is continuous and is upper semi-continuous (by Lemma 6), is upper semi-continuous. Let be defined by setting to be the -marginal of , which is a continuous surjection of compact spaces. One may easily check from the definition of upper semicontinuity that the function
[TABLE]
is also upper semicontinuous. Since is the negative of this function, we conclude that is lower semi-continuous.
For the second part of the proposition, we note that is the set of minimizers of the lower semi-continuous function on the compact set , and hence it is non-empty and compact. ∎
7. Convergence of the partition function and a variational principle
In this section, we prove Theorem 1, which concerns the convergence of the average log normalizing constant (partition function) . The starting point of the proof, which is an application of the Pressure Lemma, allows us to express the main statistical object, the Gibbs posterior distribution, as the solution of a variational problem involving information theoretic notions such as entropy and average information, which have long been studied in dynamics. The proof of Theorem 1 follows.
To ease notation slightly in this section, we let and , where is defined in (3). We also set and . For , we will have use for the notation
[TABLE]
Although we do not use this fact, we note that can be written as an integral over of terms of the form (as in Definition 2). Lemma 5 ensures that is harmonic, and therefore the same is true of . In this notation, our goal is to prove
[TABLE]
We present the proof in two stages: first we establish that the expression on the right-hand side is a lower bound for , and then we prove that the same expression provides an upper bound.
7.1. Lower bound
The goal of this section is to prove the following result.
Proposition 6.
For -almost every ,
[TABLE]
where .
Before proving this proposition, we first establish a lemma. If is a Borel probability measure on and , then let denote the conditional distribution . Also, we say that is a partition of according to central words whenever for some .
Lemma 10.
Let be a finite measurable partition of with , and let be a partition of according to central words such that . Then for any Borel probability measure on , any , and any ,
[TABLE]
where is the local difference function appearing in property (iii) of the loss.
Proof.
If , then the inequality holds trivially. Now suppose , and let . For and , property (iii) of the loss function, and our hypotheses on and yield that
[TABLE]
Integrating out with respect to the conditional distribution gives
[TABLE]
After exponentiation and integration with respect to the , we get
[TABLE]
Invoking Lemma 2 and the inequality above, we find that
[TABLE]
as was to be shown. ∎
Proof of Proposition 6. Fix an ergodic joining and . Let be sufficiently small that the bound of Lemma 9 holds and that (using property (iii) of the loss). Fix a finite measurable partition of such that , and select large enough so that the partition of generated by central words of length satisfies . Then for -almost every ,
[TABLE]
where the inequality follows from Lemma 10. Dividing each side of the inequality above by , and then letting tend to infinity, Lemma 4, Lemma 9, and the ergodic theorem together imply that for -almost every ,
[TABLE]
Taking the supremum over all partitions of with diameter less than and all partitions of generated by central words of length at least , we obtain the inequality
[TABLE]
Since was arbitrary,
[TABLE]
As this inequality holds for all ergodic and the left-hand side is harmonic in , we have
[TABLE]
which completes the proof.
7.2. Upper bound
In Proposition 7 below we establish an almost sure upper bound on the limiting behavior of . Together with the lower bound in Proposition 6, this completes the proof of Theorem 1.
Proposition 7.
For -almost every ,
[TABLE]
We begin with a preliminary lemma. Recall that is the prior distribution on generated by the prior (defined in (4)) and the family , while is the Gibbs posterior distribution associated with (defined in (5)). To simplify notation, in what follows is denoted by .
Lemma 11.
If is a finite measurable partition of with diameter less than then for and ,
[TABLE]
Proof.
Let be a finite measurable partition of with , and let . By definition and are equivalent measures, and hence and . Let .
Fix for the moment. For points the hypothesis on ensures that
[TABLE]
where is defined in condition (iii) of the loss. Exponentiating both sides of the inequality and integrating with respect to the prior conditioned on being in yields
[TABLE]
Taking logarithms and integrating with respect to the posterior conditioned on being in yields
[TABLE]
By the definition of and Lemma 2 we have
[TABLE]
Applying inequality (12) to the terms of the final sum above, we see that
[TABLE]
as desired. ∎
Proof of Proposition 7. To begin the proof, define
[TABLE]
By [26, Lemma 2.1], for -almost every , the sequence is tight and all of its limit points are contained in . For a given in this set of full measure, let be such a limit point, with .
Let , and choose such that . Choose a finite measurable partition of such that and (which exists since is compact [46, Lemma 8.5]). By adapting an argument from [46, p. 190] involving subadditivity of measure-theoretic entropy, we obtain that for each , for ,
[TABLE]
where refers to a term that tends to 0 as tends to infinity (for fixed ). Then by letting tend to infinity and applying [26, Lemma 2.1] again, we see that
[TABLE]
where the conditional entropy is defined in (8). To proceed with the proof, we require the following lemma. Recall that at the beginning of this section, we set and .
Lemma 12.
Let be any sequence of measures on . For each and define
[TABLE]
If the subsequence converges to , then
[TABLE]
Proof.
By definition of ,
[TABLE]
Then the desired limit follows from the fact that converges to and is continuous. ∎
Combining Lemma 12 with Lemmas 9 and 11, we find that for -almost every
[TABLE]
Letting tend to infinity, we get
[TABLE]
Since was arbitrary, we obtain
[TABLE]
This concludes the proof of Proposition 7.
8. Convergence of Gibbs posterior distributions
The purpose of this section is to establish Theorem 2 concerning convergence of the Gibbs posterior distributions to the solution set of a variational problem. From the dynamics point of view, this convergence highlights the role of the variational problem and the associated equilibrium joinings. We believe these objects to be worthy of further study. From the statistical point of view, this result describes the concentration of posterior distributions, which is of interest in any frequentist analysis of Bayesian methods. The proof follows somewhat directly from Theorem 1.
Proof of Theorem 2. Let be an open neighborhood of . Let , which is closed and therefore compact. If , then for all . Now suppose , and let be the conditional prior on . Let be the common value of for . As is lower semi-continuous and is compact and disjoint from , there exists such that . Now we apply Theorem 1 in two ways: first, with the full parameter set and prior , and second, with in place of and the conditional prior in place of . Let denote the normalizing constant in the second case. Then for -almost every , there exists and such that for all ,
[TABLE]
and for all ,
[TABLE]
Then for all , we have
[TABLE]
Thus, for -almost every , we see that tends to [math].
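The mechanism of this proof, comparing the normalizing constant restricted to a set away from the minimizers with the full normalizing constant, can be sketched in a toy model. The code below assumes a finite parameter grid and a cumulative loss that is exactly n times a fixed per-parameter value; the list `rates` is a hypothetical stand-in for the rate function. The log posterior mass of the far set, divided by n, then converges to the negative gap between the minimum over the far set and the global minimum.

```python
import math

def log_sum_exp(terms):
    """Numerically stable log of a sum of exponentials."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def log_posterior_mass(subset, prior, rates, n):
    """Log Gibbs posterior mass of `subset` when the cumulative loss is n * rates[i]."""
    log_weights = [math.log(p) - n * r for p, r in zip(prior, rates)]
    restricted = log_sum_exp([log_weights[i] for i in subset])  # restricted constant
    return restricted - log_sum_exp(log_weights)                # minus full constant

prior = [0.25, 0.25, 0.25, 0.25]
rates = [0.5, 0.1, 0.3, 0.9]  # hypothetical limiting values; the minimizer is index 1
far_set = [0, 3]              # indices bounded away from the minimizer
gap = min(rates[i] for i in far_set) - min(rates)

for n in (10, 100, 1000):
    print(n, log_posterior_mass(far_set, prior, rates, n) / n, -gap)
```

A strictly positive gap is exactly what forces the posterior mass of the far set to decay exponentially in this toy model.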
9. Posterior consistency for Gibbs processes
Here we consider the problem of inference from direct observations of a Gibbs process, as described in Section 2.1. Recall that Gibbs processes allow one to model substantial degrees of dependence, with Markov chains of arbitrarily large order as a special case. In the present setting, we are able to establish posterior consistency (Theorem 3). The first step of the proof involves the application of our main results to show that the posterior distributions concentrate around . Interestingly, the second main step of the proof (showing that ) relies on a celebrated result of Bowen about uniqueness of equilibrium states in dynamics.
Proof of Theorem 3. To begin, let us first establish the connection between the setting of Section 2.1 and the general framework for Gibbs posterior inference in Section 1. Let , , , and be as in Section 2.1. In this particular application, we take to be the trivial mixing SFT, which consists of exactly one point. Intuitively, is unnecessary in this application because we make direct observations of the underlying trajectory (i.e., there is no need for an underlying “hidden” truth). As is trivial in this application, we omit it in our notation. Next, we let the observed system be . Then we define the loss function by setting . Using our regularity assumptions on and , one may easily check that conditions (i)-(iii) are satisfied. We have now specified all the objects necessary for the general framework of Section 1. Let , and for , let be the Gibbs posterior distribution on given observations . We remind the reader that and are formally distinct distributions. Nonetheless, the following lemma shows that they are closely related.
Lemma 13.
Let be the uniform Gibbs constant for the family . Then for any Borel set and , for ,
[TABLE]
Proof.
By the uniform Gibbs property, for any , , and , we have
[TABLE]
Let be a Borel set. Integrating the inequality above with respect to yields the inequalities
[TABLE]
Applying these upper and lower bounds to the sets and we find that
[TABLE]
and similarly,
[TABLE]
∎
We require one additional fact before finishing the proof of the theorem.
Lemma 14.
Under the present hypotheses, .
Proof.
Recall that is defined as the set of such that , where is the rate function
[TABLE]
As we have chosen to be trivial and in this application, the set of joinings contains only the trivial joining . Hence the definition of ensures that
[TABLE]
where we have used that the fiber entropy of over is trivially zero. To finish the proof, we will show that is minimized if and only if , and therefore .
First suppose that . By the uniqueness of the Gibbs measure (see [7]) and the variational principle for pressure, we have that
[TABLE]
Subtracting from both sides, we obtain
[TABLE]
Then by the variational principle for pressure and this inequality, we have
[TABLE]
Hence , and we conclude that .
Now suppose that . Then by the variational principle for pressure and the fact that ,
[TABLE]
Thus , and since was arbitrary, we have shown that . ∎
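A small numerical sketch of this identification, under simplifying assumptions: the model family consists of symmetric two-state Markov chains with flip probability theta, the observations come from the chain with flip probability `theta0`, and the loss is the negative log transition probability. Each such chain has uniform stationary distribution and its potential has pressure zero, so the limiting average loss is a cross-entropy rate. The family, the grid, and the function name are hypothetical.

```python
import math

def limiting_log_loss(theta, theta0):
    """Cross-entropy rate of the model chain (flip probability theta)
    under the stationary chain with flip probability theta0."""
    return -((1 - theta0) * math.log(1 - theta) + theta0 * math.log(theta))

grid = [k / 100 for k in range(1, 100)]  # candidate flip probabilities
theta0 = 0.3                             # data-generating parameter

best = min(grid, key=lambda t: limiting_log_loss(t, theta0))
entropy_rate = limiting_log_loss(theta0, theta0)

print(best, limiting_log_loss(best, theta0), entropy_rate)
```

On this grid the unique minimizer is the data-generating parameter, and the minimal value is the entropy rate of the observed chain, in line with the role the variational principle for pressure plays in the proof.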
We now complete the proof of Theorem 3. Let be an open set such that , and let . By Lemma 14, we have . Hence by Theorem 2, for -almost every , the Gibbs posterior satisfies . Then by Lemma 13, for -almost every , we see that the standard posterior satisfies , as desired.
10. Posterior consistency for hidden Gibbs processes
In this section we establish posterior consistency for hidden Gibbs processes, as in Section 2.2. In addition to modeling substantial dependence with the underlying Gibbs processes, this setting also allows for quite general observational noise models. Note that hidden Markov models with arbitrarily large order appear as a special case in this framework. Here the first part of the proof involves an application of our main results to show that the posterior converges to the set . However, the second part of the proof begins with the well-known fact that the Gibbs measures satisfy large deviations principles (see [48]), and then relies on some recent results from [34] connecting these large deviations properties to the likelihood function in our general observational framework.
Proof of Theorem 4. We begin by placing the setting of Section 2.2 within the general framework of Section 1. Let , , , , , , and be as in Section 2.2. To define the observation space in our general framework, we let . We define the map to be the left-shift, i.e., if , then is the sequence whose -th coordinate is . Furthermore, we define , which is the process measure on described in Section 2.2. Then is an ergodic measure preserving system (see [34, Proposition 6.1] for ergodicity). Now define by . Note that the conditions (i)-(iii) on are satisfied by our assumptions on . Define to be the Gibbs posterior defined as in Section 1. Note that in this setting, if , then the Gibbs posterior is equal to the standard posterior . We require a few lemmas before finishing the proof of the theorem. Before we state the first such lemma, recall that denotes the -integrable function on appearing in property (ii) in Section 1.3.
Lemma 15.
Let . Then for each and ,
[TABLE]
Proof.
For notation, let
[TABLE]
First suppose that . Then by Jensen’s inequality and the definition of ,
[TABLE]
Now suppose that . Then
[TABLE]
where we have used that both the logarithm and the exponential are increasing. ∎
Lemma 16.
Let . Then
[TABLE]
Proof.
For each , let
[TABLE]
and let . By property (ii), is -integrable and thus the pointwise ergodic theorem ensures that converges for -almost every to the constant . Furthermore, . By Lemma 15, for each . Therefore, by the generalized Lebesgue dominated convergence theorem and the definition of the loss,
[TABLE]
By Theorem 1, the -almost sure limit of is equal to . Combining these facts, we obtain the desired equality. ∎
Lemma 17.
Suppose . Then
[TABLE]
Proof.
The well-known large deviations principles for the Gibbs measures [48] imply that they satisfy property (L1) from [34]. By hypothesis, satisfies the regularity of observations property (L2) from [34]. Then results from [34] (in particular Propositions 4.3 and 6.4) yield the desired inequality. ∎
We now proceed with the proof of Theorem 4. Recall that for our choice of loss function ensures that the Bayesian posterior is equal to the Gibbs posterior . By Theorem 2, the Gibbs posterior concentrates -almost surely around the set , defined as the set of such that . Hence concentrates -almost surely around . It remains to show that .
Suppose . Then by Lemmas 16 and 17, we have
[TABLE]
It follows immediately that . For the reverse inclusion, note that if , then , and thus for each ,
[TABLE]
Then Lemma 16 gives that for each . This concludes the proof of Theorem 4.
11. Additional results
In this section we collect some auxiliary results about Gibbs posterior inference. We begin with a converse to Theorem 2 on the exponential scale: if is an open set intersecting , then the Gibbs posterior measure of cannot be exponentially small as tends to infinity.
Proposition 8.
Suppose is open and . Then for -almost every ,
[TABLE]
Proof.
Let . By definition of we have . Fix and select sufficiently small that and that the ball of radius around is contained in . Since is fully supported, . Note that for each and ,
[TABLE]
Taking logarithms, dividing by , and letting tend to infinity yields
[TABLE]
As was arbitrary, we obtain the desired result. ∎
We now address the Cesàro convergence of the full posterior on . Recall that we let be the identity map on . In the thermodynamic formalism, invariant measures that achieve the optimal value in the variational expression for pressure are called equilibrium measures. In our setting, we introduce terminology for joinings that achieve the optimal value in the variational expression for the rate function. We will call a joining an equilibrium joining if
[TABLE]
Proposition 9.
For each and , let be defined for Borel sets by
[TABLE]
Then for -almost every , all limit points of are -marginals of equilibrium joinings.
Proof.
As in Section 7.2, let
[TABLE]
By definition, is the -marginal of . Let be a weak limit of the subsequence . By repeating the arguments of Section 7.2, one may show that there is a subsequence such that converges weakly to an equilibrium joining . As is necessarily the -marginal of the limit , the proof is complete. ∎
References
- [1] Mahsa Allahbakhshi and Anthony Quas. Class degree and relative maximal entropy. Transactions of the American Mathematical Society, 365(3):1347–1368, 2013.
- [2] Mahsa Allahbakhshi, John Antonioli, and Jisang Yoo. Relative equilibrium states and class degree. Ergodic Theory and Dynamical Systems, pages 1–24, 2017.
- [3] José F. Alves, Vanessa Ramos, and Jaqueline Siqueira. Equilibrium stability for non-uniformly hyperbolic systems. Ergodic Theory and Dynamical Systems, pages 1–24, 2018.
- [4] John Antonioli. Compensation functions for factors of shifts of finite type. Ergodic Theory and Dynamical Systems, 36(2):375–389, 2016.
- [5] Michael Benedicks and Lai-Sang Young. Markov extensions and decay of correlations for certain Hénon maps. Astérisque, 261:13–56, 2000.
- [6] Pier Giovanni Bissiri, Chris C. Holmes, and Stephen G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
- [7] Rufus Bowen. Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms, volume 470. Springer, Berlin, Heidelberg, 1975.
- [8] J.-R. Chazottes, E. Floriani, and R. Lima. Relative entropy and identification of Gibbs measures in dynamical systems. Journal of Statistical Physics, 90(3-4):697–725, 1998.
