Gibbs posterior convergence and the thermodynamic formalism

Kevin McGoff; Sayan Mukherjee; Andrew Nobel

arXiv:1901.08641 · math.ST · January 28, 2019


TL;DR

This paper develops a Bayesian inference framework using Gibbs posteriors for dynamical systems, analyzing their asymptotic behavior and establishing connections with thermodynamic formalism to enhance understanding of dependent process inference.

Contribution

It introduces a Gibbs posterior approach for dynamical systems, characterizes its asymptotic behavior, and links Bayesian inference with thermodynamic formalism for dependent processes.

Findings

1. Gibbs posteriors concentrate around solutions of a variational problem.

2. Posterior consistency can be established for properly specified models.

3. Connections between Bayesian inference and thermodynamic formalism are demonstrated.


Equations

$$\mathcal{X}=\bigl\{x\in\Sigma:\forall i\in\mathbb{Z},\,x_{i+1}\dots x_{i+n}\notin\mathcal{W}\bigr\}.$$

$$A_{uv}=\begin{cases}1,&\text{if }\exists x\in\mathcal{X}\text{ such that }x_{0}^{n-1}=u\text{ and }x_{1}^{n}=v,\\ 0,&\text{otherwise}.\end{cases}$$

$$K^{-1}\leq\frac{\mu\bigl(x[0,m-1]\bigr)}{\exp\Bigl(-\mathcal{P}m+\sum_{k=0}^{m-1}f\bigl(S^{k}(x)\bigr)\Bigr)}\leq K,$$

$$|f(x)-f(y)|\leq c\,d_{\mathcal{X}}(x,y)^{r}.$$

$$\|f\|_{r}=\sup_{x\in\mathcal{X}}|f(x)|+\sup_{x\neq y}\frac{|f(x)-f(y)|}{d_{\mathcal{X}}(x,y)^{r}}.$$

$$K^{-1}\leq\frac{\mu_{\theta}(x[0,m-1])}{\exp\Bigl(-\mathcal{P}(f_{\theta})m+\sum_{k=0}^{m-1}f_{\theta}(S^{k}x)\Bigr)}\leq K.$$

$$\pi(\theta\mid\text{data})$$

$$d\bigl((\theta,x),(\theta^{\prime},x^{\prime})\bigr)=\max\bigl(d_{\Theta}(\theta,\theta^{\prime}),d_{\mathcal{X}}(x,x^{\prime})\bigr).$$

$$\sup\bigl\{|\ell(\theta,x,y)-\ell(\theta^{\prime},x^{\prime},y)|:\,d\bigl((\theta,x),(\theta^{\prime},x^{\prime})\bigr)\leq\delta\bigr\}\leq\rho_{\delta}(y),$$

$$\ell_{n}(\theta;x_{0}^{n-1},y_{0}^{n-1})=\sum_{k=0}^{n-1}\ell(\theta,x_{k},y_{k}).$$

$$P_{0}(E)=\int\int\mathbf{1}_{E}(\theta,x)\,d\mu_{\theta}(x)\,d\pi_{0}(\theta).$$

$$P_{n}(E\mid y)=\frac{1}{Z_{n}(y)}\int\int\mathbf{1}_{E}(\theta,x)\exp\bigl(-\ell_{n}(\theta,x,y)\bigr)\,d\mu_{\theta}(x)\,d\pi_{0}(\theta),$$

$$Z_{n}(y)=\int\exp\bigl(-\ell_{n}(\theta,x,y)\bigr)\,dP_{0}(\theta,x).$$

$$\pi_{n}(E\mid y)=P_{n}(E\times\mathcal{X}\mid y).$$

$$\lim_{n}-\frac{1}{n}\log Z_{n}(y)=\inf_{\theta\in\Theta}V(\theta).$$

$$\Theta_{\min}=\arg\min_{\theta\in\Theta}V(\theta).$$

$$\lim_{n}\pi_{n}\bigl(\Theta\setminus U\mid y\bigr)=0.$$

$$\pi(\theta\mid x)=\arg\min_{\mu}\biggl\{\int_{\theta}\ell(\theta,x)\,d\mu(\theta)+d_{KL}(\mu,\pi)\biggr\},$$

$$\Pi_{n}(A\mid y_{0}^{n-1})=\frac{\int_{A}\mu_{\theta}([y_{0}^{n-1}])\,d\Pi_{0}(\theta)}{\int_{\Theta}\mu_{\theta}([y_{0}^{n-1}])\,d\Pi_{0}(\theta)}.$$

$$\lim_{n}\Pi_{n}\bigl(\Theta\setminus U\mid Y_{0}^{n-1}\bigr)=0.$$

$$\int\varphi_{\theta}(u\mid x)\,dm(u)=1.$$

$$\sup_{(\theta,x)\in\Theta\times\mathcal{X}}\int\exp\bigl(\beta C(\theta,u)\bigr)\varphi_{\theta}(u\mid x)\,dm(u)<\infty.$$

$$p_{\theta}\bigl(u_{0}^{n-1}\mid x\bigr)=\prod_{k=0}^{n-1}\varphi_{\theta}\bigl(u_{k}\mid S^{k}x\bigr),$$

$$p_{\theta}\bigl(u_{0}^{n-1}\bigr)=\int p_{\theta}\bigl(u_{0}^{n-1}\mid x\bigr)\,d\mu_{\theta}(x).$$

$$\Pi_{n}(E\mid u_{0}^{n-1})=\frac{\int_{E}p_{\theta}(u_{0}^{n-1})\,d\Pi_{0}(\theta)}{\int_{\Theta}p_{\theta}(u_{0}^{n-1})\,d\Pi_{0}(\theta)}.$$

$$\lim_{n}\Pi_{n}\bigl(\Theta\setminus E\mid Y_{0}^{n-1}\bigr)=0,\quad\quad\mathbb{P}^{U}_{\theta^{*}}\text{-a.s.}$$

$$J(R_{0}:\eta_{1})=\bigcup_{\eta_{0}\in\mathcal{M}(U_{0},R_{0})}J(\eta_{0},\eta_{1}),$$

$$H(\eta,\alpha)=-\sum_{C\in\xi}\eta(C)\log\eta(C),$$

$$\bigvee_{k=0}^{n}\alpha^{k}=\bigl\{A_{0}\cap\dots\cap A_{n}:A_{i}\in\alpha^{i}\bigr\}.$$


Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Bayesian Methods and Mixture Models · Statistical Methods and Inference

Full text

Gibbs posterior convergence and the thermodynamic formalism

Kevin McGoff (UNC Charlotte), Sayan Mukherjee (Duke University), and Andrew Nobel (UNC Chapel Hill)

Abstract.

In this paper we consider a Bayesian framework for making inferences about dynamical systems from ergodic observations. The proposed Bayesian procedure is based on the Gibbs posterior, a decision theoretic generalization of standard Bayesian inference. We place a prior over a model class consisting of a parametrized family of Gibbs measures on a mixing shift of finite type. This model class generalizes (hidden) Markov chain models by allowing for long range dependencies, including Markov chains of arbitrarily large orders. We characterize the asymptotic behavior of the Gibbs posterior distribution on the parameter space as the number of observations tends to infinity. In particular, we define a limiting variational problem over the space of joinings of the model system with the observed system, and we show that the Gibbs posterior distributions concentrate around the solution set of this variational problem. In the case of properly specified models our convergence results may be used to establish posterior consistency. This work establishes tight connections between Gibbs posterior inference and the thermodynamic formalism, which may inspire new proof techniques in the study of Bayesian posterior consistency for dependent processes.

Kevin McGoff would like to acknowledge funding from NSF grant DMS 1613261.

Sayan Mukherjee would like to acknowledge funding from NSF DEB-1840223, NIH R01 DK116187-01, HFSP RGP0051/2017, NSF DMS 17-13012, and NSF DMS 16-13261.

Andrew Nobel would like to acknowledge funding from NSF DMS-1613261, NSF DMS-1613072, NIH R01 HG009125-01.

1. Introduction

In this work we establish asymptotic results concerning Bayesian inference for certain dynamical systems. We consider a fairly general framework in which both the observations and the fitted models arise from dynamical systems. Our analysis brings together two distinct strands of research, both of which were originally inspired by connections with statistical physics: the thermodynamic formalism in dynamical systems, and the Gibbs posterior principle for Bayesian inference in statistics. Our work highlights the substantial connections between these two areas and shows that together they produce a natural framework for Bayesian inference about dynamical systems. Our general results guarantee the concentration of Gibbs posterior distributions around certain sets of parameters that are characterized by a variational principle. As applications of these general results, we also establish posterior consistency results for some classes of dynamical models, generalizing previous posterior consistency results for Markov and hidden Markov models in Bayesian nonparametrics.

1.1. Observed system

Our inference framework consists of two main components. The first component is an observed dynamical system, defined as follows. Let $\mathcal{Y}$ be a complete separable metric space. Here and throughout this work we assume that all such spaces are endowed with their Borel $\sigma$ -algebras, and we suppress this choice in our notation. Let $T:\mathcal{Y}\to\mathcal{Y}$ be a Borel measurable map. We let $\mathcal{M}(\mathcal{Y})$ denote the set of Borel probability measures on $\mathcal{Y}$ , endowed with the weak∗ topology on measures. For $\nu\in\mathcal{M}(\mathcal{Y})$ , we say that $\nu$ is invariant under $T$ if $\nu(T^{-1}E)=\nu(E)$ for all Borel sets $E\subset\mathcal{Y}$ . The set of $T$ -invariant measures in $\mathcal{M}(\mathcal{Y})$ is denoted by $\mathcal{M}(\mathcal{Y},T)$ . Furthermore, we say that $\nu\in\mathcal{M}(\mathcal{Y},T)$ is ergodic if $\nu(E)\in\{0,1\}$ for all Borel sets $E$ satisfying $T^{-1}(E)=E$ . Our standard assumption is that the observed system has the form $(\mathcal{Y},T,\nu)$ , where $\nu\in\mathcal{M}(\mathcal{Y},T)$ is ergodic.
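To make the standing assumption concrete, the following Python sketch simulates one ergodic system of this form: the rotation $T(y)=y+\alpha \bmod 1$ on $[0,1)$ with irrational $\alpha$, for which Lebesgue measure is invariant and ergodic. The choice of system, the function names, and the test function are illustrative assumptions and are not taken from the paper. The sketch compares a time average along one orbit with the corresponding space average under $\nu$.

```python
import numpy as np

def rotation_orbit(y0, alpha, n):
    """Return (y0, T y0, ..., T^{n-1} y0) for T(y) = y + alpha mod 1."""
    return (y0 + alpha * np.arange(n)) % 1.0

def time_average(g, orbit):
    """Birkhoff average (1/n) * sum_k g(T^k y) along the orbit."""
    return float(np.mean(g(orbit)))

# Illustrative choices (assumptions): golden-ratio rotation number, start point 0.1.
alpha = (np.sqrt(5.0) - 1.0) / 2.0
orbit = rotation_orbit(y0=0.1, alpha=alpha, n=100_000)

# Indicator of the interval [0, 0.3); its integral against Lebesgue measure is 0.3.
indicator = lambda y: (y < 0.3).astype(float)
avg = time_average(indicator, orbit)
print(avg)  # should be close to 0.3 for a long orbit
```

Ergodicity is what licenses replacing averages over $\nu$ by averages along a single observed trajectory, which is the form in which the data enter the inference procedure.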

1.2. Model families

The second component of our inference framework is a collection of models. In order to model dynamics in the standard statistical setting, one typically considers (hidden) Markov models or more complex state space models. In our analysis we would like to be able to handle model processes with long range dependencies, and so we consider a general class of models known as Gibbs measures. This class of models strictly generalizes the class of finite state Markov models with arbitrarily large order.

Before giving a precise definition of a Gibbs measure, we must first introduce the underlying state space for such models, which is called a mixing shift of finite type (SFT). A shift of finite type is a dynamical system that is the topological analogue of a finite state aperiodic and irreducible Markov chain. SFTs have been widely studied in the dynamical systems literature, both for their own sake [32] and as model systems for some smooth systems such as Axiom A diffeomorphisms [7]. Furthermore, SFTs have substantial connections to statistical physics and other fields such as coding and information theory [32, 41].

Here we give a precise definition of a mixing SFT. Let $\mathcal{A}$ be a finite set, known as an alphabet, and let $\Sigma=\mathcal{A}^{\mathbb{Z}}$ be the set of bi-infinite sequences $x=(x_{n})$ with values in $\mathcal{A}$. Define the left-shift map $\sigma:\Sigma\to\Sigma$ by $\sigma(x)_{n}=x_{n+1}$. A set $\mathcal{X}\subset\Sigma$ is called an SFT if there exist $n\geq 0$ and a collection of words $\mathcal{W}\subset\mathcal{A}^{n}$ such that $\mathcal{X}$ is exactly the set of sequences in $\Sigma$ that contain no words from $\mathcal{W}$:

$$\mathcal{X}=\bigl\{x\in\Sigma:\forall i\in\mathbb{Z},\,x_{i+1}\dots x_{i+n}\notin\mathcal{W}\bigr\}.$$

Here $\mathcal{W}$ is called a set of forbidden words for $\mathcal{X}$. Note that by choosing $\mathcal{W}=\varnothing$, one obtains the full sequence space $\Sigma$, which is known as the full shift (on the alphabet $\mathcal{A}$). Also, we endow $\mathcal{A}$ with the discrete topology and $\Sigma$ with the product topology, which makes any such $\mathcal{X}$ closed and compact. We define the map $S:\mathcal{X}\to\mathcal{X}$ to be the restriction of the left shift $\sigma$ to $\mathcal{X}$. Let $\mathcal{L}_{m}$ denote the set of words of length $m$ (i.e., elements of $\mathcal{A}^{m}$) that appear in at least one point of $\mathcal{X}$, and let $\mathcal{L}=\cup_{m\geq 0}\mathcal{L}_{m}$. An SFT $\mathcal{X}$ is said to be mixing if for any two words $u,v\in\mathcal{L}$, there exists $N$ such that for all $m\geq N$, there exists a word $w\in\mathcal{L}_{m}$ such that $uwv\in\mathcal{L}$. The following equivalent definition is perhaps more intuitive to readers familiar with Markov chains. Let $A$ be the square matrix indexed by $\mathcal{A}^{n}$ defined for words $u,v\in\mathcal{A}^{n}$ by the rule

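$$A_{uv}=\begin{cases}1,&\text{if }\exists x\in\mathcal{X}\text{ such that }x_{0}^{n-1}=u\text{ and }x_{1}^{n}=v,\\ 0,&\text{otherwise}.\end{cases}$$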

Then $\mathcal{X}$ is mixing if and only if there exists $N\geq 1$ such that every entry of $A^{N}$ is positive. Our standard assumption on $\mathcal{X}$ is that it is a mixing SFT.
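The matrix criterion can be checked mechanically. The Python sketch below builds a 0-1 transition matrix from a set of forbidden words and searches for a power with all entries positive. It makes two simplifying assumptions that are not in the text: the matrix is indexed only by words of length $n$ that avoid $\mathcal{W}$ (rows of forbidden words would be identically zero), and every such word is assumed to extend to a point of $\mathcal{X}$, which holds in the examples used here. The function names are hypothetical, and the search is cut off at Wielandt's bound $(k-1)^{2}+1$ for a $k\times k$ matrix.

```python
from itertools import product
import numpy as np

def transition_matrix(alphabet, n, forbidden):
    """0-1 matrix on allowed n-words: u -> v iff v is u shifted by one symbol."""
    words = [w for w in product(alphabet, repeat=n) if w not in forbidden]
    index = {w: i for i, w in enumerate(words)}
    A = np.zeros((len(words), len(words)), dtype=np.int64)
    for u in words:
        for a in alphabet:
            v = u[1:] + (a,)          # overlap u[1:] followed by one new symbol
            if v in index:
                A[index[u], index[v]] = 1
    return words, A

def mixing_power(A):
    """Smallest N with all entries of A^N positive, or None if no such N exists."""
    k = A.shape[0]
    max_power = (k - 1) ** 2 + 1      # Wielandt's bound for primitive matrices
    B = A > 0
    P = B.copy()                      # boolean pattern of A^N, starting at N = 1
    for N in range(1, max_power + 1):
        if P.all():
            return N
        P = (P.astype(int) @ B.astype(int)) > 0
    return None

# Golden mean shift: binary sequences with no two consecutive 1s.
words, A = transition_matrix((0, 1), 2, {(1, 1)})
print(mixing_power(A))   # 3, so the golden mean shift is mixing

# Forbidding both 00 and 11 leaves only the alternating sequences (period 2).
_, A_alt = transition_matrix((0, 1), 2, {(0, 0), (1, 1)})
print(mixing_power(A_alt))   # None, so this SFT is not mixing
```

The second example shows why irreducibility alone is not enough: the alternating shift is a single cycle, so no power of its matrix is entrywise positive.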

To model stochastic behavior on the topological system $(\mathcal{X},S)$, we consider a family of $S$-invariant probability measures on $\mathcal{X}$, called Gibbs measures. To introduce Gibbs measures, one begins with a function $f:\mathcal{X}\to\mathbb{R}$, which is called a potential function (or just a potential). A Borel probability measure $\mu$ on $\mathcal{X}$ is said to be a Gibbs measure corresponding to the potential function $f$ if there exist constants $\mathcal{P}\in\mathbb{R}$ and $K>0$ such that for all $x\in\mathcal{X}$ and $m\geq 1$,

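$$K^{-1}\leq\frac{\mu\bigl(x[0,m-1]\bigr)}{\exp\Bigl(-\mathcal{P}m+\sum_{k=0}^{m-1}f\bigl(S^{k}(x)\bigr)\Bigr)}\leq K,\tag{1}$$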

where $x[0,m-1]$ is the cylinder set of points $y$ in $\mathcal{X}$ such that $x_{i}=y_{i}$ for all $i=0,\dots,m-1$. The property in (1) is called the Gibbs property. By a celebrated result of Bowen [7], under mild regularity conditions on $f$, there is a unique Gibbs measure $\mu\in\mathcal{M}(\mathcal{X},S)$ with potential function $f$, and furthermore the measure $\mu$ is ergodic. The constant $\mathcal{P}=\mathcal{P}(f)$ is called the pressure of $f$.

The Gibbs measure is a generalization of the canonical ensemble in statistical physics to infinite systems. Potential functions have natural connections with Hamiltonians in the study of lattice systems in statistical physics. In considering inference, we will think of loss functions as potential functions. We remark (again) that the class of Gibbs measures strictly generalizes the class of Markov chains, allowing for arbitrarily long dependencies. Indeed, any Markov chain of order $k$ on the alphabet $\mathcal{A}$ can be realized as a Gibbs measure by an appropriate choice of a potential function that depends on only $k$ coordinates. On the other hand, when the potential function $f$ depends on infinitely many coordinates, the corresponding Gibbs measure is not Markov of any order. In this way, our model families may include Markov chains with unbounded orders, which highlights the degree of dependence allowed by our framework.
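The claim that a Markov chain is a Gibbs measure can be checked numerically in the simplest case. The Python sketch below takes a first-order chain on the full shift over $\{0,1\}$ with a strictly positive transition matrix $P$ (the numerical values and function names are hypothetical), uses the potential $f(x)=\log P_{x_{0}x_{1}}$, and takes the pressure to be zero. With these choices the ratio in (1) reduces to $\pi(x_{0})/P_{x_{m-1}x_{m}}$, where $\pi$ is the stationary distribution, so it is bounded above and below uniformly in $m$.

```python
import numpy as np

# Hypothetical transition matrix with strictly positive entries.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def stationary(P):
    """Stationary distribution: left eigenvector of P for the eigenvalue 1."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmax(np.real(evals))])
    return v / v.sum()

pi = stationary(P)

def log_cylinder_measure(word):
    """log mu(x[0, m-1]) for word = (x_0, ..., x_{m-1}) under the Markov measure."""
    w = np.asarray(word)
    return np.log(pi[w[0]]) + np.log(P[w[:-1], w[1:]]).sum()

def birkhoff_sum(ext_word):
    """sum_{k=0}^{m-1} f(S^k x) with f(x) = log P[x_0, x_1]; needs x_0, ..., x_m."""
    w = np.asarray(ext_word)
    return np.log(P[w[:-1], w[1:]]).sum()

def gibbs_ratio(ext_word):
    """Ratio in the Gibbs property with pressure 0, for ext_word = (x_0, ..., x_m)."""
    m = len(ext_word) - 1
    return float(np.exp(log_cylinder_measure(ext_word[:m]) - birkhoff_sum(ext_word)))

# A constant K that works for every m in this example.
K = max(pi.max() / P.min(), P.max() / pi.min())

rng = np.random.default_rng(0)
ratios = [gibbs_ratio(rng.integers(0, 2, size=m + 1)) for m in range(1, 60)]
print(min(ratios), max(ratios), 1.0 / K, K)
```

The bound on the ratio does not grow with the word length, which is the content of the Gibbs property. For a potential depending on infinitely many coordinates no such closed form is available, and the existence of $K$ rests on Bowen's theorem.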

Lastly, let us mention the regularity condition that we require our model families to satisfy. For points $x,y$ in $\mathcal{X}$, we let $n(x,y)$ denote the infimum of all $|m|$ such that $x_{m}\neq y_{m}$. Then we define a metric $d_{\mathcal{X}}(\cdot,\cdot)$ on $\mathcal{X}$ by setting $d_{\mathcal{X}}(x,y)=2^{-n(x,y)}$. For $r>0$, we let $C^{r}(\mathcal{X})$ denote the set of continuous functions from $\mathcal{X}$ to $\mathbb{R}$ with Hölder exponent $r$, that is, the set of functions $f:\mathcal{X}\to\mathbb{R}$ for which there exists a constant $c$ such that for all $x,y\in\mathcal{X}$,

$$|f(x)-f(y)|\leq c\,d_{\mathcal{X}}(x,y)^{r}.$$

Furthermore, we endow $C^{r}(\mathcal{X})$ with the topology induced by the norm $\|\cdot\|_{r}$ , where

$$\|f\|_{r}=\sup_{x\in\mathcal{X}}|f(x)|+\sup_{x\neq y}\frac{|f(x)-f(y)|}{d_{\mathcal{X}}(x,y)^{r}}.$$
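A small Python sketch may help fix the metric and the Hölder condition. Points of the shift space are represented by finite windows of coordinates $-W,\dots,W$ (an assumption of the sketch, as are the table `g` and all function names), and the potential depends only on $(x_{0},x_{1})$. Such a potential is Hölder for every $r>0$: if $f(x)\neq f(y)$ then $x$ and $y$ differ at a coordinate with $|m|\leq 1$, so $d_{\mathcal{X}}(x,y)\geq 1/2$, and the constant $c=2^{1+r}\max|g|$ works.

```python
import numpy as np

W = 32  # half-width of the window; coordinate m is stored at position W + m

def n_disagree(x, y):
    """Smallest |m| with x_m != y_m, or None if x and y agree on the whole window."""
    for m in range(W + 1):
        if x[W + m] != y[W + m] or x[W - m] != y[W - m]:
            return m
    return None

def d_X(x, y):
    """Metric d(x, y) = 2^{-n(x, y)}, with d = 0 when no disagreement is found."""
    n = n_disagree(x, y)
    return 0.0 if n is None else 2.0 ** (-n)

# Hypothetical two-coordinate potential f(x) = g[x_0, x_1].
g = np.array([[0.0, 1.0],
              [-0.5, 2.0]])

def f(x):
    return g[x[W], x[W + 1]]

def holder_constant(r):
    """A constant c with |f(x) - f(y)| <= c * d(x, y)^r for this particular f."""
    return 2.0 ** (1 + r) * np.abs(g).max()

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2 * W + 1)
y = x.copy()
y[W + 3] = 1 - y[W + 3]          # first disagreement at coordinate m = 3
print(d_X(x, y), abs(f(x) - f(y)))   # distance 2^{-3}, and f(x) = f(y)
```

Potentials that depend on infinitely many coordinates are Hölder when the influence of coordinate $m$ decays geometrically in $|m|$, which is the regime in which the models go beyond finite-order Markov chains.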

Now we define the regularity condition necessary for our model families.

Definition 1.

Let $\Theta$ be a compact metric space. A parametrized family of potential functions $\mathcal{F}=\{f_{\theta}:\theta\in\Theta\}$ will be called a regular family if there exists $r>0$ such that $\mathcal{F}\subset C^{r}(\mathcal{X})$ and the map $\theta\mapsto f_{\theta}$ is continuous in the topology induced by the norm $\|\cdot\|_{r}$.

If a family $\{f_{\theta}:\theta\in\Theta\}$ is a regular family, then the map $\theta\mapsto\mu_{\theta}$ is continuous in the weak∗ topology on measures, and the constants $K(f_{\theta})$ and $\mathcal{P}(f_{\theta})$ that appear in (1) depend continuously on $\theta$ (see [3]). Furthermore, since $\Theta$ is compact, we get a uniform Gibbs property: there exists a uniform constant $K$ and a continuous function $\theta\mapsto\mathcal{P}(f_{\theta})$ such that for all $\theta\in\Theta$ , $x\in\mathcal{X}$ , and $m\geq 1$ ,

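$$K^{-1}\leq\frac{\mu_{\theta}(x[0,m-1])}{\exp\Bigl(-\mathcal{P}(f_{\theta})m+\sum_{k=0}^{m-1}f_{\theta}(S^{k}x)\Bigr)}\leq K.$$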

We assume throughout that $\mathcal{F}=\{f_{\theta}:\theta\in\Theta\}$ is a regular family of potential functions, and our model class consists of the corresponding parametrized family of Gibbs measures $\{\mu_{\theta}:\theta\in\Theta\}$ .

1.3. Inference

The inference paradigm we consider is known as Gibbs posterior inference, which is a generalization of the standard Bayesian inference framework. The basic idea behind the Gibbs posterior [6, 25] is to replace the likelihood with an exponentiated loss or utility function in the standard Bayesian procedure for updating beliefs about an unknown parameter of interest $\theta$ . Whereas the standard Bayes posterior takes the form

$$\pi(\theta\mid\text{data})\propto\text{likelihood}(\text{data}\mid\theta)\times\text{prior}(\theta),$$

the Gibbs posterior has the form

$$\pi(\theta\mid\text{data})\propto\exp\bigl(-\ell(\text{data},\theta)\bigr)\times\text{prior}(\theta),$$

where $\ell(\mbox{data},\theta)$ is the loss associated with $\theta$ based on the observed data. When the loss function is the negative log-likelihood then the two paradigms are identical. The original motivation for the Gibbs posterior was to specify a coherent procedure for Bayesian inference when the parameter of interest is connected to observations via a loss function, rather than the classical setting where the likelihood or true sampling distribution is known; see [6] for more arguments in favor of the Gibbs posterior and discussion about how the Gibbs posterior framework addresses model misspecification and robustness to nuisance parameters. Note that in the general Gibbs posterior framework without a likelihood, there is no generative model assumed for the observations.
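The relationship between the two updates can be seen on a finite parameter grid. The Python sketch below is an illustration only: the Bernoulli model, the grid, the data, and the function name `gibbs_posterior` are all assumptions made for the example. It computes the Gibbs posterior for two losses and confirms that the negative log-likelihood loss reproduces the standard Bayes posterior, while a squared loss gives a different but still valid update.

```python
import numpy as np

def gibbs_posterior(prior, losses):
    """Weights proportional to exp(-loss(data, theta)) * prior(theta) on a finite grid."""
    log_w = np.log(prior) - losses
    log_w -= log_w.max()              # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical setting: i.i.d. binary data and a grid of Bernoulli parameters.
thetas = np.linspace(0.05, 0.95, 19)
prior = np.full(thetas.size, 1.0 / thetas.size)
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
ones, zeros = data.sum(), data.size - data.sum()

# Loss 1: negative log-likelihood, so the Gibbs posterior is the Bayes posterior.
nll = -(ones * np.log(thetas) + zeros * np.log1p(-thetas))
gibbs_nll = gibbs_posterior(prior, nll)

# Standard Bayes posterior computed directly from the likelihood.
likelihood = thetas ** ones * (1.0 - thetas) ** zeros
bayes = likelihood * prior / (likelihood * prior).sum()

# Loss 2: squared error, which requires no likelihood at all.
squared = ((data[None, :] - thetas[:, None]) ** 2).sum(axis=1)
gibbs_sq = gibbs_posterior(prior, squared)

print(np.allclose(gibbs_nll, bayes))     # True
print(thetas[np.argmax(gibbs_sq)])       # grid point nearest the sample mean 0.75
```

In the second case no sampling distribution for the data is ever specified, which is the sense in which the Gibbs posterior needs no generative model.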

We consider models indexed by a compact metric space $\Theta$ (with metric denoted $d_{\Theta}$ ), which will serve as a parameter space. The elements of $\Theta$ will be used to parametrize both the dependence structure of the Gibbs measures in our model class (e.g., transition probabilities) and the relationship between states and observations (e.g., emission probabilities). Recall that as part of our standard assumptions, we assume that our model class $\{\mu_{\theta}:\theta\in\Theta\}$ is a family of Gibbs measures on $\mathcal{X}$ corresponding to a regular family of potential functions (as in Definition 1).

Also recall that the observed system has state space $\mathcal{Y}$ with invariant measure $\nu$ . Define a metric on $\Theta\times\mathcal{X}$ by the rule

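$$d\bigl((\theta,x),(\theta^{\prime},x^{\prime})\bigr)=\max\bigl(d_{\Theta}(\theta,\theta^{\prime}),d_{\mathcal{X}}(x,x^{\prime})\bigr).$$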

Here and throughout this work, we assume that we have a loss function $\ell:\Theta\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ satisfying the following conditions:

(i) $\ell$ is continuous;

(ii) there exists a measurable function $\ell^{*}:\mathcal{Y}\to\mathbb{R}$ such that for all $y\in\mathcal{Y}$, $\sup_{\theta,x}|\ell(\theta,x,y)|\leq\ell^{*}(y)$, and $\int\ell^{*}\,d\nu<\infty$;

(iii) for each $\delta>0$ there exists a measurable function $\rho_{\delta}:\mathcal{Y}\to(0,\infty)$ such that for each $y\in\mathcal{Y}$,

$$\sup\bigl\{|\ell(\theta,x,y)-\ell(\theta^{\prime},x^{\prime},y)|:\,d\bigl((\theta,x),(\theta^{\prime},x^{\prime})\bigr)\leq\delta\bigr\}\leq\rho_{\delta}(y),$$

and $\lim_{\delta\to 0^{+}}\int\rho_{\delta}\,d\nu=0$.

Condition (ii) is an integrability condition on the loss, while condition (iii) is a requirement on the modulus of continuity of the loss. In Section 1.5 we provide examples of loss functions satisfying these conditions. Note that including the parameter $\theta$ in the loss function may be considered non-standard in statistics. However, this formulation will simplify notation throughout the paper, and in Section 1.5 we establish that this setting is equivalent to the standard one. Also note that the dependence of the loss on $\Theta$ and on the uncountable space $\mathcal{X}$ allows us to model continuous observations and emission probabilities.

With the loss function and parameter $\theta\in\Theta$ fixed, we define the loss of the finite sequence $x_{0}^{n-1}\in\mathcal{X}^{n}$ with respect to a finite sequence of observations $y_{0}^{n-1}\in\mathcal{Y}^{n}$ to be the sum of the per-state losses:

$$\ell_{n}(\theta;x_{0}^{n-1},y_{0}^{n-1})=\sum_{k=0}^{n-1}\ell(\theta,x_{k},y_{k}).$$

When $x_{0}^{n-1}=(x,Sx,\dots,S^{n-1}x)$ and $y_{0}^{n-1}=(y,Ty,\dots,T^{n-1}y)$ are initial segments of trajectories of $S$ and $T$ , respectively, we write $\ell_{n}(\theta,x,y)$ instead of $\ell_{n}(\theta;x_{0}^{n-1},y_{0}^{n-1})$ .

Let us now give the definition of Gibbs posterior distributions on $\Theta$ . Here we consider the subjective case, in which one begins with a prior distribution on $\Theta$ . Let $\pi_{0}$ be a fully supported Borel probability measure on $\Theta$ , which will serve as our prior distribution. First we extend $\pi_{0}$ to form a prior distribution on $\Theta\times\mathcal{X}$ as follows. Given the family $\{\mu_{\theta}:\theta\in\Theta\}$ of Gibbs measures on $\mathcal{X}$ , consider the induced prior distribution $P_{0}$ on $\Theta\times\mathcal{X}$ defined for any Borel set $E\subset\Theta\times\mathcal{X}$ by

$$P_{0}(E)=\int\int\mathbf{1}_{E}(\theta,x)\,d\mu_{\theta}(x)\,d\pi_{0}(\theta).$$

According to the Gibbs posterior paradigm [6, 25], if we make observations $(y,Ty,\dots,T^{n-1}y)\in\mathcal{Y}^{n}$ , then our updated beliefs should be represented by the Gibbs posterior distribution. This distribution is the Borel probability measure $P_{n}(\cdot\mid y)$ on $\Theta\times\mathcal{X}$ defined for Borel sets $E\subset\Theta\times\mathcal{X}$ by

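$$P_{n}(E\mid y)=\frac{1}{Z_{n}(y)}\int\int\mathbf{1}_{E}(\theta,x)\exp\bigl(-\ell_{n}(\theta,x,y)\bigr)\,d\mu_{\theta}(x)\,d\pi_{0}(\theta),$$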

where $Z_{n}(y)$ is the normalizing constant (partition function), given by

$$Z_{n}(y)=\int\exp\bigl(-\ell_{n}(\theta,x,y)\bigr)\,dP_{0}(\theta,x).$$

Then the Gibbs posterior distribution $\pi_{n}(\cdot\mid y)$ on $\Theta$ is simply the $\Theta$ -marginal of $P_{n}(\cdot\mid y)$ , which is defined for Borel sets $E\subset\Theta$ by

$$\pi_{n}(E\mid y)=P_{n}(E\times\mathcal{X}\mid y).$$

As we are considering a Bayesian framework, all inference about the parameters $\theta\in\Theta$ based on the observations $(y,Ty,\dots,T^{n-1}y)$ is derived from the posterior $\pi_{n}(\cdot\mid y)$ . We focus here on inference regarding the parameters in $\Theta$ , since inference regarding the initial condition $x$ in $\mathcal{X}$ is known to be impossible for many dynamical systems, including shifts of finite type [28, 29]. Let us summarize our framework.

• We begin with a fully supported prior $\pi_{0}$ on a compact set $\Theta$ that smoothly parametrizes a family of Gibbs measures $\{\mu_{\theta}:\theta\in\Theta\}$ on $\mathcal{X}$.

• From $\pi_{0}$ and $\{\mu_{\theta}:\theta\in\Theta\}$, we create an extended prior $P_{0}$ on $\Theta\times\mathcal{X}$.

• We obtain observations $y,\dots,T^{n-1}y$ in $\mathcal{Y}$ from a stationary ergodic process $(\mathcal{Y},T,\nu)$.

• From $P_{0}$, the observations, and the loss function $\ell$ we obtain the Gibbs posterior $P_{n}$ on $\Theta\times\mathcal{X}$.

• Finally, we marginalize $P_{n}$ to get the posterior $\pi_{n}$ on $\Theta$.
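The steps above can be sketched end to end in a toy setting where every integral is computable exactly. In the Python sketch below, all modelling choices are assumptions made for illustration: $\Theta$ is a finite grid, $\mu_{\theta}$ is a symmetric two-state Markov chain with flip probability $\theta$, the observations are binary, and the loss is $\beta\cdot\mathbf{1}(x_{0}\neq y_{0})$. Because the loss depends only on $x_{0}$ and the model is Markov, the integral of $\exp(-\ell_{n})$ against $\mu_{\theta}$ is given by a forward recursion.

```python
import numpy as np

def markov_model(p):
    """Hypothetical mu_theta: symmetric two-state chain with flip probability p."""
    P = np.array([[1 - p, p], [p, 1 - p]])
    pi = np.array([0.5, 0.5])            # stationary distribution of P
    return pi, P

def loss(a, b, beta=1.0):
    """Per-step loss beta * 1(x_0 != y_0); it does not depend on theta here."""
    return beta * float(a != b)

def log_integral(p, y, beta=1.0):
    """log of the integral of exp(-ell_n(theta, x, y)) against mu_theta (forward recursion)."""
    pi, P = markov_model(p)
    weight = lambda b: np.array([np.exp(-loss(a, b, beta)) for a in (0, 1)])
    alpha = pi * weight(y[0])
    log_total = 0.0
    for b in y[1:]:
        s = alpha.sum()
        log_total += np.log(s)           # accumulate the normaliser in log space
        alpha = (alpha / s) @ P * weight(b)
    return log_total + np.log(alpha.sum())

def gibbs_posterior(thetas, prior, y, beta=1.0):
    """Theta-marginal pi_n(. | y) on the grid, together with log Z_n(y)."""
    log_w = np.log(prior) + np.array([log_integral(p, y, beta) for p in thetas])
    log_Z = np.logaddexp.reduce(log_w)
    return np.exp(log_w - log_Z), log_Z

# Hypothetical observations: a binary chain that flips with probability 0.2.
rng = np.random.default_rng(0)
y = np.cumsum(rng.random(500) < 0.2) % 2

thetas = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.5])
prior = np.full(thetas.size, 1.0 / thetas.size)
posterior, log_Z = gibbs_posterior(thetas, prior, y)
print(posterior)
print(-log_Z / len(y))                   # empirical counterpart of -(1/n) log Z_n(y)
```

The sketch makes no claim about which grid point the posterior favours. For a general loss the limit set is determined by a variational problem and need not contain the flip probability that generated the data.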

1.4. Main results

Our analysis begins with an examination of the exponential growth rate of the (random) partition function $Z_{n}$ for large $n$ . In particular, we establish a variational principle for the almost sure limit of $n^{-1}\log Z_{n}$ as $n$ tends to infinity.

Theorem 1.

Under the standard assumptions stated above there exists a lower semicontinuous function $V:\Theta\to\mathbb{R}$ such that for $\nu$ -almost every $y\in\mathcal{Y}$ ,

$$\lim_{n}-\frac{1}{n}\log Z_{n}(y)=\inf_{\theta\in\Theta}V(\theta).$$

Remark 1.

The compactness of $\Theta$ and lower semicontinuity of $V$ ensure that the infimum in Theorem 1 is attained. The conclusion of the theorem is similar to a large deviations principle (see for example [15]), with $V:\Theta\to\mathbb{R}$ playing the role of the rate function. For this reason, we refer to $V$ as the rate function in this setting. A detailed discussion of $V$ appears in Section 3, where we show that $V$ can be expressed as the sum of an expected loss term and a divergence term.

The variational expression that appears in Theorem 1 suggests that we focus on the (non-empty, compact) set of parameters $\theta$ that minimize this expression. Let

$$\Theta_{\min}=\arg\min_{\theta\in\Theta}V(\theta).$$

In our second main result, we establish that the Gibbs posterior distribution must concentrate around this set.

Theorem 2.

For any open neighborhood $U$ of $\Theta_{\min}$ , for $\nu$ -almost every $y\in\mathcal{Y}$ , we have

$$\lim_{n}\pi_{n}\bigl(\Theta\setminus U\mid y\bigr)=0.$$

In light of this result, it is possible to answer questions about Gibbs posterior consistency by analyzing the variational problem defining $\Theta_{\min}$ . We illustrate this approach to posterior consistency in several applications (see Section 2).
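The way in which an exponential growth rate forces concentration on the set of minimizers can be illustrated by a deliberately simplified analogue. In the Python sketch below, $\Theta$ is a finite set and the contribution of each $\theta$ to the partition function is taken to be exactly $\pi_{0}(\theta)e^{-nV(\theta)}$. This is an assumption of the sketch, not a statement from the paper, where this behaviour holds only asymptotically and on the exponential scale. The values of `V` and the function names are hypothetical.

```python
import numpy as np

# Hypothetical rate function on a five-point parameter set; two minimizers.
V = np.array([0.30, 0.10, 0.50, 0.10, 0.25])
prior = np.full(V.size, 1.0 / V.size)

def toy_log_partition(n):
    """log of sum_theta prior(theta) * exp(-n V(theta)), computed stably."""
    return np.logaddexp.reduce(np.log(prior) - n * V)

def toy_posterior(n):
    """Posterior weights proportional to prior(theta) * exp(-n V(theta))."""
    return np.exp(np.log(prior) - n * V - toy_log_partition(n))

theta_min = np.flatnonzero(np.isclose(V, V.min()))

for n in (10, 100, 1000):
    rate = -toy_log_partition(n) / n                    # tends to min V = 0.10
    outside = 1.0 - toy_posterior(n)[theta_min].sum()   # mass off the minimizers
    print(n, rate, outside)
```

In this analogue the rate converges to $\min V$ with an error of order $1/n$, and the mass outside the minimizing set decays exponentially. The mass is shared between the two minimizers, in line with the remark that the posterior need not concentrate on a strict subset of $\Theta_{\min}$.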

Remark 2 (Optimality of $\Theta_{\min}$ ).

One may wonder whether $\pi_{n}(\cdot\mid y)$ actually concentrates around a strict subset of $\Theta_{\min}$ . Proposition 8 addresses this question on the exponential scale. It states that if $U\subset\Theta$ is open and intersects $\Theta_{\min}$ , then the posterior probability of $U$ cannot be exponentially small as $n$ tends to infinity, i.e., for $\nu$ -almost every $y$ , the quantity $n^{-1}\log\pi_{n}(U\mid y)$ tends to zero as $n$ tends to infinity.

Remark 3 (Ground states and MAP).

From a thermodynamic perspective, it is natural to introduce an inverse temperature parameter $\beta\in\mathbb{R}$ and consider the new loss function $\ell_{\beta}(\theta,x,y)=\beta\cdot\ell(\theta,x,y)$ . In this setting, one would like to understand what happens as $\beta$ tends to infinity. In Section 3.7, we identify the limit of both $V$ and $\Theta_{\min}$ as $\beta$ tends to infinity in terms of variational problems considered in previous work [36].

An inverse temperature parameter has also been used in practice to perform maximum a posteriori (MAP) estimation. MAP estimation is a common alternative to fully Bayesian inference that is used in both statistics and machine learning. It involves finding the posterior mode. The motivation for MAP estimation is often computational efficiency in settings where uncertainty quantification is not required. The idea of adding an inverse temperature parameter ($\beta$) to a Gibbs distribution for MAP estimation was introduced for Bayesian models in a seminal paper by Geman and Geman [18], who also gave an annealing schedule to increase the inverse temperature with a provable guarantee for finding the posterior mode.

Remark 4 (Connections to penalization).

The formulation of Bayesian updating as a variational problem with an entropic penalty has been previously explored [6, 51], and these ideas are related to Jaynes’ maximum entropy formulation of Bayesian inference [23]. In both [6] and [51], posterior inference was formulated as follows: given a loss function $\ell(\theta,x)$ and a prior $\pi$ , the posterior distribution is

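$$\pi(\theta\mid x)=\arg\min_{\mu}\biggl\{\int_{\theta}\ell(\theta,x)\,d\mu(\theta)+d_{KL}(\mu,\pi)\biggr\},$$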

where $d_{KL}(\mu,\pi)$ is the relative entropy between $\mu$ and $\pi$ . The function being minimized above has close connections to the function $V(\theta)$ in Theorem 1; see Definition 3 below.
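On a finite parameter set this variational characterization can be verified directly. The Python sketch below assumes that $d_{KL}(\mu,\pi)$ denotes the relative entropy of $\mu$ with respect to $\pi$, and uses randomly generated losses and prior weights; these values and the function names are hypothetical. It checks that the distribution with weights proportional to $\pi(\theta)e^{-\ell(\theta,x)}$ attains the value $-\log\sum_{\theta}\pi(\theta)e^{-\ell(\theta,x)}$ and that no randomly drawn competitor does better.

```python
import numpy as np

rng = np.random.default_rng(1)
prior = rng.dirichlet(np.ones(6))            # hypothetical prior on six parameters
losses = rng.uniform(0.0, 3.0, size=6)       # hypothetical losses for fixed data x

def objective(mu):
    """Expected loss under mu plus the relative entropy of mu with respect to the prior."""
    support = mu > 0                         # terms with mu = 0 contribute zero
    kl = np.sum(mu[support] * np.log(mu[support] / prior[support]))
    return float(mu @ losses + kl)

# Gibbs posterior: weights proportional to prior * exp(-loss).
gibbs = prior * np.exp(-losses)
Z = gibbs.sum()
gibbs = gibbs / Z

# Compare against many randomly drawn competitors on the simplex.
candidates = rng.dirichlet(np.ones(6), size=1000)
gap = min(objective(mu) for mu in candidates) - objective(gibbs)

print(objective(gibbs), -np.log(Z))          # the two values agree
print(gap)                                   # nonnegative
```

The reason is the identity that the objective at $\mu$ equals the relative entropy of $\mu$ with respect to the Gibbs posterior minus $\log Z$, so the minimum is attained exactly at the Gibbs posterior.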

Remark 5 (Convergence of full Gibbs posteriors).

Our main results establish the concentration of the $\Theta$ -marginal posterior distributions $\pi_{n}(\cdot\mid y)$ around the limit set $\Theta_{\min}$ . In contrast, the $\mathcal{X}$ -marginal of the full posterior distribution $P_{n}(\cdot\mid y)$ need not concentrate around any particular subset of $\mathcal{X}$ (according to the negative results of [28, 29]). Nonetheless, Proposition 9 gives a characterization of any Cesàro limit of the full posteriors.

Remark 6 (Importance of the Gibbs property).

The Gibbs property (1) of the measures $\mu_{\theta}$ makes them particularly suitable as model distributions for the purposes of Gibbs posterior inference. In general, invariant measures for dynamical systems do not admit such exponential estimates, and it is precisely these estimates that allow us to carry out our asymptotic analysis.

Remark 7 (Continuous model systems).

Results similar to those here may be established for certain families of differentiable dynamical systems on manifolds. In particular, using our results and the well-known connections between SFTs and Axiom A systems (see [7]), it is possible to establish analogous conclusions for Axiom A diffeomorphisms with Gibbs measures.

1.5. Examples of inference settings and associated loss functions

Here we describe some possible inference settings that fit into our framework. Note that these settings give examples of loss functions that satisfy conditions (i)-(iii).

Example 1.

(Continuous, deterministic observations) Suppose that the state space $\mathcal{Y}$ of the observed system is a subset of the real line, so that the observations $y,Ty,T^{2}y,\ldots$ are real-valued and deterministic. In this case, we might wish to fit the observations to a family of models generated by a family $\{\mu_{\theta}:\theta\in\Theta\}$ of Gibbs measures on a fixed mixing SFT $\mathcal{X}$, with associated prior $\pi_{0}$, and a continuously parametrized family $\{\varphi_{\theta}:\mathcal{X}\to\mathbb{R}\}$ of continuous observation functions. Given $\theta$ and $x$, the initial part of the real-valued sequence $\{\varphi_{\theta}(S^{k}x)\}_{k\geq 0}$ can be fit to the observations. Models of this sort are called dynamical models, and they have been studied in the context of empirical risk minimization in [37]. If the measure $\nu$ has finite second moment, and $\ell$ is the squared loss $\ell(\theta,x,y)=|\varphi_{\theta}(x)-y|^{2}$, then conditions (i)-(iii) on the loss are satisfied.

Example 2.

(Discrete observations) Let $\mathcal{A}$ and $\mathcal{B}$ be finite sets. We suppose that we make $\mathcal{B}$ -valued observations, that is, $\mathcal{Y}\subset\mathcal{B}^{\mathbb{Z}}$ , and we wish to model these observations with a family of Gibbs measures $\{\mu_{\theta}:\theta\in\Theta\}$ on a mixing SFT $\mathcal{X}$ with $\mathcal{X}\subset\mathcal{A}^{\mathbb{Z}}$ . Further, let $\varphi:\mathcal{A}\to\mathcal{B}$ be an observation function, so that a point $x$ in $\mathcal{X}$ gives rise to the $\mathcal{B}$ -valued sequence $\{\varphi(x_{k})\}_{k\geq 0}$ . Let $\ell$ be the discrete loss, $\ell(\theta,x,y)=\mathbf{1}(\varphi(x_{0})\neq y_{0})$ . Then the conditions (i)-(iii) on the loss are satisfied.

Example 3.

(Family of conditional likelihoods) Suppose that $\{p(\cdot\mid x,\theta):\theta\in\Theta,x\in\mathcal{X}\}$ is a family of conditional densities on $\mathcal{Y}$ with respect to a common Borel measure $m$ on $\mathcal{Y}$. Here $p(\cdot\mid x,\theta)$ is the conditional likelihood of a single observation given the parameter $\theta$ and system state $x$. Under appropriate continuity and integrability conditions on the family of likelihoods, the negative log-likelihood function, $\ell(\theta,x,y)=-\log p(y\mid x,\theta)$, satisfies conditions (i)-(iii). In this situation, the Gibbs posterior is the same as the standard Bayes posterior. Furthermore, the dependence of the loss on the parameter $\theta$ allows one to parametrize the conditional observation densities, as in the parametrization of emission densities in the study of hidden Markov models. Note that in the Gibbs posterior framework, the true observation system $(\mathcal{Y},T,\nu)$ may be fully misspecified: it need not be related to any of the generative processes implied by the family of Gibbs measures and conditional likelihoods.

2. Applications

In this section we present two applications of our main results on Gibbs posterior consistency to standard posterior consistency for two models. In the first, we establish Bayesian posterior consistency for direct observations of Gibbs processes. Interestingly, the proof of this result may be reduced to a classical result of Bowen on uniqueness of equilibrium states. In the second application, we establish Bayesian posterior consistency for hidden Gibbs processes. This result generalizes previous results on posterior consistency for hidden Markov models by allowing substantially more dependence in the hidden processes, including families of Markov chains with unbounded orders.

2.1. Direct observations of Gibbs processes

Let $\mathcal{Z}$ be a mixing SFT, and let $\{f_{\theta}:\theta\in\Theta\}\subset C^{r}(\mathcal{Z})$ be a regular family of Hölder potential functions (as in Definition 1). We let $\mu_{\theta}$ be the unique Gibbs measure associated with $f_{\theta}$ . Additionally, let $\Pi_{0}$ be a fully supported prior distribution on $\Theta$ .

Here we consider the properly specified situation with direct observations. That is, we suppose that there exists $\theta^{*}\in\Theta$ such that our observations $Y_{0},Y_{1},\dots$ are the coordinates of the ergodic system $(\mathcal{Z},\sigma|_{\mathcal{Z}},\mu_{\theta^{*}})$ , observed without noise. In other words, the observation $Y_{k}$ at time $k\geq 0$ is the $k$ th coordinate of the stationary ergodic process $\mathbf{Y}=\{Y_{k}\}_{k\in\mathbb{Z}}\in\mathcal{Z}$ with distribution $\mu_{\theta^{*}}$ . In this case, the likelihood of observing $y_{0}\dots y_{n-1}$ under $\theta$ is simply given by $\mu_{\theta}([y_{0}^{n-1}])$ , where $[y_{0}^{n-1}]=\{z\in\mathcal{Z}:z_{0}^{n-1}=y_{0}^{n-1}\}$ . Let $\Pi_{n}(\cdot\mid y_{0}^{n-1})$ be the standard Bayesian posterior distribution on $\Theta$ given the observations $y_{0}^{n-1}$ , i.e., for Borel sets $A\subset\Theta$ ,

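$$\Pi_{n}(A\mid y_{0}^{n-1})=\frac{\int_{A}\mu_{\theta}([y_{0}^{n-1}])\,d\Pi_{0}(\theta)}{\int_{\Theta}\mu_{\theta}([y_{0}^{n-1}])\,d\Pi_{0}(\theta)}.$$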

Let $[\theta^{*}]$ denote the identifiability class of $\theta^{*}$ , which is naturally defined as the set of all $\theta$ such that $\mu_{\theta}=\mu_{\theta^{*}}$ . The following theorem states that $\Pi_{n}(\cdot\mid Y_{0}^{n-1})$ concentrates around $[\theta^{*}]$ , establishing posterior consistency in this setting.

Theorem 3.

Let $U\subset\Theta$ be an open neighborhood of $[\theta^{*}]$ . Then with probability one (with respect to $\mu_{\theta^{*}}$ ), we have

$$\lim_{n\to\infty}\Pi_{n}(\Theta\setminus U\mid Y_{0}^{n-1})=0.$$

The proof of Theorem 3 is based on the principal results above. In particular, we apply Theorem 2 to establish that the posterior concentrates around a set $\Theta_{\min}$ , which is characterized as the solution set of a variational problem. We then use the variational characterization of $\Theta_{\min}$ and additional problem-specific arguments to show that $\Theta_{\min}$ is equal to the identifiability class $[\theta^{*}]$ . In this example, these problem-specific arguments rely on one of the foundational results of Bowen [7] in the thermodynamic formalism, namely the uniqueness of equilibrium states for Hölder continuous potentials on a mixing SFT.
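To make the direct-observation setting concrete, here is a minimal Python sketch under simplifying assumptions that are not taken from the paper: the SFT is the full shift on $\{0,1\}$, the model family consists of symmetric two-state Markov chains with flip probability $\theta$ (a Markov measure is the Gibbs measure of a locally constant potential), and a uniform prior on a finite grid stands in for $\Pi_{0}$. The function names are hypothetical. The posterior weight of each grid point is proportional to the prior weight times the cylinder probability $\mu_{\theta}([y_{0}^{n-1}])$.

```python
import math
import random


def log_cylinder_prob(theta: float, y: list) -> float:
    """log mu_theta([y_0^{n-1}]) for the symmetric two-state Markov chain
    that flips state with probability theta (its stationary law is uniform)."""
    flips = sum(1 for a, b in zip(y, y[1:]) if a != b)
    stays = len(y) - 1 - flips
    return math.log(0.5) + flips * math.log(theta) + stays * math.log(1.0 - theta)


def grid_posterior(grid: list, prior: list, y: list) -> list:
    """Posterior weights proportional to prior(theta) * mu_theta([y]) on a finite grid."""
    log_w = [math.log(p) + log_cylinder_prob(t, y) for t, p in zip(grid, prior)]
    top = max(log_w)                       # subtract the max to avoid underflow
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def simulate_chain(theta: float, n: int, rng: random.Random) -> list:
    """Draw y_0, ..., y_{n-1} from the stationary chain with flip probability theta."""
    y = [rng.randrange(2)]
    for _ in range(n - 1):
        y.append(1 - y[-1] if rng.random() < theta else y[-1])
    return y


grid = [k / 10 for k in range(1, 10)]          # theta in {0.1, ..., 0.9}
prior = [1.0 / len(grid)] * len(grid)          # uniform prior on the grid
y_obs = simulate_chain(0.3, 2000, random.Random(0))
posterior = grid_posterior(grid, prior, y_obs)
```

With data generated at $\theta^{*}=0.3$, one would expect the weights in `posterior` to pile up at the grid point $0.3$ as the number of observations grows, which is the finite-grid analogue of the concentration described in Theorem 3.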

2.2. Hidden Gibbs processes

In this section we consider posterior consistency for more general observation processes. Let $\mathcal{X}$ be a mixing SFT, $\{f_{\theta}:\theta\in\Theta\}\subset C^{r}(\mathcal{X})$ a regular family of Hölder potential functions (as in Definition 1), and $\{\mu_{\theta}:\theta\in\Theta\}$ the corresponding family of Gibbs measures. Let $\Pi_{0}$ be any fully supported prior distribution on $\Theta$ .

The novel feature of the present setting is that we allow for general observations of the underlying family. Suppose that $m$ is a $\sigma$ -finite Borel measure on a complete separable metric space $\mathcal{U}$ , and that $\varphi:\Theta\times\mathcal{X}\times\mathcal{U}\to[0,\infty)$ is a jointly continuous function such that for all $\theta\in\Theta$ and $x\in\mathcal{X}$ ,

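$$\int_{\mathcal{U}}\varphi_{\theta}(u\mid x)\,dm(u)=1,\qquad\text{where }\varphi_{\theta}(u\mid x)=\varphi(\theta,x,u).$$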

We regard $\{\varphi_{\theta}(\cdot\mid x):\theta\in\Theta,\,x\in\mathcal{X}\}$ as a family of conditional likelihoods for $u\in\mathcal{U}$ given $\theta$ and $x$ . We assume that the function $L:\Theta\times\mathcal{X}\times\mathcal{U}\to\mathbb{R}$ given by $L(\theta,x,u)=-\log\varphi_{\theta}(u\mid x)$ satisfies the integrability and regularity conditions (i)-(iii) from Section 1.3. Furthermore, we require condition (L2) from [34], which stipulates that there exists $\alpha>0$ and a Borel measurable function $C:\Theta\times\mathcal{U}\to[0,\infty)$ such that for each $(\theta,u)\in\Theta\times\mathcal{U}$ , the function $L(\theta,\cdot,u):\mathcal{X}\to\mathbb{R}$ is $\alpha$ -Hölder continuous with constant $C(\theta,u)$ , and for each $\beta>0$ ,

[TABLE]

This condition may be viewed as a condition on the regularity of the conditional density functions; it is used in [34] to control the likelihood function in the large deviations regime.

With these conditions in place, we assume that the conditional likelihood of observing $u_{0}^{n-1}\in\mathcal{U}^{n}$ given $(\theta,x)\in\Theta\times\mathcal{X}$ is

$$p_{\theta}(u_{0}^{n-1}\mid x)=\prod_{k=0}^{n-1}\varphi_{\theta}(u_{k}\mid S^{k}x),$$

and that the likelihood of observing $u_{0}^{n-1}\in\mathcal{U}^{n}$ given $\theta\in\Theta$ is

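$$p_{\theta}(u_{0}^{n-1})=\int_{\mathcal{X}}\prod_{k=0}^{n-1}\varphi_{\theta}(u_{k}\mid S^{k}x)\,d\mu_{\theta}(x).$$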

In other words, for each $\theta\in\Theta$ , we have an observed sequence $U_{0},U_{1},\ldots$ generated as follows: select $X\in\mathcal{X}$ according to $\mu_{\theta}$ and for each $k\geq 0$ let $U_{k}\in\mathcal{U}$ have density $\varphi_{\theta}(\cdot\mid S^{k}X)$ with respect to $m$ . Denote by $\mathbb{P}^{U}_{\theta}$ the process measure for the process $\{U_{k}\}$ , which has likelihood $p_{\theta}$ .
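The generative description above can be sketched in Python for a toy case. Everything specific here is an assumption made for illustration: the hidden system is the symmetric two-state Markov chain with flip probability `flip`, the conditional density $\varphi_{\theta}(\cdot\mid x)$ is Gaussian with a mean that depends only on the zeroth coordinate of $x$, and $m$ is Lebesgue measure. Under these assumptions the integral over $\mu_{\theta}$ reduces to a finite sum over hidden words, which can be evaluated either by brute force or by the usual forward recursion for hidden Markov models.

```python
import itertools
import math


def gaussian_density(u: float, mean: float, sd: float = 1.0) -> float:
    """Density of N(mean, sd^2) at u, with respect to Lebesgue measure."""
    z = (u - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def path_prob(flip: float, x) -> float:
    """mu_theta of the cylinder [x_0 ... x_{n-1}] for the symmetric two-state chain."""
    flips = sum(1 for a, b in zip(x, x[1:]) if a != b)
    return 0.5 * flip ** flips * (1.0 - flip) ** (len(x) - 1 - flips)


def conditional_likelihood(means, x, u) -> float:
    """p_theta(u | x) = prod_k phi_theta(u_k | S^k x); here phi depends only on x_k."""
    return math.prod(gaussian_density(uk, means[xk]) for xk, uk in zip(x, u))


def likelihood_bruteforce(flip, means, u) -> float:
    """p_theta(u): the mu_theta-average of p_theta(u | x) over all hidden words x."""
    return sum(path_prob(flip, x) * conditional_likelihood(means, x, u)
               for x in itertools.product((0, 1), repeat=len(u)))


def likelihood_forward(flip, means, u) -> float:
    """The same quantity via the forward recursion, in time linear in len(u)."""
    alpha = [0.5 * gaussian_density(u[0], means[s]) for s in (0, 1)]
    for uk in u[1:]:
        alpha = [(alpha[s] * (1.0 - flip) + alpha[1 - s] * flip)
                 * gaussian_density(uk, means[s]) for s in (0, 1)]
    return sum(alpha)
```

The brute-force version mirrors the definition of the likelihood directly but costs $2^{n}$ evaluations; the forward recursion is available only because the hidden measure in this toy case is Markov, whereas the paper's model class allows longer-range dependence.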

Now let $\Pi_{n}(\cdot\mid u_{0}^{n-1})$ be the standard Bayesian posterior distribution on $\Theta$ given observations $u_{0}^{n-1}$ based on the prior $\Pi_{0}$ and the likelihood $p_{\theta}$ : for Borel sets $E\subset\Theta$ ,

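$$\Pi_{n}(E\mid u_{0}^{n-1})=\frac{\int_{E}p_{\theta}(u_{0}^{n-1})\,d\Pi_{0}(\theta)}{\int_{\Theta}p_{\theta}(u_{0}^{n-1})\,d\Pi_{0}(\theta)}.$$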

We again consider the properly specified case, in which there exists a parameter $\theta^{*}\in\Theta$ such that the observed process $\{Y_{n}\}$ is drawn from $\mathbb{P}_{\theta^{*}}^{U}$ . In order to address posterior consistency, we define the identifiability class of $\theta^{*}$ , denoted $[\theta^{*}]$ , to be the set of $\theta\in\Theta$ such that $\mathbb{P}^{U}_{\theta}=\mathbb{P}^{U}_{\theta^{*}}$ ; in other words, a parameter is in $[\theta^{*}]$ if its associated process has the same distribution as the process generated by $\theta^{*}$ . The following result establishes posterior consistency in this setting.

Theorem 4.

Let $E\subset\Theta$ be an open neighborhood of $[\theta^{*}]$ . Then

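$$\lim_{n\to\infty}\Pi_{n}(\Theta\setminus E\mid Y_{0}^{n-1})=0,\qquad\mathbb{P}^{U}_{\theta^{*}}\text{-almost surely}.$$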

The proof of Theorem 4 is based on the principal results above. In particular, we use these results to establish convergence of the posterior distribution, and then we use problem-specific arguments to prove that the limit set $\Theta_{\min}$ is equal to the identifiability class $[\theta^{*}]$. In this case, the problem-specific arguments rely on previously studied connections between large deviations for Gibbs measures and identifiability of observed systems [34].

3. Joinings, divergence, and the rate function

In this section we discuss the rate function $V:\Theta\to\mathbb{R}$ , whose existence is asserted by Theorem 1. In order to provide a thorough discussion, we first recall some background material from ergodic theory, including joinings and fiber entropy.

3.1. Joinings

Joinings were introduced by Furstenberg [16], and they have played an important role in the development of ergodic theory (see [11, 20]). Suppose $(\mathcal{U}_{0},R_{0},\eta_{0})$ and $(\mathcal{U}_{1},R_{1},\eta_{1})$ are two probability measure-preserving Borel systems with $R_{i}:\mathcal{U}_{i}\to\mathcal{U}_{i}$ and $\eta_{i}\in\mathcal{M}(\mathcal{U}_{i},R_{i})$ . The product transformation $R_{0}\times R_{1}:\mathcal{U}_{0}\times\mathcal{U}_{1}\to\mathcal{U}_{0}\times\mathcal{U}_{1}$ is defined by $(R_{0}\times R_{1})(u,v)=(R_{0}(u),R_{1}(v))$ . A joining of these two systems is a Borel probability measure $\lambda$ on $\mathcal{U}_{0}\times\mathcal{U}_{1}$ with marginal distributions $\eta_{0}$ and $\eta_{1}$ that is invariant under the product transformation $R_{0}\times R_{1}$ . Thus, a joining is a coupling of the measures $\eta_{0}$ and $\eta_{1}$ that is also invariant under the joint action of the transformations $R_{0}$ and $R_{1}$ ; the former condition concerns the invariant measures of the two systems, while the latter concerns their dynamics. Let $\mathcal{J}(\eta_{0},\eta_{1})$ denote the set of all joinings of $(\mathcal{U}_{0},R_{0},\eta_{0})$ and $(\mathcal{U}_{1},R_{1},\eta_{1})$ . Note that this set is non-empty, since the product measure $\eta_{0}\otimes\eta_{1}$ is always a joining. When the transformation $R_{0}:\mathcal{U}_{0}\to\mathcal{U}_{0}$ is fixed but we have not associated any invariant measure with it, we set

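$$\mathcal{J}(R_{0}:\eta_{1})=\bigcup_{\eta_{0}\in\mathcal{M}(\mathcal{U}_{0},R_{0})}\mathcal{J}(\eta_{0},\eta_{1}),$$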

which is the family of joinings of $(\mathcal{U}_{1},R_{1},\eta_{1})$ with all systems of the form $(\mathcal{U}_{0},R_{0},\eta_{0})$ , with $\eta_{0}\in\mathcal{M}(\mathcal{U}_{0},R_{0})$ .

3.2. Entropy

Our statements and proofs also require us to introduce some notions from the entropy theory of dynamical systems. Let $\mathcal{U}$ be a compact metric space, $R:\mathcal{U}\to\mathcal{U}$ continuous, and $\eta\in\mathcal{M}(\mathcal{U},R)$ . For any finite measurable partition $\alpha$ of $\mathcal{U}$ , we define

$$H(\eta,\alpha)=-\sum_{A\in\alpha}\eta(A)\log\eta(A),$$

where $0\cdot\log 0=0$ by convention. For $k\geq 0$ , let $R^{-k}\alpha=\{R^{-k}A:A\in\alpha\}$ , and for any partitions $\alpha^{0},\dots,\alpha^{n}$ , define their join to be the mutual refinement

$$\bigvee_{i=0}^{n}\alpha^{i}=\bigl\{A_{0}\cap\dots\cap A_{n}:A_{i}\in\alpha^{i}\bigr\}.$$

For $n\geq 0$ , let $\alpha_{n}=\bigvee_{k=0}^{n-1}R^{-k}\alpha$ . By standard subadditivity arguments, the following limit exists:

$$h_{R}(\eta,\alpha)=\lim_{n\to\infty}\frac{1}{n}H(\eta,\alpha_{n}).$$

The measure-theoretic or Kolmogorov-Sinai entropy of $(\mathcal{U},R)$ with respect to $\eta$ is given by $h_{R}(\eta)=\sup_{\alpha}h_{R}(\eta,\alpha)$ , where the supremum is taken over all finite measurable partitions $\alpha$ of $\mathcal{U}$ . We note for future reference that for any $\epsilon>0$ , the value $h_{R}(\eta)$ remains the same if the supremum is instead taken over all finite measurable partitions with diameter less than $\epsilon$ . When the transformation $R$ is clear from context, we may omit the subscript.
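As a small numerical illustration of these definitions, the Python sketch below computes the block entropies $H(\eta,\alpha_{n})/n$ for a toy example chosen here, not in the paper: $\eta$ is the stationary symmetric two-state Markov chain with flip probability `flip`, and $\alpha$ is the partition of the full shift on $\{0,1\}$ according to the zeroth coordinate, so that $\alpha_{n}$ consists of cylinders of length $n$. The function names are hypothetical. For this chain the entropy is known in closed form, which gives something to compare against.

```python
import itertools
import math


def partition_entropy(probs) -> float:
    """H(eta, alpha) = -sum_A eta(A) log eta(A), with 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cylinder_probs(flip: float, n: int) -> list:
    """eta(A) for each element A of alpha_n (cylinders of length n) under the
    stationary symmetric two-state chain with flip probability `flip`."""
    probs = []
    for word in itertools.product((0, 1), repeat=n):
        flips = sum(1 for a, b in zip(word, word[1:]) if a != b)
        probs.append(0.5 * flip ** flips * (1.0 - flip) ** (n - 1 - flips))
    return probs


def entropy_rate(flip: float) -> float:
    """Closed-form entropy of the chain: -flip log flip - (1 - flip) log(1 - flip)."""
    return -flip * math.log(flip) - (1.0 - flip) * math.log(1.0 - flip)


flip = 0.3
block_rates = [partition_entropy(cylinder_probs(flip, n)) / n for n in (1, 2, 4, 8, 12)]
```

For a stationary Markov chain, $H(\eta,\alpha_{n})$ equals the entropy of the initial distribution plus $(n-1)$ times the entropy rate, so the values in `block_rates` decrease toward `entropy_rate(flip)` at rate $1/n$. Since this zero-coordinate partition generates under the shift, the limit is the full entropy $h(\eta)$.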

3.3. The variational principle for pressure

Let $\mathcal{X}$ be a mixing SFT, and let $f:\mathcal{X}\to\mathbb{R}$ be a Hölder continuous potential. The variational principle [7] for the pressure $\mathcal{P}(f)$ states that

$$\mathcal{P}(f)=\sup\Bigl\{h(\mu)+\int f\,d\mu:\mu\in\mathcal{M}(\mathcal{X},S)\Bigr\},\tag{7}$$

and furthermore, the supremum is achieved by the measure $\mu\in\mathcal{M}(\mathcal{X},S)$ if and only if $\mu$ is the Gibbs measure associated with $f$.
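The variational principle can be checked by hand in the simplest case. The Python sketch below assumes, purely for illustration, that $\mathcal{X}$ is the full shift on $\{0,1\}$ and that the potential depends on two coordinates, $f(x)=g(x_{0},x_{1})$. In that case it is a standard fact that the pressure is the logarithm of the Perron eigenvalue of the matrix $M_{ij}=e^{g(i,j)}$ and that the Gibbs measure is the Markov measure with transitions $P_{ij}=M_{ij}r_{j}/(\lambda r_{i})$, where $r$ is a positive right eigenvector. The function names and the matrix `g` are hypothetical, and the comparison is restricted to one-step Markov measures rather than all invariant measures.

```python
import math


def perron_eigenvalue(m) -> float:
    """Largest eigenvalue of a positive 2x2 matrix."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))


def gibbs_transition(g):
    """Return (P, pressure) for the potential f(x) = g[x_0][x_1], where
    P_ij = M_ij r_j / (lam r_i), M_ij = exp(g[i][j]), and pressure = log(lam)."""
    m = [[math.exp(g[i][j]) for j in (0, 1)] for i in (0, 1)]
    lam = perron_eigenvalue(m)
    r = (m[0][1], lam - m[0][0])          # positive right eigenvector: M r = lam r
    p = [[m[i][j] * r[j] / (lam * r[i]) for j in (0, 1)] for i in (0, 1)]
    return p, math.log(lam)


def free_energy(p, g) -> float:
    """h(mu) + integral of f with respect to mu, for the stationary Markov
    measure mu with positive transition matrix p."""
    pi = (p[1][0] / (p[0][1] + p[1][0]), p[0][1] / (p[0][1] + p[1][0]))
    return sum(pi[i] * p[i][j] * (g[i][j] - math.log(p[i][j]))
               for i in (0, 1) for j in (0, 1))


g = [[0.2, -0.5], [1.0, -0.4]]
p_star, pressure = gibbs_transition(g)
gap_uniform = pressure - free_energy([[0.5, 0.5], [0.5, 0.5]], g)
```

Here `free_energy(p_star, g)` agrees with `pressure` up to rounding, while `gap_uniform` is strictly positive, in line with the uniqueness statement for the maximizing measure.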

3.4. Disintegration of measure

The following result is a special case of standard results on disintegration of Borel measures (see [20]).

Theorem (Disintegration of measure).

Let $\mathcal{U}$ and $\mathcal{Y}$ be standard Borel spaces, and $\phi:\mathcal{U}\times\mathcal{Y}\to\mathcal{Y}$ be the natural projection. Let $\lambda\in\mathcal{M}(\mathcal{U}\times\mathcal{Y})$ , and let $\nu=\lambda\circ\phi^{-1}$ be its image in $\mathcal{M}(\mathcal{Y})$ . Then there is a Borel map $y\mapsto\lambda_{y}$ , from $\mathcal{Y}$ to $\mathcal{M}(\mathcal{U})$ such that for every bounded Borel function $f:\mathcal{U}\times\mathcal{Y}\to\mathbb{R}$ ,

$$\int f\,d\lambda=\int_{\mathcal{Y}}\Bigl(\int_{\mathcal{U}}f(u,y)\,d\lambda_{y}(u)\Bigr)\,d\nu(y).$$

Moreover, such a map is unique in the following sense: if $y\mapsto\lambda^{\prime}_{y}$ is another such map, then $\lambda_{y}=\lambda^{\prime}_{y}$ for $\nu$ -almost every $y$ .

Note that if $\lambda$ is a joining, then the family $\{\lambda_{y}\}_{y\in\mathcal{Y}}$ satisfies an important invariance property, which we state as Lemma 3 in Section 5.3.

3.5. Fiber entropy

Now we give a definition of fiber entropy, along with statements of some properties relevant to this work; for a thorough introduction, see [27]. Let $\mathcal{U}$ be a compact metric space and $\mathcal{Y}$ be a separable complete metric space. Further, let $R:\mathcal{U}\to\mathcal{U}$ be continuous and $T:\mathcal{Y}\to\mathcal{Y}$ be Borel measurable. For any Borel probability measure $\lambda$ on $\mathcal{U}\times\mathcal{Y}$ with $\mathcal{Y}$-marginal $\nu$, let $\lambda=\int\lambda_{y}\otimes\delta_{y}\,d\nu(y)$ be its disintegration over $\mathcal{Y}$. Then for any finite measurable partition $\alpha$ of $\mathcal{U}$, we define

[TABLE]

Now suppose $\nu\in\mathcal{M}(\mathcal{Y},T)$. It is possible to show (see, e.g., [26]) that if $\lambda\in\mathcal{J}(R:\nu)$ and $\lambda=\int\lambda_{y}\otimes\delta_{y}\,d\nu(y)$ is its disintegration over $\nu$, then for every finite measurable partition $\alpha$ of $\mathcal{U}$ the following limit exists:

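$$h^{\nu}(\lambda,\alpha)=\lim_{n\to\infty}\frac{1}{n}\int_{\mathcal{Y}}H(\lambda_{y},\alpha_{n})\,d\nu(y),$$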

where $\alpha_{n}=\bigvee_{k=0}^{n-1}R^{-k}\alpha$. Furthermore, when $\lambda$ is ergodic, it can be shown (again see [26]) that for $\nu$-almost every $y$,

$$h^{\nu}(\lambda,\alpha)=\lim_{n\to\infty}\frac{1}{n}H(\lambda_{y},\alpha_{n}).$$

The fiber entropy of $\lambda$ over $\nu$ is defined as $h^{\nu}(\lambda)=\sup_{\alpha}h^{\nu}(\lambda,\alpha)$ , where the supremum is taken over all finite measurable partitions $\alpha$ of $\mathcal{U}$ . Note that the supremum may also be taken over partitions with diameter less than any $\epsilon>0$ . The fiber entropy $h^{\nu}(\lambda)$ quantifies the relative entropy of $\lambda$ over $\nu$ .

3.6. Divergence terms

Consider a parameter $\theta\in\Theta$ and a joining $\lambda\in\mathcal{J}(S:\nu)$ . We would like to quantify the divergence of the joining $\lambda$ to the product measure $\mu_{\theta}\otimes\nu$ , as it will play a role in the rate function $V$ . (Note that the measure $\mu_{\theta}\otimes\nu$ may be interpreted as a prior distribution on $\mathcal{X}\times\mathcal{Y}$ given $\theta$ , as the prior on $\mathcal{X}$ is assumed to be independent of the observations.) However, the standard KL-divergence is insufficient for our purposes, since any two ergodic measures for a given system are known to be mutually singular, and hence their KL-divergence will be infinite. Instead, we make the following definitions, which are more suitable for dynamical systems.

Given two Borel probability measures $\eta$ and $\gamma$ on a compact metric space $\mathcal{U}$ and a finite measurable partition $\alpha$ of $\mathcal{U}$ , we write $\eta\prec_{\alpha}\gamma$ whenever $\gamma(C)=0$ implies that $\eta(C)=0$ for $C\in\alpha$ . Let

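$$KL(\eta:\gamma\mid\alpha)=\begin{cases}\displaystyle\sum_{C\in\alpha}\eta(C)\log\frac{\eta(C)}{\gamma(C)},&\text{if }\eta\prec_{\alpha}\gamma,\\[1ex] +\infty,&\text{otherwise},\end{cases}$$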

where $0\cdot\log\frac{0}{x}=0$ for any $x$ by convention. Note that $KL(\eta:\gamma\mid\alpha)$ is the KL-divergence from $\gamma$ to $\eta$ with respect to the partition $\alpha$ , which is nonnegative.
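The partition-level divergence is easy to compute once both measures are evaluated on the elements of $\alpha$. The following Python sketch assumes the standard form of the definition, namely $\sum_{C\in\alpha}\eta(C)\log(\eta(C)/\gamma(C))$ when $\eta\prec_{\alpha}\gamma$ and $+\infty$ otherwise; the function name `partition_kl` and the list-based representation of the measures are assumptions made here.

```python
import math


def partition_kl(eta, gamma) -> float:
    """KL(eta : gamma | alpha), where eta[i] and gamma[i] are the masses that the
    two measures assign to the i-th element of the finite partition alpha."""
    if len(eta) != len(gamma):
        raise ValueError("both measures must be evaluated on the same partition")
    total = 0.0
    for e, g in zip(eta, gamma):
        if e == 0.0:
            continue                      # convention: 0 * log(0 / x) = 0
        if g == 0.0:
            return math.inf               # eta is not dominated by gamma on alpha
        total += e * math.log(e / g)
    return total
```

For instance, `partition_kl([0.5, 0.5], [1.0, 0.0])` returns `math.inf`, because the second partition element has positive $\eta$-mass but zero $\gamma$-mass, while swapping the two arguments gives a finite value.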

Now consider a Hölder continuous potential $f:\mathcal{X}\to\mathbb{R}$ on a mixing SFT $\mathcal{X}$ with associated Gibbs measure $\mu\in\mathcal{M}(\mathcal{X},S)$ . Let $\alpha$ be the partition of $\mathcal{X}$ into cylinder sets of the form $x[0]$ for some $x\in\mathcal{X}$ , and let $\eta\in\mathcal{M}(\mathcal{X},S)$ be ergodic. In this situation, it is known [8] that

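$$\lim_{n\to\infty}\frac{1}{n}KL(\eta:\mu\mid\alpha_{n})=\mathcal{P}(f)-\Bigl(h(\eta)+\int f\,d\eta\Bigr),$$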

where we recall that $\mathcal{P}(f)$ is the pressure of $f$ , the partition $\alpha_{n}$ is defined to be $\bigvee_{k=0}^{n-1}S^{-k}\alpha$ , and $h(\eta)$ is the entropy of $\eta$ with respect to $S$ . Next we generalize this result to handle the relative situation, which involves joinings and relative entropy.

Lemma 1.

Let $f:\mathcal{X}\to\mathbb{R}$ be a Hölder continuous potential on a mixing SFT $\mathcal{X}$ with associated Gibbs measure $\mu$ . Let $\alpha$ be the partition of $\mathcal{X}$ into cylinder sets of the form $x[0]$ , and let $\lambda\in\mathcal{J}(S:\nu)$ be ergodic. Then for $\nu$ -almost every $y\in\mathcal{Y}$ ,

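$$\lim_{n\to\infty}\frac{1}{n}KL(\lambda_{y}:\mu\mid\alpha_{n})=\mathcal{P}(f)-\Bigl(h^{\nu}(\lambda)+\int f\,d\lambda\Bigr).$$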

We defer the proof of Lemma 1 to Section 5.6. Based on this lemma, we make the following definition.

Definition 2.

Let $\mathcal{X}$ be a mixing SFT, $f:\mathcal{X}\to\mathbb{R}$ a Hölder continuous function, and $\mu$ the associated Gibbs measure. Further, let $(\mathcal{Y},T,\nu)$ be an ergodic system. Then define the relative divergence rate of $\lambda\in\mathcal{J}(S:\nu)$ to $\mu$ to be

$$D(\lambda:\mu)=\mathcal{P}(f)-\Bigl(h^{\nu}(\lambda)+\int f\,d\lambda\Bigr).$$

In the present setting, $D(\lambda:\mu)$ is always finite, and one may check that it is also nonnegative (see Lemma 8).

3.7. The rate function

In this section we define and discuss the rate function $V:\Theta\to\mathbb{R}$ whose existence is guaranteed by Theorem 1.

Definition 3.

For $\theta\in\Theta$ , let

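$$V(\theta)=\inf_{\lambda\in\mathcal{J}(S:\nu)}\Bigl\{\int\ell(\theta,x,y)\,d\lambda(x,y)+D(\lambda:\mu_{\theta})\Bigr\}.$$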

Note that the variational expression defining $V$ contains the sum of an expected loss term and a divergence term. It is known that Bayesian posterior distributions satisfy a similar variational principle in the finite sample setting (see [25, 52, 53]). Our results show that this interpretation passes to the limit as the number of samples tends to infinity.

By Proposition 5, which appears in Section 6, we have that $V$ is lower semi-continuous. Since the loss function is continuous, the proof of Proposition 5 essentially follows from the upper semi-continuity of the fiber entropy on the space of joinings $\mathcal{J}(S:\nu)$.

Remark 8.

Consider the introduction of an inverse temperature parameter $\beta\in\mathbb{R}$ , as discussed in Remark 3, and let $\ell_{\beta}=\beta\cdot\ell$ be the associated loss function. If we let $V_{\beta}$ be the associated rate function, then we see from Definition 3 that

[TABLE]

Dividing by $\beta$ and letting $\beta$ tend to infinity to investigate the ground state behavior, it is clear that the associated variational expression is

[TABLE]

Interestingly, this variational expression has been studied recently as part of an asymptotic analysis of estimators based on empirical risk minimization for dynamical systems [36, 37]. Indeed, the solution set $\Theta_{\infty}$ of this ground state variational problem exactly characterizes the set of possible limits of parameter estimates that asymptotically minimize average empirical risk.

4. Connections to previous work

In the setting of i.i.d. samples, Doob [13] established Bayesian posterior consistency for almost every parameter value in the support of the prior using martingale methods. Later, Schwartz [43] gave necessary and sufficient conditions for posterior consistency at individual parameter values in the i.i.d. setting; these conditions require that the prior charge all KL-neighborhoods of the parameter and that there exist a sequence of tests giving exponential separation of the parameter from other parameters. The challenges and pitfalls of proving posterior consistency for nonparametric models were highlighted by Diaconis and Freedman in [12]. The negative results motivated much of the recent work in Bayesian nonparametrics, as well as studying convergence in other metrics on the space of probability distributions (such as Hellinger), and consideration of rates of convergence. For a detailed review of the Bayesian nonparametric literature we refer the reader to the recent book by Ghosal and van der Vaart [19].

Recent years have witnessed substantial interest in moving beyond the i.i.d. setting and considering statistical inference for dependent processes, including processes arising from dynamical systems. Statistical problems receiving recent attention in the context of dynamical systems include denoising (or filtering) [28, 29], consistency of maximum likelihood estimation [34], forecasting and density estimation [21, 44], empirical risk minimization [36, 37], and data assimilation and uncertainty quantification [30]. For a survey of this area, see [35]. Bayesian posterior consistency for dependent processes has also received attention in the literature. In particular, posterior consistency has been established for certain families of finite state hidden Markov chains [9, 14, 17, 45].

The idea of a variational formulation of Bayesian inference was developed by Zellner [51]. The link connecting statistical mechanics and information theory with Bayesian inference was at the heart of the inference framework advocated by Edwin T. Jaynes [24], a perspective that influenced Zellner [51]. Formulating Bayesian inference as a variational problem for infinite dimensional problems has been explored in the control theory and inverse problems literatures [33, 38, 39]. In [39] a variational formulation of Bayesian inference was developed for the problem of channel coding using ideas from statistical mechanics. In [38] a variational characterization of Bayesian nonlinear estimation was shown to take the same form as Gibbs measures in statistical mechanics. In [33], the authors studied the inference problem of finding the most likely path given a Brownian dynamics model from molecular dynamics, which takes the form of a gradient flow in a potential, subject to small thermal fluctuations. In this problem setup, a variational solution was proposed for Bayesian inference.

The setting and results of [36] and [37] are worthy of some discussion, as they may be considered frequentist analogues of the present work. Indeed, the setting of this previous work involves observations from an unknown ergodic system, a model family consisting of topological dynamical systems, and a loss function connecting the models to the observations, as in the present work. Given this setting, the previous work analyzes the convergence of parameter estimates obtained by empirical risk minimization, whereas we study the convergence of parameter estimates based on Bayesian updates (in the form of the Gibbs posterior). One additional difference is that the previous results on empirical risk minimization are more general, in the sense that the model families need only consist of continuous maps on compact metric spaces; whereas, in our Bayesian setting, we specialize to the case of SFTs with Gibbs measures. This focus on Gibbs measures in the Bayesian setting arises precisely because Gibbs measures satisfy the necessary exponential estimates (the Gibbs property (1)) to make the asymptotic analysis work. It should be noted that a Bayesian framework provides estimates of uncertainty which empirical risk minimization does not.

The Gibbs posterior principle can be derived from general principles as a valid method to update belief distributions in the presence of a loss function [6]. In particular, this framework for updating beliefs remains valid when one does not have access to a true likelihood. This inference framework has also been shown to have advantages in some settings [25]. One of the motivations for the use of the Gibbs posterior in [25] was that exponentiating a robust loss function can better accommodate model misspecification, e.g., when the assumed likelihood is not the sample generating process. A logistic regression example is provided in [25] for which the usual posterior-based logistic regression produces suboptimal classification error even from among the misspecified logistic regression models, while the Gibbs posterior is optimal. Another argument for using a loss-based approach comes from the robust statistics literature [22]. A key idea in robust statistics is that one can define loss functions that are not sensitive to contamination of standard error or likelihood models. Thus, even if the model is misspecified, inference using the robust loss function is still reliable. The advantage of the Gibbs posterior framework is that one can specify coherent Bayesian updating using a robust loss function and not have to specify the data generating process.

The thermodynamic formalism in dynamical systems, originally pioneered by Sinai, Ruelle, and Bowen, involves adaptation of many ideas and methods from statistical physics to the setting of dynamical systems, and it has played a large role in the development of ergodic theory and dynamical systems over many years. For an introduction to the area and some connections to statistical physics, see the books by Bowen [7], Ruelle [41], or Walters [46]. Let us mention that connections to Markov chains and other stochastic processes have a long history in this area [5, 49, 50]. Additionally, relative equilibrium states were studied by Ledrappier and Walters [31], and recent results on uniqueness of relative equilibrium states [1, 2, 4, 40] may contain interesting ideas to apply towards Bayesian posterior consistency.

5. Technical preliminaries

This section contains several technical results that we use later in the proofs of the main theorems.

5.1. Pressure Lemma

We refer to the following elementary fact, which is an easy consequence of Jensen’s inequality, as the Pressure Lemma; see [46, Lemma 9.9].

Lemma 2.

Let $a_{1},\dots,a_{k}$ be real numbers. If $p_{i}\geq 0$ and $\sum_{i=1}^{k}p_{i}=1$ , then

$$\sum_{i=1}^{k}p_{i}(a_{i}-\log p_{i})\leq\log\sum_{i=1}^{k}e^{a_{i}},$$

with equality if and only if

$$p_{i}=\frac{e^{a_{i}}}{\sum_{j=1}^{k}e^{a_{j}}}\quad\text{for }i=1,\dots,k.$$
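The Pressure Lemma is easy to test numerically. The Python sketch below assumes the statement as it appears in [46, Lemma 9.9], that is, $\sum_{i}p_{i}(a_{i}-\log p_{i})\leq\log\sum_{i}e^{a_{i}}$ with equality exactly when $p_{i}=e^{a_{i}}/\sum_{j}e^{a_{j}}$; the function names are hypothetical.

```python
import math


def pressure_lemma_sides(a, p):
    """Return (lhs, rhs) with lhs = sum_i p_i (a_i - log p_i) and
    rhs = log sum_i exp(a_i); terms with p_i = 0 contribute zero."""
    lhs = sum(pi * (ai - math.log(pi)) for ai, pi in zip(a, p) if pi > 0.0)
    shift = max(a)                                   # stabilised log-sum-exp
    rhs = shift + math.log(sum(math.exp(ai - shift) for ai in a))
    return lhs, rhs


def softmax(a):
    """The equalising weights p_i = exp(a_i) / sum_j exp(a_j)."""
    shift = max(a)
    w = [math.exp(ai - shift) for ai in a]
    total = sum(w)
    return [wi / total for wi in w]


a = [0.3, -1.2, 2.0, 0.0]
lhs_opt, rhs = pressure_lemma_sides(a, softmax(a))       # equality case
lhs_unif, _ = pressure_lemma_sides(a, [0.25] * 4)        # strict inequality
```

The left-hand side is an entropy term plus an average of the $a_{i}$, which is the finite-dimensional prototype of the quantity $h(\mu)+\int f\,d\mu$ in the variational principle for pressure.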

5.2. The space of joinings and the ergodic decomposition

Our proofs rely on a general version of the ergodic decomposition for invariant probability measures. The following version, a restatement of [42, Theorem 2.5], is sufficient for our purposes.

Theorem (The Ergodic Decomposition).

Suppose that $R:\mathcal{U}\to\mathcal{U}$ is a Borel measurable map of a Polish space $\mathcal{U}$ and that $\mu\in\mathcal{M}(\mathcal{U},R)$ . Then there exists a Borel probability measure $Q$ on $\mathcal{M}(\mathcal{U})$ such that

(1) $Q\bigl(\{\eta:\eta\text{ is invariant and ergodic for }R\}\bigr)=1$;

(2) if $f\in L^{1}(\mu)$, then $f\in L^{1}(\eta)$ for $Q$-almost every $\eta$, and

$$\int f\,d\mu=\int\Bigl(\int f\,d\eta\Bigr)\,dQ(\eta).$$

Whenever (2) holds, we write $\mu=\int\eta\,dQ$ .

Additionally, we require the following results about the structure of $\mathcal{J}(S:\nu)$ from [36].

Theorem (Structure of the space of joinings).

Suppose $R:\mathcal{U}\to\mathcal{U}$ is a continuous map of a compact metrizable space and $(\mathcal{Y},T,\nu)$ is an ergodic measure-preserving system as in Section 1. Then $\mathcal{J}(R:\nu)$ is non-empty, compact, and convex. Furthermore, a joining $\lambda\in\mathcal{J}(R:\nu)$ is an extreme point of $\mathcal{J}(R:\nu)$ if and only if $\lambda$ is ergodic for $R\times T$ . Lastly, if $\lambda\in\mathcal{J}(R:\nu)$ and $\lambda=\int\eta\,dQ$ is its ergodic decomposition, then $Q$ -almost every $\eta$ is in $\mathcal{J}(R:\nu)$ .

Let $\lambda\in\mathcal{J}(R:\nu)$ . By the above theorem, the ergodic decomposition of $\lambda$ is a representation of $\lambda$ as an integral combination of the extreme points of $\mathcal{J}(R:\nu)$ . A function $F:\mathcal{J}(R:\nu)\to\mathbb{R}$ is called harmonic if for each $\lambda\in\mathcal{J}(R:\nu)$ ,

$$F(\lambda)=\int F(\eta)\,dQ(\eta),$$

where $\lambda=\int\eta\,dQ$ is the ergodic decomposition of $\lambda$ .

5.3. Disintegration results

Suppose $R:\mathcal{U}\to\mathcal{U}$ is a continuous map of a compact metric space and $(\mathcal{Y},T,\nu)$ is an ergodic system. It is well known in ergodic theory (see [20]) that for any joining $\lambda\in\mathcal{J}(R:\nu)$ , if $\lambda=\int\lambda_{y}\otimes\delta_{y}\,d\nu(y)$ is its disintegration over $\nu$ , then the family of measures $\{\lambda_{y}\}_{y\in\mathcal{Y}}$ satisfies an additional invariance property, which we state in the following lemma.

Lemma 3.

Let $\lambda\in\mathcal{J}(R:\nu)$ , and let $\lambda=\int\lambda_{y}\otimes\delta_{y}\,d\nu(y)$ be its disintegration over $\nu$ . Then $(\lambda_{y}\otimes\delta_{y})\circ(R\times T)^{-1}=\lambda_{Ty}\otimes\delta_{Ty}$ for $\nu$ -almost every $y\in\mathcal{Y}$ , and hence, for every $f\in L^{1}(\lambda)$ and $\nu$ -almost every $y\in\mathcal{Y}$ ,

$$\int f(Ru,Ty)\,d\lambda_{y}(u)=\int f(u,Ty)\,d\lambda_{Ty}(u).$$

5.4. Limiting average loss

The following lemma will be applied to the limiting average loss. Recall that when $R:\mathcal{U}\to\mathcal{U}$ is a continuous map of a compact metric space, the space $\mathcal{J}(R:\nu)$ of joinings is non-empty. For notation, if $f:\mathcal{U}\times\mathcal{Y}\to\mathbb{R}$ , then we let $f_{n}(u,y)=\sum_{k=0}^{n-1}f(R^{k}u,T^{k}y)$ .

Lemma 4.

Suppose that $R:\mathcal{U}\to\mathcal{U}$ is a Borel self-map of a complete metric space $\mathcal{U}$, and that $f:\mathcal{U}\times\mathcal{Y}\to\mathbb{R}$ is a Borel function for which there exists $f^{*}:\mathcal{Y}\to\mathbb{R}$ in $L^{1}(\nu)$ such that $\sup_{u\in\mathcal{U}}|f(u,y)|\leq f^{*}(y)$ for each $y\in\mathcal{Y}$. Then for any joining $\lambda\in\mathcal{J}(R:\nu)$ with disintegration $\lambda=\int\lambda_{y}\otimes\delta_{y}\,d\nu(y)$ over $\nu$, for $\nu$-almost every $y\in\mathcal{Y}$,

$$\lim_{n\to\infty}\frac{1}{n}\int f_{n}(u,y)\,d\lambda_{y}(u)=\int f\,d\lambda.$$

Proof.

For $y\in\mathcal{Y}$ define $\tilde{f}(y)=\int f(u,y)\,d\lambda_{y}(u)$. Then $\tilde{f}\in L^{1}(\nu)$, since $f\in L^{1}(\lambda)$ (using the hypotheses involving $f^{*}$). Now Lemma 3, together with the pointwise ergodic theorem, yields that for $\nu$-almost every $y$,

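$$\lim_{n\to\infty}\frac{1}{n}\int f_{n}(u,y)\,d\lambda_{y}(u)=\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}\tilde{f}(T^{k}y)=\int\tilde{f}\,d\nu=\int f\,d\lambda.$$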

∎

5.5. Fiber entropy

We require two additional properties of the fiber entropy in our setting. The first property is that fiber entropy is harmonic. This fact appears with proof as Lemma 3.2 (iii) in [31] in a setting under which $T:\mathcal{Y}\to\mathcal{Y}$ is a continuous map of a compact space, but careful inspection shows that the proof does not depend on this hypothesis.

Lemma 5.

The map $\lambda\mapsto h^{\nu}(\lambda)$ from $\mathcal{J}(R:\nu)$ to the non-negative extended reals satisfies the following property: if $\lambda=\int\eta\,dQ(\eta)$ is the ergodic decomposition of $\lambda$ , then

$$h^{\nu}(\lambda)=\int h^{\nu}(\eta)\,dQ(\eta).$$

Next, we note that the fiber entropy function is upper semi-continuous in our setting. The proof of Lemma 2.2 in [47] establishes upper semi-continuity of fiber entropy in a setting closely related to ours. By making only minor modifications of that proof, one may adapt it to our setting and prove the following lemma.

Lemma 6.

Let $\Theta$ , $(\mathcal{X},S)$ , and $(\mathcal{Y},T,\nu)$ be as in the introduction, and let $R=I_{\Theta}\times S$ act on the product space $\mathcal{U}=\Theta\times\mathcal{X}$ . Then the map $\lambda\mapsto h^{\nu}(\lambda)$ from $\mathcal{J}(R:\nu)$ to $\mathbb{R}$ is upper semi-continuous.

5.6. Divergence terms and average information

Define

$$L(\eta:\gamma\mid\alpha)=\sum_{C\in\alpha}\eta(C)\log\gamma(C),$$

where $0\cdot\log 0=0$ by convention. With these definitions, we always have

$$KL(\eta:\gamma\mid\alpha)=-H(\eta,\alpha)-L(\eta:\gamma\mid\alpha).\tag{10}$$

Recall that $H(\eta,\alpha)$ may be interpreted as the expected information of $\eta$ under the partition $\alpha$ , where the expectation is with respect to $\eta$ . In contrast, $-L(\eta:\gamma\mid\alpha)$ may be interpreted as the expected information of $\gamma$ under the partition $\alpha$ , where the expectation is again taken with respect to $\eta$ . In what follows, if $\alpha$ is a partition of a space $\mathcal{U}$ and $u\in\mathcal{U}$ , we let $\alpha(u)$ denote the partition element containing $u$ . Here we restate and then prove Lemma 1.

Lemma 7.

Let $f:\mathcal{X}\to\mathbb{R}$ be a Hölder continuous potential on a mixing SFT $\mathcal{X}$ with associated Gibbs measure $\mu$ . Let $\alpha$ be the partition of $\mathcal{X}$ into cylinder sets of the form $x[0]$ , and let $\lambda\in\mathcal{J}(S:\nu)$ be ergodic. Then for $\nu$ -almost every $y\in\mathcal{Y}$ ,

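$$\lim_{n\to\infty}\frac{1}{n}KL(\lambda_{y}:\mu\mid\alpha_{n})=\mathcal{P}(f)-\Bigl(h^{\nu}(\lambda)+\int f\,d\lambda\Bigr).$$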

Proof.

Recall that by the Gibbs property for $\mu$ , for any $n\geq 1$ and $x$ in $\mathcal{X}$ , we have

[TABLE]

Taking logarithms yields the bound

[TABLE]

As this inequality is uniform in $x$ , we may integrate with respect to $\lambda_{y}$ to obtain

[TABLE]

Dividing by $n$ and applying Lemma 4 gives

[TABLE]

It follows from (10) that $KL(\lambda_{y}:\mu\mid\alpha_{n})=-H(\lambda_{y},\alpha_{n})-L(\lambda_{y}:\mu\mid\alpha_{n})$ . Since $\lambda$ is ergodic, for $\nu$ -almost every $y$ , we have $n^{-1}H(\lambda_{y},\alpha_{n})\to h^{\nu}(\lambda,\alpha)=h^{\nu}(\lambda)$ , where the equality is a result of the fact that $\alpha$ is a generating partition for $(\mathcal{X},S)$ . Combining this fact with (11), we find that for $\nu$ -almost every $y$ ,

[TABLE]

as desired. ∎
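The identity $KL(\lambda_{y}:\mu\mid\alpha_{n})=-H(\lambda_{y},\alpha_{n})-L(\lambda_{y}:\mu\mid\alpha_{n})$ used in the proof above can be checked numerically on a finite partition. The following is a minimal sketch, assuming the standard definitions $H(\eta,\alpha)=-\sum_{C}\eta(C)\log\eta(C)$, $L(\eta:\gamma\mid\alpha)=\sum_{C}\eta(C)\log\gamma(C)$, and $KL(\eta:\gamma\mid\alpha)=\sum_{C}\eta(C)\log(\eta(C)/\gamma(C))$, and assuming $\gamma(C)>0$ whenever $\eta(C)>0$. The cell masses `eta` and `gamma` are hypothetical values chosen for illustration.

```python
import math

def entropy(eta):
    """H(eta, alpha) = -sum_C eta(C) log eta(C), with 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in eta if p > 0)

def avg_log(eta, gamma):
    """L(eta : gamma | alpha) = sum_C eta(C) log gamma(C)."""
    return sum(p * math.log(q) for p, q in zip(eta, gamma) if p > 0)

def kl(eta, gamma):
    """KL(eta : gamma | alpha) = sum_C eta(C) log(eta(C) / gamma(C))."""
    return sum(p * math.log(p / q) for p, q in zip(eta, gamma) if p > 0)

# Hypothetical masses assigned to the four cells of a partition alpha.
eta = [0.5, 0.25, 0.25, 0.0]
gamma = [0.4, 0.3, 0.2, 0.1]

lhs = kl(eta, gamma)
rhs = -entropy(eta) - avg_log(eta, gamma)
```

Here `lhs` and `rhs` agree up to floating-point error, and both are non-negative, as expected of a Kullback-Leibler divergence.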

Now we prove a lemma that guarantees that $D(\lambda:\mu_{\theta})\geq 0$ .

Lemma 8.

For each $\theta\in\Theta$ and $\lambda\in\mathcal{J}(S:\nu)$ ,

[TABLE]

Proof.

Let $\mu$ be the $\mathcal{X}$ -marginal of $\lambda$ . Then $h^{\nu}(\lambda)\leq h^{\nu}(\mu\otimes\nu)=h(\mu)$ , where the inequality follows from elementary information theoretic facts concerning conditional entropy (see [10]) and the equality is a basic property of fiber entropy. Then by the variational principle for pressure (7),

[TABLE]

as desired. ∎
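The variational principle for pressure invoked in this proof can be illustrated in the simplest possible case. The sketch below assumes a full shift on three symbols and a potential $f$ that depends only on the coordinate $x_{0}$, so that $\mathcal{P}(f)=\log\sum_{a}e^{f(a)}$ and the equilibrium state is the Bernoulli measure with marginal $q_{a}=e^{f(a)-\mathcal{P}(f)}$. Only Bernoulli competitors are tested, and the values in `f` and `candidates` are hypothetical.

```python
import math

def pressure(f):
    """Pressure of a one-coordinate potential on the full shift: log sum_a exp(f[a])."""
    m = max(f)
    return m + math.log(sum(math.exp(v - m) for v in f))

def free_energy(q, f):
    """h(q) + integral of f, for the Bernoulli measure with marginal q."""
    h = -sum(p * math.log(p) for p in q if p > 0)
    return h + sum(p * v for p, v in zip(q, f))

f = [0.3, -1.0, 0.8]                      # hypothetical potential values on symbols 0, 1, 2
P = pressure(f)
gibbs = [math.exp(v - P) for v in f]      # equilibrium marginal q_a = exp(f[a] - P)

# Bernoulli competitors, with the equilibrium marginal listed last.
candidates = [[1 / 3, 1 / 3, 1 / 3], [0.7, 0.2, 0.1], [0.0, 0.5, 0.5], gibbs]
gaps = [P - free_energy(q, f) for q in candidates]
```

Every entry of `gaps` is non-negative, and the gap vanishes (up to rounding) only for the last candidate, which is the Gibbs marginal.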

We now establish a lemma that is used in the proof of Theorem 1. This result allows us to approximate the expected information in the prior $P_{0}$ , where the expectation is with respect to an arbitrary measure, in terms of an average of a continuous function. These types of estimates are available precisely because our model class consists of Gibbs measures: indeed, they do not hold for arbitrary invariant measures for dynamical systems.

For any Borel probability measure $\eta$ on $\Theta\times\mathcal{X}$ , let $\eta_{n}$ denote its time-average up to time $n$ :

[TABLE]

where $I_{\Theta}:\Theta\to\Theta$ is the identity.

Lemma 9.

Let $K$ be the constant in the uniform Gibbs property (2). For any $\epsilon>0$ there exists $\delta>0$ such that if the diameter of $\alpha$ is less than $\delta$ and $\beta$ is the partition of $\mathcal{X}$ into cylinder sets of the form $x[0]$ , then for any Borel probability measure $\eta$ on $\Theta\times\mathcal{X}$ , and any $n\geq 0$ ,

[TABLE]

Proof.

Let $\epsilon>0$ . By the uniform continuity of $f_{\theta}$ and $\mathcal{P}(f_{\theta})$ in $\theta$ and the uniform Gibbs property, there exists $\delta>0$ such that if the diameter of $\alpha$ is less than $\delta$ and $\beta$ is the partition of $\mathcal{X}$ into sets of the form $x[0]$ , then for all $\theta\in\Theta$ , $x\in\mathcal{X}$ , and $n\geq 1$ ,

[TABLE]

Taking logarithms and dividing by $n$ , we obtain the inequality

[TABLE]

which is uniform over $(\theta,x)\in\Theta\times\mathcal{X}$ . Now let $\eta$ be any Borel probability measure on $\Theta\times\mathcal{X}$ . Then by integrating with respect to $\eta$ , we see that

[TABLE]

∎
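The Gibbs property underlying Lemmas 7 and 9 can be verified explicitly for a Markov measure. The following sketch assumes a full shift on two symbols and a hypothetical potential $f(x)=\varphi(x_{0},x_{1})$. With $A_{ab}=e^{\varphi(a,b)}$, spectral radius $\rho$, and positive left and right eigenvectors $l,r$, the pressure is $\log\rho$ and the equilibrium state is the Markov measure built from $A$, $l$, and $r$. For convenience the code pairs a word of length $n+1$ with its $n$ transitions, an indexing convention that may differ from (2) by one symbol and hence only affects the constant.

```python
import itertools
import math

def perron(A, iters=200):
    """Power iteration for the spectral radius and a positive eigenvector of A."""
    k = len(A)
    v = [1.0 / k] * k
    lam = 1.0
    for _ in range(iters):
        w = [sum(A[a][b] * v[b] for b in range(k)) for a in range(k)]
        lam = sum(w)               # sum(v) == 1, so sum(A v) converges to the spectral radius
        v = [x / lam for x in w]
    return lam, v

phi = [[0.2, -0.5], [0.7, -0.3]]   # hypothetical two-coordinate potential
A = [[math.exp(t) for t in row] for row in phi]
At = [list(col) for col in zip(*A)]
lam, r = perron(A)                 # right eigenvector
_, l = perron(At)                  # left eigenvector
Z = sum(x * y for x, y in zip(l, r))
log_pressure = math.log(lam)

def mu(word):
    """Markov measure of the cylinder determined by word."""
    p = l[word[0]] * r[word[0]] / Z
    for a, b in zip(word, word[1:]):
        p *= A[a][b] * r[b] / (lam * r[a])
    return p

def gibbs_ratio(word):
    """mu(word) divided by exp(S_n f - n P), where n is the number of transitions."""
    n = len(word) - 1
    s = sum(phi[a][b] for a, b in zip(word, word[1:]))
    return mu(word) / math.exp(s - n * log_pressure)

K_lo = min(x * y for x in l for y in r) / Z
K_hi = max(x * y for x in l for y in r) / Z
ratios = [gibbs_ratio(w)
          for n in range(1, 9)
          for w in itertools.product(range(2), repeat=n + 1)]
```

In this example the ratio equals $l_{x_{0}}r_{x_{n}}/\sum_{a}l_{a}r_{a}$ exactly, so every entry of `ratios` lies in `[K_lo, K_hi]` regardless of the word length.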

6. Semicontinuity of the rate function and $\Theta_{\min}$

Proposition 5.

The map $V:\Theta\to\mathbb{R}$ defined in Definition 3 is lower semi-continuous, and hence the set $\Theta_{\min}$ is compact and non-empty.

Proof.

Let $\mathcal{U}=\Theta\times\mathcal{X}$ and let $R:\mathcal{U}\to\mathcal{U}$ be given by $R=I_{\Theta}\times S$ , where $I_{\Theta}$ is the identity on $\Theta$ . Define $\psi:\mathcal{U}\times\mathcal{Y}\to\mathbb{R}$ by

[TABLE]

which is continuous and satisfies $\sup_{u\in\mathcal{U}}|\psi(u,y)|\leq\psi^{*}\in L^{1}(\nu)$ . Finally, define $F:\mathcal{J}(R:\nu)\to\mathbb{R}$ by

[TABLE]

Since $\psi$ is continuous and $h^{\nu}$ is upper semi-continuous (by Lemma 6), $F$ is upper semi-continuous. Let $\operatorname{proj}_{\Theta}:\mathcal{J}(R:\nu)\to\mathcal{M}(\Theta)$ be defined by setting $\operatorname{proj}_{\Theta}(\lambda)$ to be the $\Theta$ -marginal of $\lambda$ , which is a continuous surjection of compact spaces. One may easily check from the definition of upper semicontinuity that the function

[TABLE]

is also upper semicontinuous. Since $V(\theta)$ is the negative of this function, we conclude that $V$ is lower semi-continuous.

For the second part of the proposition, we note that $\Theta_{\min}$ is the $\operatorname*{argmin}$ of the lower semi-continuous function $V$ on the compact set $\Theta$ , and hence it is non-empty and compact. ∎

7. Convergence of the partition function and a variational principle

In this section, we prove Theorem 1, which concerns the convergence of the average log normalizing constant (partition function) $n^{-1}\log Z_{n}$ . The starting point of the proof, which is an application of the Pressure Lemma, allows us to express the main statistical object, the Gibbs posterior distribution, as the solution of a variational problem involving information theoretic notions such as entropy and average information, which have long been studied in dynamics. The proof of Theorem 1 follows.

To ease notation slightly in this section, we let $g=-\ell$ and $g_{n}=-\ell_{n}$ , where $\ell_{n}$ is defined in (3). We also set $\mathcal{U}=\Theta\times\mathcal{X}$ and $R(\theta,x)=(\theta,S(x))$ . For $\lambda\in\mathcal{J}(R:\nu)$ , we will have use for the notation

[TABLE]

Although we do not use this fact, we note that $G(\lambda)$ can be written as an integral over $\theta$ of terms of the form $D(\lambda_{\theta}:\mu_{\theta})$ (as in Definition 2). Lemma 5 ensures that $h^{\nu}(\cdot)$ is harmonic, and therefore the same is true of $G:\mathcal{J}(R:\nu)\to\mathbb{R}$ . In this notation, our goal is to prove

[TABLE]

We present the proof in two stages: first we establish that the expression on the right-hand side is a lower bound for $\lim_{n}n^{-1}\log Z_{n}$ , and then we prove that the same expression provides an upper bound.

7.1. Lower bound

The goal of this section is to prove the following result.

Proposition 6.

For $\nu$ -almost every $y\in\mathcal{Y}$ ,

[TABLE]

where $Z_{n}=Z_{n}(y)$ .

Before proving this proposition, we first establish a lemma. If $\eta$ is a Borel probability measure on $\Theta\times\mathcal{X}$ and $\eta(C)>0$ , then let $\eta_{C}$ denote the conditional distribution $\eta(\cdot\mid C)$ . Also, we say that $\beta$ is a partition of $\mathcal{X}$ according to central words whenever $\beta=\{[x_{-m}^{m}]:x\in\mathcal{X}\}$ for some $m\geq 0$ .

Lemma 10.

Let $\alpha$ be a finite measurable partition of $\Theta$ with $\operatorname{diam}(\alpha)<\delta$ , and let $\beta$ be a partition of $\mathcal{X}$ according to central words such that $\operatorname{diam}(\beta)<\delta$ . Then for any Borel probability measure $\eta$ on $\Theta\times\mathcal{X}$ , any $y\in\mathcal{Y}$ , and any $n\geq 1$ ,

[TABLE]

where $\rho_{\delta}$ is the local difference function appearing in property (iii) of the loss.

Proof.

If $\eta\nprec_{\alpha\times\beta_{n}}P_{0}$ , then the inequality holds trivially. Now suppose $\eta\prec_{\alpha\times\beta_{n}}P_{0}$ , and let $\xi=\{C\in\alpha\times\beta_{n}:\eta(C)>0\}$ . For $C\in\xi$ and $(\theta,x),(\theta^{\prime},x^{\prime})\in C$ , property (iii) of the loss function and our hypotheses on $\alpha$ and $\beta$ yield

[TABLE]

Integrating out $(\theta^{\prime},x^{\prime})$ with respect to the conditional distribution $\eta_{C}$ gives

[TABLE]

After exponentiation and integration with respect to $P_{0,C}$ , we get

[TABLE]

Invoking Lemma 2 and the inequality above, we find that

[TABLE]

as was to be shown. ∎

Proof of Proposition 6. Fix an ergodic joining $\lambda\in\mathcal{J}(R:\nu)$ and $\epsilon>0$ . Let $\delta>0$ be sufficiently small that the bound of Lemma 9 holds and that $\int\rho_{\delta}\,d\nu<\epsilon$ (using property (iii) of the loss). Fix a finite measurable partition $\alpha$ of $\Theta$ such that $\operatorname{diam}(\alpha)<\delta$ , and select $m$ large enough so that the partition $\beta$ of $\mathcal{X}$ generated by central words of length $m$ satisfies $\operatorname{diam}(\beta)<\delta$ . Then for $\nu$ -almost every $y$ ,

[TABLE]

where the inequality follows from Lemma 10. Dividing each side of the inequality above by $n$ , and then letting $n$ tend to infinity, Lemma 4, Lemma 9, and the ergodic theorem together imply that for $\nu$ -almost every $y\in\mathcal{Y}$ ,

[TABLE]

Taking the supremum over all partitions $\alpha$ of $\Theta$ with diameter less than $\delta$ and all partitions $\beta$ of $\mathcal{X}$ generated by central words of length at least $m$ , we obtain the inequality

[TABLE]

Since $\epsilon>0$ was arbitrary,

[TABLE]

As this inequality holds for all ergodic $\lambda\in\mathcal{J}(R:\nu)$ and the left-hand side is harmonic in $\lambda$ , we have

[TABLE]

which completes the proof. $\Box$

7.2. Upper bound

In Proposition 7 below we establish an almost sure upper bound on the limiting behavior of $n^{-1}\log Z_{n}(y)$ . Together with the lower bound in Proposition 6, this completes the proof of Theorem 1.

Proposition 7.

For $\nu$ -almost every $y\in\mathcal{Y}$ ,

[TABLE]

We begin with a preliminary lemma. Recall that $P_{0}$ is the prior distribution on $\Theta\times\mathcal{X}$ generated by the prior $\pi_{0}$ (defined in (4)) and the family $\{\mu_{\theta}:\theta\in\Theta\}$ , while $P_{n}(\cdot\mid y)$ is the Gibbs posterior distribution associated with $y,Ty,\ldots,T^{n-1}y$ (defined in (5)). To simplify notation, in what follows $P_{n}(\cdot\mid y)$ is denoted by $P_{n}^{y}$ .

Lemma 11.

If $\alpha$ is a finite measurable partition of $\Theta\times\mathcal{X}$ with diameter less than $\delta$ then for $y\in\mathcal{Y}$ and $n\geq 1$ ,

[TABLE]

Proof.

Let $\alpha$ be a finite measurable partition of $\Theta\times\mathcal{X}$ with $\operatorname{diam}(\alpha)<\delta$ , and let $y\in\mathcal{Y}$ . By definition $P_{n}^{y}$ and $P_{0}$ are equivalent measures, and hence $P_{n}^{y}\prec_{\alpha_{n}}P_{0}$ and $P_{0}\prec_{\alpha_{n}}P_{n}^{y}$ . Let $\xi=\{C\in\alpha_{n}:P_{0}(C)>0\}=\{C\in\alpha_{n}:P_{n}^{y}(C)>0\}$ .

Fix $C\in\xi$ for the moment. For points $(\theta,x),(\theta^{\prime},x^{\prime})\in C$ the hypothesis on $\alpha$ ensures that

[TABLE]

where $\rho_{\delta}$ is defined in condition (iii) of the loss. Exponentiating both sides of the inequality and integrating $(\theta,x)$ with respect to the prior $P_{0,C}$ conditioned on being in $C$ yields

[TABLE]

Taking logarithms and integrating $(\theta^{\prime},x^{\prime})$ with respect to the posterior $P_{n,C}^{y}$ conditioned on being in $C$ yields

[TABLE]

By the definition of $P_{n}$ and Lemma 2 we have

[TABLE]

Applying inequality (12) to the terms of the final sum above, we see that

[TABLE]

as desired. ∎

Proof of Proposition 7. To begin the proof, define

[TABLE]

By [26, Lemma 2.1], for $\nu$ -almost every $y$ , the sequence $\{\eta_{n}^{y}\}_{n}$ is tight and all of its limit points are contained in $\mathcal{J}(R:\nu)$ . For a given $y$ in this set of full measure, let $\lambda$ be such a limit point, with $\eta_{n_{k}}^{y}\to\lambda$ .

Let $\epsilon>0$ , and choose $\delta>0$ such that $\int\rho_{\delta}\,d\nu<\epsilon$ . Choose a finite measurable partition $\alpha$ of $\Theta\times\mathcal{X}$ such that $\operatorname{diam}(\alpha)<\delta$ and $\operatorname{proj}_{\Theta\times\mathcal{X}}(\lambda)(\partial\alpha)=0$ (which exists since $\Theta\times\mathcal{X}$ is compact [46, Lemma 8.5]). By adapting an argument from [46, p. 190] involving subadditivity of measure-theoretic entropy, we obtain that for each $q\geq 1$ , for $n\geq q$ ,

[TABLE]

where $o(1)$ refers to a term that tends to $0$ as $n$ tends to infinity (for fixed $q$ ). Then by letting $n$ tend to infinity and applying [26, Lemma 2.1] again, we see that

[TABLE]

where the conditional entropy $H(\cdot\mid\mathcal{Y})$ is defined in (8). To proceed with the proof, we require the following lemma. Recall that at the beginning of this section, we set $g=-\ell$ and $g_{n}=-\ell_{n}$ .

Lemma 12.

Let $\{Q_{n}\}_{n}$ be any sequence of measures on $\Theta\times\mathcal{X}$ . For each $n\geq 1$ and $y\in\mathcal{Y}$ define

[TABLE]

If the subsequence $\{\eta_{n_{k}}\}_{k}$ converges to $\lambda$ , then

[TABLE]

Proof.

By definition of $\eta_{n}$ ,

[TABLE]

Then the desired limit follows from the fact that $\{\eta_{n_{k}}\}_{k}$ converges to $\lambda$ and $g$ is continuous. ∎

Combining Lemma 12 with Lemmas 9 and 11, we find that for $\nu$ -almost every $y\in\mathcal{Y}$

[TABLE]

Letting $q$ tend to infinity, we get

[TABLE]

Since $\epsilon$ was arbitrary, we obtain

[TABLE]

This concludes the proof of Proposition 7. $\Box$
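The convergence of $n^{-1}\log Z_{n}$ can be observed directly in a toy case. The sketch below makes several simplifying assumptions that are not part of the general setting: $\mathcal{X}$ is a single point, $\Theta$ is replaced by a finite grid with uniform prior, the loss is $\ell(\theta,y)=\mathcal{P}(f_{\theta})-f_{\theta}(y)$ with the hypothetical one-coordinate potential $f_{\theta}(y)=\theta y_{0}$ on $\{0,1\}$, and the observations form a periodic sequence in which the symbol $1$ has frequency $0.7$. Under these assumptions the fiber entropy term vanishes, the rate function reduces to $V(\theta)=\log(1+e^{\theta})-0.7\,\theta$, and $-n^{-1}\log Z_{n}$ should approach $\min_{\theta}V(\theta)$ over the grid.

```python
import math

def loss(theta, y0):
    """ell(theta, y) = P(f_theta) - f_theta(y) for f_theta(y) = theta * y_0."""
    return math.log1p(math.exp(theta)) - theta * y0

def log_Z(n, grid, y):
    """log Z_n for a uniform prior on grid, computed with a log-sum-exp."""
    exps = [-sum(loss(t, y[k]) for k in range(n)) for t in grid]
    m = max(exps)
    return m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(len(grid))

def V(theta, p=0.7):
    """Limiting average loss when the symbol 1 has frequency p."""
    return math.log1p(math.exp(theta)) - theta * p

grid = [-3 + 0.05 * i for i in range(121)]             # hypothetical parameter grid
y = [1 if k % 10 < 7 else 0 for k in range(2000)]      # periodic observations
V_min = min(V(t) for t in grid)
rates = {n: -log_Z(n, grid, y) / n for n in (10, 100, 1000, 2000)}
```

For every $n$ that is a multiple of the period, `rates[n]` lies between `V_min` and `V_min + log(len(grid)) / n`, so the gap closes at rate $O(1/n)$ in this finite-grid example.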

8. Convergence of Gibbs posterior distributions

The purpose of this section is to establish Theorem 2 concerning convergence of the Gibbs posterior distributions to the solution set of a variational problem. From the dynamics point of view, this convergence highlights the role of the variational problem and the associated equilibrium joinings. We believe these objects to be worthy of further study. From the statistical point of view, this result describes the concentration of posterior distributions, which is of interest in any frequentist analysis of Bayesian methods. The proof follows somewhat directly from Theorem 1.

Proof of Theorem 2. Let $U$ be an open neighborhood of $\Theta_{\min}$ . Let $F=\Theta\setminus U$ , which is closed and therefore compact. If $\pi_{0}(F)=0$ , then $\pi_{n}(F\mid y)=0$ for all $n$ . Now suppose $\pi_{0}(F)>0$ , and let $\tilde{\pi}_{0}=\pi_{0}(\cdot\mid F)$ be the conditional prior on $F$ . Let $V_{*}$ be the common value of $V(\theta)$ for $\theta\in\Theta_{\min}$ . As $V:\Theta\to\mathbb{R}$ is lower semi-continuous and $F$ is compact and disjoint from $\Theta_{\min}$ , there exists $\epsilon>0$ such that $\inf_{\theta\in F}V(\theta)\geq V_{*}+\epsilon$ . Now we apply Theorem 1 in two ways: first, with the full parameter set $\Theta$ and prior $\pi_{0}$ , and second, with $F$ in place of $\Theta$ and the conditional prior $\tilde{\pi}_{0}$ in place of $\pi_{0}$ . Let $Z_{n}^{F}$ denote the normalizing constant in the second case. Then for $\nu$ -almost every $y\in\mathcal{Y}$ , there exist $N_{1}=N_{1}(y)$ and $N_{2}=N_{2}(y)$ such that for all $n\geq N_{1}$ ,

[TABLE]

and for all $n\geq N_{2}$ ,

[TABLE]

Then for all $n\geq\max(N_{1},N_{2})$ , we have

[TABLE]

Thus, for $\nu$ -almost every $y\in\mathcal{Y}$ , we see that $\pi_{n}(F\mid y)$ tends to $0$. $\Box$
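The concentration asserted by Theorem 2 can be visualized in a small example. The sketch below is an illustration under strong simplifying assumptions rather than an implementation of the general procedure: $\mathcal{X}$ is trivial, $\Theta$ is a finite grid with uniform prior, the loss is $\ell(\theta,y)=\log(1+e^{\theta})-\theta y_{0}$, and the observations are a hypothetical periodic binary sequence with frequency $0.7$ of ones. In that case $V(\theta)=\log(1+e^{\theta})-0.7\,\theta$ is minimized at $\theta^{*}=\log(0.7/0.3)$, and the posterior mass of the complement of a neighborhood of $\theta^{*}$ should decay as $n$ grows.

```python
import math

def gibbs_posterior(n, grid, y):
    """Gibbs posterior on a finite grid: weights proportional to exp(-sum_k loss)."""
    logw = [-sum(math.log1p(math.exp(t)) - t * y[k] for k in range(n)) for t in grid]
    m = max(logw)                          # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in logw]
    total = sum(w)
    return [v / total for v in w]

grid = [-3 + 0.05 * i for i in range(121)]             # hypothetical parameter grid
y = [1 if k % 10 < 7 else 0 for k in range(1000)]      # periodic observations
theta_star = math.log(0.7 / 0.3)                       # minimizer of V in this toy model

def mass_outside(n, radius=0.5):
    """Posterior mass of F = {theta : |theta - theta_star| >= radius}."""
    post = gibbs_posterior(n, grid, y)
    return sum(p for t, p in zip(grid, post) if abs(t - theta_star) >= radius)

outside = [mass_outside(n) for n in (10, 100, 1000)]
```

The entries of `outside` decrease with $n$, and the last one is negligible, in line with the exponential decay obtained in the proof from the gap $\epsilon$ between $\inf_{F}V$ and $V_{*}$.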

9. Posterior consistency for Gibbs processes

Here we consider the problem of inference from direct observations of a Gibbs process, as described in Section 2.1. Recall that Gibbs processes allow one to model substantial degrees of dependence, with Markov chains of arbitrarily large order as a special case. In the present setting, we are able to establish posterior consistency (Theorem 3). The first step of the proof involves the application of our main results to show that the posterior distributions concentrate around $\Theta_{\min}$ . Interestingly, the second main step of the proof (showing that $\Theta_{\min}=[\theta^{*}]$ ) relies on a celebrated result of Bowen about uniqueness of equilibrium states in dynamics.

Proof of Theorem 3. To begin, let us first establish the connection between the setting of Section 2.1 and the general framework for Gibbs posterior inference in Section 1. Let $\mathcal{Z}$ , $\Theta$ , $\{f_{\theta}:\theta\in\Theta\}$ , $\{\mu_{\theta}:\theta\in\Theta\}$ and $\Pi_{0}$ be as in Section 2.1. In this particular application, we take $\mathcal{X}$ to be the trivial mixing SFT, which consists of exactly one point. Intuitively, $\mathcal{X}$ is unnecessary in this application because we make direct observations of the underlying trajectory (i.e., there is no need for an underlying “hidden” truth). As $\mathcal{X}$ is trivial in this application, we omit it in our notation. Next, we let the observed system $(\mathcal{Y},T,\nu)$ be $(\mathcal{Z},\sigma|_{\mathcal{Z}},\mu^{*})$ . Then we define the loss function $\ell:\Theta\times\mathcal{Y}\to\mathbb{R}$ by setting $\ell(\theta,y)=\mathcal{P}(f_{\theta})-f_{\theta}(y)$ . Using our regularity assumptions on $\Theta$ and $\{f_{\theta}:\theta\in\Theta\}$ , one may easily check that conditions (i)-(iii) are satisfied. We have now specified all the objects necessary for the general framework of Section 1. Let $\pi_{0}=\Pi_{0}$ , and for $n\geq 1$ , let $\pi_{n}(\cdot\mid y)$ be the Gibbs posterior distribution on $\Theta$ given observations $(y,\dots,T^{n-1}y)$ . We remind the reader that $\pi_{n}(\cdot\mid y)$ and $\Pi_{n}(\cdot\mid y_{0}^{n-1})$ are formally distinct distributions. Nonetheless, the following lemma shows that they are closely related.

Lemma 13.

Let $K$ be the uniform Gibbs constant for the family $\{\mu_{\theta}:\theta\in\Theta\}$ . Then for any Borel set $F\subset\Theta$ and $n\geq 1$ , for $y=\{y_{n}\}\in\mathcal{Y}$ ,

[TABLE]

Proof.

By the uniform Gibbs property, for any $y\in\mathcal{Y}$ , $\theta\in\Theta$ , and $n\geq 1$ , we have

[TABLE]

Let $F\subset\Theta$ be a Borel set. Integrating the inequality above with respect to $\pi_{0}=\Pi_{0}$ yields the inequalities

[TABLE]

Applying these upper and lower bounds to the sets $F$ and $\Theta$ we find that

[TABLE]

and similarly,

[TABLE]

∎

We require one additional fact before finishing the proof of the theorem.

Lemma 14.

Under the present hypotheses, $\Theta_{\min}=[\theta^{*}]$ .

Proof.

Recall that $\Theta_{\min}$ is defined as the set of $\theta\in\Theta$ such that $V(\theta)=\inf\{V(\theta^{\prime}):\theta^{\prime}\in\Theta\}$ , where $V(\theta)$ is the rate function

[TABLE]

As we have chosen $(\mathcal{X},S)$ to be trivial and $\nu=\mu_{\theta^{*}}$ in this application, the set of joinings $\mathcal{J}(S:\nu)$ contains only the trivial joining $\lambda=\delta_{x}\otimes\mu_{\theta^{*}}$ . Hence the definition of $\ell$ ensures that

[TABLE]

where we have used that the fiber entropy of $\delta_{x}\otimes\mu_{\theta^{*}}$ over $\mu_{\theta^{*}}$ is trivially zero. To finish the proof, we will show that $V(\theta)$ is minimized if and only if $\mu_{\theta}=\mu_{\theta^{*}}$ , and therefore $\Theta_{\min}=[\theta^{*}]$ .

First suppose that $\mu_{\theta}\neq\mu_{\theta^{*}}$ . By the uniqueness of the Gibbs measure $\mu_{\theta}$ (see [7]) and the variational principle for pressure, we have that

[TABLE]

Subtracting $\int f_{\theta}d\mu_{\theta^{*}}$ from both sides, we obtain

[TABLE]

Then by the variational principle for pressure and this inequality, we have

[TABLE]

Hence $\theta\notin\Theta_{\min}$ , and we conclude that $\Theta_{\min}\subset[\theta^{*}]$ .

Now suppose that $\mu_{\theta}=\mu_{\theta^{*}}$ . Then by the variational principle for pressure and the fact that $\mu_{\theta}=\mu_{\theta^{*}}$ ,

[TABLE]

Thus $\theta\in\Theta_{\min}$ , and since $\theta\in[\theta^{*}]$ was arbitrary, we have shown that $\Theta_{\min}=[\theta^{*}]$ . ∎
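The argument of Lemma 14 has a transparent form in the simplest case. As a hedged illustration, assume a full shift on $\{0,1\}$ and the hypothetical family $f_{\theta}(y)=\theta y_{0}$, so that $\mu_{\theta}$ is Bernoulli with marginal $p_{\theta}(a)=e^{\theta a-\mathcal{P}(f_{\theta})}$ and $\mathcal{P}(f_{\theta})=\log(1+e^{\theta})$. Then $V(\theta)=\mathcal{P}(f_{\theta})-\int f_{\theta}\,d\mu_{\theta^{*}}$, and a direct computation gives $V(\theta)-h(\mu_{\theta^{*}})=KL(p_{\theta^{*}}:p_{\theta})$, which is zero exactly when $\mu_{\theta}=\mu_{\theta^{*}}$. The code checks this identity numerically.

```python
import math

def bernoulli_gibbs(theta):
    """Pressure and one-coordinate marginal of the Gibbs measure for f(y) = theta * y_0."""
    P = math.log1p(math.exp(theta))
    return P, [math.exp(-P), math.exp(theta - P)]

def V(theta, p_star):
    """Rate function P(f_theta) minus the integral of f_theta against mu_{theta*}."""
    P, _ = bernoulli_gibbs(theta)
    return P - theta * p_star[1]

def kl(p, q):
    """Kullback-Leibler divergence between two distributions on {0, 1}."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

theta_star = 0.4                                   # hypothetical true parameter
_, p_star = bernoulli_gibbs(theta_star)
h_star = -sum(a * math.log(a) for a in p_star)     # entropy of mu_{theta*}

thetas = [-2.0, -0.5, 0.4, 1.0, 3.0]
gaps = [V(t, p_star) - h_star for t in thetas]
kls = [kl(p_star, bernoulli_gibbs(t)[1]) for t in thetas]
```

The lists `gaps` and `kls` coincide up to rounding, and the only zero entry is at $\theta=\theta^{*}$, which in this injective family means $\Theta_{\min}=[\theta^{*}]=\{\theta^{*}\}$.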

We now complete the proof of Theorem 3. Let $U\subset\Theta$ be an open set such that $[\theta^{*}]\subset U$ , and let $F=\Theta\setminus U$ . By Lemma 14, we have $[\theta^{*}]=\Theta_{\min}$ . Hence by Theorem 2, for $\mu^{*}$ -almost every $y\in\mathcal{Y}$ , the Gibbs posterior $\pi_{n}$ satisfies $\pi_{n}(F\mid y)\to 0$ . Then by Lemma 13, for $\mu^{*}$ -almost every $y\in\mathcal{Y}$ , we see that the standard posterior satisfies $\Pi_{n}(F\mid y_{0}^{n-1})\to 0$ , as desired. $\Box$

10. Posterior consistency for hidden Gibbs processes

In this section we establish posterior consistency for hidden Gibbs processes, as in Section 2.2. In addition to modeling substantial dependence with the underlying Gibbs processes, this setting also allows for quite general observational noise models. Note that hidden Markov models with arbitrarily large order appear as a special case in this framework. Here the first part of the proof involves an application of our main results to show that the posterior converges to the set $\Theta_{\min}$ . However, the second part of the proof begins with the well-known fact that the Gibbs measures $\mu_{\theta}$ satisfy large deviations principles (see [48]), and then relies on some recent results from [34] connecting these large deviations properties to the likelihood function in our general observational framework.

Proof of Theorem 4. We begin by placing the setting of Section 2.2 within the general framework of Section 1. Let $\mathcal{X}$ , $\{f_{\theta}:\theta\in\Theta\}$ , $\{\mu_{\theta}:\theta\in\Theta\}$ , $\Pi_{0}$ , $\mathcal{U}$ , $m$ , and $\{\varphi_{\theta}(\cdot\mid x):\theta\in\Theta,x\in\mathcal{X}\}$ be as in Section 2.2. To define the observation space in our general framework, we let $\mathcal{Y}=\mathcal{U}^{\mathbb{N}}$ . We define the map $T:\mathcal{Y}\to\mathcal{Y}$ to be the left-shift, i.e., if $y=\{y_{k}\}\in\mathcal{Y}$ , then $T(y)$ is the sequence whose $k$ -th coordinate is $y_{k+1}$ . Furthermore, we define $\nu=\mathbb{P}_{\theta^{*}}^{U}$ , which is the process measure on $\mathcal{Y}$ described in Section 2.2. Then $(\mathcal{Y},T,\nu)$ is an ergodic measure preserving system (see [34, Proposition 6.1] for ergodicity). Now define $\ell:\Theta\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ by $\ell(\theta,x,\{y_{k}\})=-\log\varphi_{\theta}(y_{0}\mid x)$ . Note that the conditions (i)-(iii) on $\ell$ are satisfied by our assumptions on $\varphi$ . Define $\pi_{n}(\cdot\mid y)$ to be the Gibbs posterior defined as in Section 1. Note that in this setting, if $y=\{y_{k}\}$ , then the Gibbs posterior $\pi_{n}(\cdot\mid y)$ is equal to the standard posterior $\Pi_{n}(\cdot\mid y_{0}^{n-1})$ . We require a few lemmas before finishing the proof of the theorem. Before we state the first such lemma, recall that $\ell^{*}$ denotes the $\nu$ -integrable function on $\mathcal{Y}$ appearing in property (ii) in Section 1.3.

Lemma 15.

Let $\theta\in\Theta$ . Then for each $n\geq 1$ and $y\in\mathcal{Y}$ ,

[TABLE]

Proof.

For notation, let

[TABLE]

First suppose that $I_{n}(y)\leq 1$ . Then by Jensen’s inequality and the definition of $\ell^{*}$ ,

[TABLE]

Now suppose that $I_{n}(y)>1$ . Then

[TABLE]

where we have used that both the logarithm and the exponential are increasing. ∎

Lemma 16.

Let $\theta\in\Theta$ . Then

[TABLE]

Proof.

For each $n\geq 1$ , let

[TABLE]

and let $F_{n}(y)=n^{-1}\sum_{k=0}^{n-1}\ell^{*}(T^{k}y)$ . By property (ii), $\ell^{*}$ is $\nu$ -integrable and thus the pointwise ergodic theorem ensures that $F_{n}(y)$ converges for $\nu$ -almost every $y$ to the constant $\mathbb{E}_{\theta^{*}}[\ell^{*}]$ . Furthermore, $\lim_{n}\mathbb{E}_{\theta^{*}}[F_{n}]=\mathbb{E}_{\theta^{*}}[\ell^{*}]$ . By Lemma 15, $|f_{n}|\leq F_{n}$ for each $n\geq 1$ . Therefore, by the generalized Lebesgue dominated convergence theorem and the definition of the loss,

[TABLE]

By Theorem 1, the $\mathbb{P}_{\theta^{*}}$ -almost sure limit of $\{f_{n}\}$ is equal to $V(\theta)$ . Combining these facts, we obtain the desired equality. ∎

Lemma 17.

Suppose $\theta\in\Theta\setminus[\theta^{*}]$ . Then

[TABLE]

Proof.

The well-known large deviations principles for the Gibbs measures $\mu_{\theta}$ [48] imply that they satisfy property (L1) from [34]. By hypothesis, $g$ satisfies the regularity of observations property (L2) from [34]. Then results from [34] (in particular Propositions 4.3 and 6.4) yield the desired inequality. ∎

We now proceed with the proof of Theorem 4. Recall that for $y=\{y_{k}\}\in\mathcal{Y}$ our choice of loss function ensures that the Bayesian posterior $\Pi_{n}(\cdot\mid y_{0}^{n-1})$ is equal to the Gibbs posterior $\pi_{n}(\cdot\mid y)$ . By Theorem 2, the Gibbs posterior $\pi_{n}(\cdot\mid Y)$ concentrates $\nu$ -almost surely around the set $\Theta_{\min}$ , defined as the set of $\theta\in\Theta$ such that $V(\theta)=\inf\{V(\theta^{\prime}):\theta^{\prime}\in\Theta\}$ . Hence $\Pi_{n}(\cdot\mid Y_{0}^{n-1})$ concentrates $\mathbb{P}_{\theta^{*}}^{U}$ -almost surely around $\Theta_{\min}$ . It remains to show that $\Theta_{\min}=[\theta^{*}]$ .

Suppose $\theta\in\Theta\setminus[\theta^{*}]$ . Then by Lemmas 16 and 17, we have

[TABLE]

It follows immediately that $\Theta_{\min}\subset[\theta^{*}]$ . For the reverse inclusion, note that if $\theta\in[\theta^{*}]$ , then $\mathbb{P}_{\theta}^{U}=\mathbb{P}_{\theta^{*}}^{U}$ , and thus for each $n$ ,

[TABLE]

Then Lemma 16 gives that $V(\theta)=V(\theta^{*})$ for each $\theta\in[\theta^{*}]$ . This concludes the proof of Theorem 4. $\Box$

11. Additional results

In this section we collect some auxiliary results about Gibbs posterior inference. We begin with a converse to Theorem 2 on the exponential scale: if $U$ is an open set intersecting $\Theta_{\min}$ , then the Gibbs posterior measure of $U$ cannot be exponentially small as $n$ tends to infinity.

Proposition 8.

Suppose $U\subset\Theta$ is open and $U\cap\Theta_{\min}\neq\varnothing$ . Then for $\nu$ -almost every $y\in\mathcal{Y}$ ,

[TABLE]

Proof.

Let $\theta_{0}\in U\cap\Theta_{\min}$ . By definition of $\Theta_{\min}$ we have $V(\theta_{0})=V_{*}=\inf_{\theta}V(\theta)$ . Fix $\epsilon>0$ and select $\delta>0$ sufficiently small that $\int\rho_{\delta}\,d\nu<\epsilon$ and that the ball $U_{0}$ of radius $\delta$ around $\theta_{0}$ is contained in $U$ . Since $\pi_{0}$ is fully supported, $\pi_{0}(U_{0})>0$ . Note that for each $y\in\mathcal{Y}$ and $n\geq 1$ ,

[TABLE]

Taking logarithms, dividing by $n$ , and letting $n$ tend to infinity yields

[TABLE]

As $\epsilon>0$ was arbitrary, we obtain the desired result. ∎

We now address the Cesàro convergence of the full posterior $P_{n}$ on $\Theta\times\mathcal{X}$ . Recall that we let $I_{\Theta}:\Theta\to\Theta$ be the identity map on $\Theta$ . In the thermodynamic formalism, invariant measures that achieve the optimal value in the variational expression for pressure are called equilibrium measures. In our setting, we introduce terminology for joinings that achieve the optimal value in the variational expression for the rate function. We will call a joining $\lambda\in\mathcal{J}(I_{\Theta}\times S:\nu)$ an equilibrium joining if

[TABLE]

Proposition 9.

For each $y\in\mathcal{Y}$ and $n\geq 1$ , let $Q_{n}(\cdot\mid y)\in\mathcal{M}(\Theta\times\mathcal{X})$ be defined for Borel sets $E\subset\Theta\times\mathcal{X}$ by

[TABLE]

Then for $\nu$ -almost every $y\in\mathcal{Y}$ , all limit points of $\{Q_{n}(\cdot\mid y)\}_{n\geq 1}$ are $(\Theta\times\mathcal{X})$ -marginals of equilibrium joinings.

Proof.

As in Section 7.2, let

[TABLE]

By definition, $Q_{n}(\cdot\mid y)$ is the $(\Theta\times\mathcal{X})$ -marginal of $\eta_{n}$ . Let $Q$ be a weak limit of the subsequence $\{Q_{n_{k}}(\cdot\mid y)\}_{k\geq 1}$ . By repeating the arguments of Section 7.2, one may show that there is a subsequence $\{n_{k_{j}}\}_{j\geq 1}$ such that $\{\eta_{n_{k_{j}}}\}_{j\geq 1}$ converges weakly to an equilibrium joining $\lambda$ . As $Q$ is necessarily the $(\Theta\times\mathcal{X})$ -marginal of the limit $\lambda$ , the proof is complete. ∎

Bibliography

[1] Mahsa Allahbakhshi and Anthony Quas. Class degree and relative maximal entropy. Transactions of the American Mathematical Society, 365(3):1347–1368, 2013.
[2] Mahsa Allahbakhshi, John Antonioli, and Jisang Yoo. Relative equilibrium states and class degree. Ergodic Theory and Dynamical Systems, pages 1–24, 2017.
[3] José F. Alves, Vanessa Ramos, and Jaqueline Siqueira. Equilibrium stability for non-uniformly hyperbolic systems. Ergodic Theory and Dynamical Systems, pages 1–24, 2018.
[4] John Antonioli. Compensation functions for factors of shifts of finite type. Ergodic Theory and Dynamical Systems, 36(2):375–389, 2016.
[5] Michael Benedicks and Lai-Sang Young. Markov extensions and decay of correlations for certain Hénon maps. Astérisque, 261:13–56, 2000.
[6] Pier Giovanni Bissiri, Chris C. Holmes, and Stephen G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
[7] Rufus Bowen. Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms, volume 470. Springer, Berlin, Heidelberg, 1975.
[8] J.-R. Chazottes, E. Floriani, and R. Lima. Relative entropy and identification of Gibbs measures in dynamical systems. Journal of Statistical Physics, 90(3-4):697–725, 1998.
