Bayesian Static Parameter Estimation for Partially Observed Diffusions via Multilevel Monte Carlo

Ajay Jasra; Kengo Kamatani; Kody J. H. Law; and Yan Zhou

arXiv:1701.05892·stat.CO·January 23, 2017·SIAM J. Sci. Comput.


TL;DR

This paper introduces a multilevel Monte Carlo approach combined with particle MCMC for efficient Bayesian parameter estimation in discretized partially observed diffusions, reducing computational cost while maintaining accuracy.

Contribution

It develops a novel MLMC method using coupling and importance sampling for Bayesian inference in discretized diffusions, improving efficiency over traditional methods.

Findings

1. Variance of weights is independent of data length

2. Method reduces computational cost for a given mean square error

3. Successfully applied to Ornstein-Uhlenbeck and Langevin processes


Tables

Table 1: Estimated rates of convergence of MSE with respect to cost for various parameters, fitted to the curves in Figure 2.

Model	Parameter	ML-PMCMC	PMCMC
Ornstein-Uhlenbeck process	$\theta$	$-1.022$	$-1.463$
	$\sigma$	$-1.065$	$-1.522$
Langevin SDE	$\theta$	$-1.060$	$-1.508$
	$\sigma$	$-1.023$	$-1.481$

Equations

$$\mathbb{E}_{\pi_{h_{L}}}[\varphi(\theta,X_{0:n})]=\sum_{l=0}^{L}\Big\{\mathbb{E}_{\pi_{h_{l}}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h_{l-1}}}[\varphi(\theta,X_{0:n})]\Big\}$$

$$\textrm{MSE}=\textrm{Bias}(L,\varphi)^{2}+\sum_{l=0}^{L}\frac{V_{l}}{N_{l}},$$

$$\textrm{Bias}(L,\varphi)=\big|\mathbb{E}_{\pi_{h_{L}}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi}[\varphi(\theta,X_{0:n})]\big|,$$

$$dX_{t}$$

$$\mathbb{P}_{\theta,h}(X_{0}\in A)=\int_{A}f_{\theta}(x)\,dx\quad\textrm{and}\quad\mathbb{P}_{\theta,h}(X_{p}\in A\mid X_{p-1}=x_{p-1})=\int_{A}f_{\theta,h}(x_{p-1},x_{p})\,dx_{p},\quad p\geq 1$$

$$\mathbb{P}_{\theta,h}(Y_{n}\in B\mid\{X_{k}\}_{k\geq 0}=\{x_{k}\}_{k\geq 0})=\int_{B}g_{\theta}(x_{n},y_{n})\,dy_{n},\quad n\geq 1$$

$$\pi_{h}(\theta,x_{0:n})\propto\pi_{\theta}(\theta)f_{\theta}(x_{0})\prod_{p=1}^{n}g_{\theta}(x_{p},y_{p})f_{\theta,h}(x_{p-1},x_{p}),$$

$$\pi_{h,h^{\prime}}(\theta,z_{0:n})\propto\pi_{\theta}(\theta)\nu_{\theta}(z_{0})\prod_{p=1}^{n}G_{p,\theta}(z_{p})Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p}).$$

$$\mathbb{E}_{\pi_{h}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h^{\prime}}}[\varphi(\theta,X_{0:n})]=\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n})H_{1,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{1,\theta}(\theta,Z_{0:n})]}-\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n}^{\prime})H_{2,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{2,\theta}(\theta,Z_{0:n})]}$$

$$H_{1,\theta}(\theta,z_{0:n})$$

$$H_{2,\theta}(\theta,z_{0:n})$$

$$\int_{A\times(\mathsf{W}\setminus\mathsf{V})}\eta(dw)=\int_{A}\pi_{h,h^{\prime}}(\theta,z_{0:n})\,d(\theta,z_{0:n}).$$

$$\pi_{h,h^{\prime}}(z_{0:n}\mid\theta)\propto\nu_{\theta}(z_{0})\prod_{p=1}^{n}G_{p,\theta}(z_{p})Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p})$$

$$p(a_{0:n-1}^{1:M},z_{0:n}^{1:M}|\theta)=\Big(\prod_{i=1}^{M}\nu_{\theta}(z_{0}^{i})\Big)\prod_{p=1}^{n}\prod_{i=1}^{M}\Big(\frac{G_{p-1,\theta}(z_{p-1}^{a_{p-1}^{i}})}{\sum_{j=1}^{M}G_{p-1,\theta}(z_{p-1}^{j})}Q_{\theta,h,h^{\prime}}(z_{p-1}^{a_{p-1}^{i}},z_{p}^{i})\Big),$$

$$p^{M}_{h,h^{\prime}}(y_{0:n}|\theta)=\prod_{p=1}^{n}\Big(\frac{1}{M}\sum_{j=1}^{M}G_{p,\theta}(z_{p}^{j})\Big)$$

$$1\wedge\frac{p^{M}_{h,h^{\prime}}(y_{0:n}|\theta^{\prime})}{p^{M}_{h,h^{\prime}}(y_{0:n}|\theta^{i-1})}\,\frac{\pi_{\theta}(\theta^{\prime})q(\theta^{i-1}|\theta^{\prime})}{\pi_{\theta}(\theta^{i-1})q(\theta^{\prime}|\theta^{i-1})}$$

$$\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{k^{i}})H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}{\frac{1}{N}\sum_{i=1}^{N}H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}-\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{\prime k^{i}})H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}{\frac{1}{N}\sum_{i=1}^{N}H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}.$$

$$\sum_{l=0}^{L}\bar{E}_{l}^{N_{l}}(\varphi),\qquad\bar{E}_{l}^{N_{l}}(\varphi)=E_{l}^{N_{l}}(\varphi)-E_{l}(\varphi),$$

$$E_{l}^{N_{l}}(\varphi)=\frac{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\varphi(\theta^{i},x_{0:n}^{i})H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}-\frac{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\varphi(\theta^{i},x_{0:n}^{\prime i})H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}$$

$$\mathbb{E}\Big[\Big(\sum_{l=0}^{L}\bar{E}_{l}^{N_{l}}(\varphi)\Big)^{2}\Big]=\sum_{l=0}^{L}\mathbb{E}\big[\bar{E}_{l}^{N_{l}}(\varphi)^{2}\big]$$

$$\underline{C}\leq g_{\theta}(x,y)\leq\overline{C}.$$

$$\Big(\int_{\Theta\times\mathsf{X}^{2k+2}}|\varphi(\theta,x_{0:k})-\varphi(\theta,x_{0:k}^{\prime})|^{q}\prod_{p=1}^{k}Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p})\,\pi_{\theta}(\theta)\nu_{\theta}(z_{0})\,d\theta\,dz_{0:k}\Big)^{3-q}\leq C(h^{\prime})^{\beta}.$$

$$\int_{\mathsf{W}}\varphi(w^{\prime})K(w,dw^{\prime})\geq\xi\int_{\mathsf{W}}\varphi(w)\nu(dw).$$

$$\begin{aligned}\mathbb{E}\Bigg[\Bigg(&\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{i})H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N}\sum_{i=1}^{N}H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}-\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{\prime i})H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N}\sum_{i=1}^{N}H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}\\&-\Bigg(\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n})H_{1,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{1,\theta}(\theta,Z_{0:n})]}-\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n}^{\prime})H_{2,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{2,\theta}(\theta,Z_{0:n})]}\Bigg)\Bigg)^{2}\Bigg]\leq\frac{C(h^{\prime})^{\beta}}{N}.\end{aligned}$$

$$\big|\mathbb{E}_{\pi_{h_{L}}}(\varphi(\theta,X_{0:n}))-\mathbb{E}_{\pi}(\varphi(\theta,X_{0:n}))\big|\leq Ch_{L}^{\alpha},$$

$$\sum_{l=0}^{L}\mathbb{E}\big[\bar{E}_{l}^{N_{l}}(\varphi)^{2}\big]\leq C\sum_{l=0}^{L}\frac{h_{l}^{\beta}}{N_{l}},$$

$$N_{l}\propto\epsilon^{-2}K_{L}h_{l}^{(\beta+\gamma)/2},$$

$$\mathbb{E}\Big[\Big|\sum_{l=1}^{L}E_{l}^{N_{l}}(\varphi)-\mathbb{E}_{\pi}(\varphi(\theta,X_{0:n}))\Big|^{2}\Big]\leq C\epsilon^{2},$$


Full text

Bayesian Static Parameter Estimation for Partially Observed Diffusions via Multilevel Monte Carlo

Ajay Jasra

Kengo Kamatani, Department of Engineering Science, Osaka University, JP

Kody J. H. Law, Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA

Yan Zhou

Abstract

In this article we consider static Bayesian parameter estimation for partially observed diffusions that are discretely observed. We work under the assumption that one must resort to discretizing the underlying diffusion process, for instance using the Euler-Maruyama method. Given this assumption, we show how one can use Markov chain Monte Carlo (MCMC) and particularly particle MCMC [Andrieu, C., Doucet, A. & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods (with discussion). J. R. Statist. Soc. Ser. B, 72, 269–342] to implement a new approximation of the multilevel (ML) Monte Carlo (MC) collapsing sum identity. Our approach comprises constructing an approximate coupling of the posterior density of the joint distribution over parameter and hidden variables at two different discretization levels and then correcting by an importance sampling method. The variance of the weights is independent of the length of the observed data set. The utility of such a method is that, for a prescribed level of mean square error, the cost of this MLMC method is provably less than i.i.d. sampling from the posterior associated to the most precise discretization. However, the method here comprises using only known and efficient simulation methodologies. The theoretical results are illustrated by inference of the parameters of two prototypical processes given noisy partial observations of the process: the first is an Ornstein-Uhlenbeck process and the second is a more general Langevin equation.

Key words: Multilevel Monte Carlo, Markov chain Monte Carlo, Diffusion Processes

1 Introduction

The Hidden Markov Model (HMM) is widely used in many disciplines, including applied mathematics, statistics, economics and finance; see [2] for an overview. In this article, we are interested in HMMs given by diffusions which are partially observed, discretely in time. In particular, we assume that in order to fit the model to the data, one must resort to a discretization of the diffusion, for instance, using Euler-Maruyama. In addition, we assume that associated to the model is a static (non-time-varying) finite dimensional parameter, which one wishes to infer given a data record of fixed length. In simple terms, the discretization, of level $h$ say, where as $h\rightarrow 0$ one obtains the exact diffusion, induces a posterior, say $\pi_{h}$, on the static parameter $\theta$ and the hidden states at the observation times, say $X_{0:n}$. We seek to approximate $\mathbb{E}_{\pi_{h}}[\varphi(\theta,X_{0:n})]$ for appropriately defined real-valued functions $\varphi$. Ultimately, one might seek to remove the dependence upon $h$ and obtain the exact expectation with no discretization bias. We remark that the model will be formally introduced in the next section. This framework is relevant to a broad range of applications in science and engineering; see [2, 17].

Computing the expectation for any fixed $h>0$ is a non-trivial task, which often requires quite advanced Monte Carlo methods. As has been remarked in many articles in the literature, often the joint correlation between $\theta$ and $X_{0:n}$ means that even standard MCMC methods may produce very inaccurate or inefficient approximations of the expectation of interest, despite their theoretical validity. An important class of algorithms that has, to an extent, helped to alleviate these difficulties is the particle MCMC (PMCMC) methods of [1] and their subsequent developments (e.g. [4]). Intrinsically, this method uses a sequential Monte Carlo (SMC) method (e.g. [7]) to help move the samples around the state-space, for instance, inside a Metropolis-Hastings acceptance/rejection scheme, although Gibbs versions also exist. PMCMC delivers a Markov chain which provides consistent estimates of expectations of the form $\mathbb{E}_{\pi_{h}}[\varphi(\theta,X_{0:n})]$, for any fixed $h$. SMC methods are well-known as being efficient techniques for filtering, when the state-variable at time $k$, $X_{k}$, is of moderate to low dimension and all the static parameters are fixed.

In the context of this article, there is an additional degree of freedom, which can be utilized to further enhance the PMCMC method. This is associated to the discretization level $h$. We consider using the multilevel Monte Carlo (MLMC) framework [8, 9, 11]. This allows one to leverage in an optimal way the nested problems arising in this context, hence minimizing the necessary cost to obtain a given level of mean square error. Let $\pi$ be the posterior on $\theta,X_{0:n}$ with no discretization bias and $\pi_{h_{l}}$ the time-discretized posterior on $\theta,X_{0:n}$ with time discretization $h_{l}$. Then, for an integrable, real-valued function $\varphi$ and $+\infty>h_{0}>h_{1}>\cdots>h_{L}>0$ (the levels), one has

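$$\mathbb{E}_{\pi_{h_{L}}}[\varphi(\theta,X_{0:n})]=\sum_{l=0}^{L}\Big\{\mathbb{E}_{\pi_{h_{l}}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h_{l-1}}}[\varphi(\theta,X_{0:n})]\Big\} \tag{1}$$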

where $\mathbb{E}$ is the expectation operator and $\mathbb{E}_{\pi_{h_{-1}}}[\varphi(\theta,X_{0:n})]:=0$ . The idea of MLMC is then to approximate each summand by independently simulating $N_{l}$ samples from a dependent coupling of $(\pi_{h_{l}},\pi_{h_{l-1}})$ . In such scenarios, one can show that the overall mean square error (MSE) associated to the approximation of $\mathbb{E}_{\pi}[\varphi(\theta,X_{0:n})]$ is:

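$$\textrm{MSE}=\textrm{Bias}(L,\varphi)^{2}+\sum_{l=0}^{L}\frac{V_{l}}{N_{l}}, \tag{2}$$

where

$$\textrm{Bias}(L,\varphi)=\big|\mathbb{E}_{\pi_{h_{L}}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi}[\varphi(\theta,X_{0:n})]\big|, \tag{3}$$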

and $0<V_{l}<+\infty$ are a collection of constants. We remark that it is the coupling of the samples that makes $V_{l}$ a function of $h_{l}$, which is often critical, as we explain below. Assuming a cost of $C_{l}$ per sample at level $l$, the cost of the algorithm is then $\sum_{l=0}^{L}C_{l}N_{l}$. Fixing $\epsilon>0$ and given an appropriate parameterization of $h_{l}$ (e.g. $h_{l}=2^{-l}$), one then chooses $L$ to ensure that $\textrm{Bias}(L,\varphi)^{2}=\mathcal{O}(\epsilon^{2})$ and then, given $C_{l},V_{l}$ characterised as functions of $h_{l}$, optimizes $N_{0},\dots,N_{L}$ to minimize the cost subject to $\sum_{l=0}^{L}\frac{V_{l}}{N_{l}}=\mathcal{O}(\epsilon^{2})$; [8] gives the solution to this constrained optimization problem. In many scenarios of practical interest the associated MLMC algorithm can achieve an MSE of $\mathcal{O}(\epsilon^{2})$ at a cost which is less than that of i.i.d. sampling from $\pi_{h_{L}}$; note that this has not yet been established in the problem under study here. The main issue is that sampling independently from the couples $(\pi_{h_{l}},\pi_{h_{l-1}})$ is not possible in our context.
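The allocation step described above can be sketched in a few lines. Minimizing $\sum_{l}C_{l}N_{l}$ subject to $\sum_{l}V_{l}/N_{l}=\epsilon^{2}$ with a Lagrange multiplier gives $N_{l}\propto\sqrt{V_{l}/C_{l}}$. The function name `allocate_samples` and the example rates ($V_{l}=h_{l}$, $C_{l}=h_{l}^{-1}$) are assumptions made for illustration, not values taken from the paper.

```python
import math

def allocate_samples(variances, costs, eps):
    """Return N_l minimizing sum(C_l * N_l) subject to sum(V_l / N_l) <= eps**2.

    Lagrange-multiplier solution:
        N_l = eps**-2 * sqrt(V_l / C_l) * sum_k sqrt(V_k * C_k),
    rounded up so the variance constraint still holds.
    """
    k = sum(math.sqrt(v * c) for v, c in zip(variances, costs))
    return [math.ceil(eps ** -2 * math.sqrt(v / c) * k)
            for v, c in zip(variances, costs)]

# Hypothetical rates: h_l = 2**-l, V_l = h_l, C_l = 1 / h_l.
levels = range(6)
V = [2.0 ** -l for l in levels]
C = [2.0 ** l for l in levels]
N = allocate_samples(V, C, eps=0.05)

variance_term = sum(v / n for v, n in zip(V, N))   # should not exceed eps**2
total_cost = sum(c * n for c, n in zip(C, N))
```

Most samples are placed on the coarse, cheap levels, and only a few on the fine, expensive ones.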

In this paper we show how to implement a new approximation of the multilevel collapsing sum identity. Our approach comprises constructing an approximate coupling of the posterior density of the joint on the parameter and hidden space at two different discretization levels and then correcting by an importance sampling method, for which the variance of the weights is independent of the length of the observed data set. The utility of such a method is that it comprises using known and efficient simulation methodologies, instead of coupling algorithms as explored in [13, 14, 15, 19]. In particular, our approach facilitates a mathematical analysis which allows us to establish that our approach can be better than sampling (e.g. by PMCMC) from the posterior associated to the most precise discretization. The algorithm presented here is distinct from either of the previously introduced multilevel MCMC (MLMCMC) algorithms [12, 16], and may be generalized.

This article is structured as follows. In Section 2 the model is described. In Section 3 we describe our approach and give a mathematical result associated to the MSE of the method. In Section 4 we give practical simulations to establish the theory. The appendix contains some of the proofs for the result of Section 3.

2 Model

We consider the following partially-observed diffusion process:

$$dX_{t}=a_{\theta}(X_{t})\,dt+b_{\theta}(X_{t})\,dW_{t}, \tag{4}$$

with $X_{t}\in\mathbb{R}^{d}=\mathsf{X}$ , $t\geq 0$ , $X_{0}$ has initial probability density $f_{\theta}$ and $\{W_{t}\}_{t\in[0,T]}$ a Brownian motion of appropriate dimension. $\theta\in\Theta\subseteq\mathbb{R}^{d_{\theta}}$ is a static parameter of interest. The following assumptions will be made on the diffusion process.

Assumption 2.1.

$a_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ , $b_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d}$ satisfy

(i) global Lipschitz property: there is a $C>0$ such that $|a_{\theta}(x)-a_{\theta}(y)|+|b_{\theta}(x)-b_{\theta}(y)|\leq C|x-y|$ for all $x,y\in\mathsf{X}$ and all $\theta\in\Theta$;

(ii) bounded moments: $\sup_{\theta\in\Theta}\mathbb{E}_{\theta}|X_{0}|^{p}<\infty$ for all $p\geq 1$.

Notice that (i) and (ii) together imply that $\mathbb{E}_{\theta}|X_{n}|^{p}<\infty$ for all $n$ .

It will be assumed that the data are regularly spaced (i.e. in discrete time) observations $y_{1},\dots,y_{n}$ , $y_{k}\in\mathbb{R}^{m}=\mathsf{Y}$ . It is assumed that conditional on $X_{k\delta}$ , for discretization $\delta>0$ , $Y_{k}$ is independent of all other random variables with density $g_{\theta}(x_{k\delta},y_{k})$ . For simplicity of notation let $\delta=1$ (which can always be done by rescaling time), so $X_{k}=X_{k\delta}$ . It is noted that we assume that one does not have access to a non-negative and unbiased estimate of the transition density of the diffusion and we are forced to work with a discretized process.
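To make the discretized process concrete, the following is a minimal sketch of how one might sample approximately from the transition density over one unit of time using Euler-Maruyama with step size $h$. It assumes a scalar state; the function name and the Ornstein-Uhlenbeck drift and diffusion used in the example are illustrative choices, not the paper's code.

```python
import math
import random

def euler_maruyama_unit_step(x, drift, diffusion, h, rng):
    """Propagate a scalar state over one unit of time with step size h.

    Uses 1/h Euler-Maruyama steps, so the returned value is a draw from the
    discretized transition (not from the exact diffusion transition).
    """
    n_steps = round(1.0 / h)
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(h))   # Brownian increment over a step of length h
        x = x + drift(x) * h + diffusion(x) * dw
    return x

# Hypothetical Ornstein-Uhlenbeck example: dX = -theta * X dt + sigma dW, X_0 = 1.
theta, sigma = 1.0, 0.5
rng = random.Random(0)
samples = [
    euler_maruyama_unit_step(1.0, lambda x: -theta * x, lambda x: sigma, 2.0 ** -5, rng)
    for _ in range(2000)
]
sample_mean = sum(samples) / len(samples)   # close to exp(-theta), up to O(h) bias
```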

The above formulation can then be summarized as follows, on discretizing the diffusion process with discretization level $h$. We have a pair of discrete-time stochastic processes, $\{X_{n}\}_{n\geq 0}$ and $\{Y_{n}\}_{n\geq 1}$, where $X_{n}\in\mathsf{X}$ (with associated $\sigma$-algebra $\mathcal{X}$) is an unobserved process and $Y_{n}\in\mathsf{Y}$ (with associated $\sigma$-algebra $\mathcal{Y}$) is observed. Let $\theta\in\Theta\subseteq\mathbb{R}^{d_{\theta}}$ be a parameter. The hidden process $\{X_{n}\}$ is a Markov chain with initial density $f_{\theta}$ at time $0$ and transition density $f_{\theta,h}(x_{p-1},x_{p})$, i.e. for each $\theta\in\Theta$

$$\mathbb{P}_{\theta,h}(X_{0}\in A)=\int_{A}f_{\theta}(x)\,dx\quad\textrm{and}\quad\mathbb{P}_{\theta,h}(X_{p}\in A\mid X_{p-1}=x_{p-1})=\int_{A}f_{\theta,h}(x_{p-1},x_{p})\,dx_{p},\quad p\geq 1, \tag{5}$$

where $\mathbb{P}_{\theta,h}$ denotes probability, $A\in\mathcal{X}$ and $dx_{n}$ is a dominating $\sigma$-finite measure. In addition, the observations $\{Y_{n}\}_{n\geq 1}$ conditioned upon $\{X_{n}\}_{n\geq 0}$ are statistically independent and have marginal density $g_{\theta}(x_{n},y_{n})$, i.e.

$$\mathbb{P}_{\theta,h}(Y_{n}\in B\mid\{X_{k}\}_{k\geq 0}=\{x_{k}\}_{k\geq 0})=\int_{B}g_{\theta}(x_{n},y_{n})\,dy_{n},\quad n\geq 1, \tag{6}$$

with $B\in\mathcal{Y}$ and $dy_{n}$ the dominating $\sigma$ -finite measure. The HMM is given by equations (5)-(6) and is often referred to in the literature as a state-space model. In our context $\theta\in\Theta$ is a parameter of interest with prior $\pi_{\theta}$ .

Given the joint density on $\mathsf{U}:=\Theta\times\mathsf{X}^{n+1}$

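$$\pi_{h}(\theta,x_{0:n})\propto\pi_{\theta}(\theta)f_{\theta}(x_{0})\prod_{p=1}^{n}g_{\theta}(x_{p},y_{p})f_{\theta,h}(x_{p-1},x_{p}),$$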

for $\varphi\in\mathcal{B}_{b}(\mathsf{U})\cap\textrm{Lip}(\mathsf{U})$ , where $\mathcal{B}_{b}(\mathsf{U})$ are the bounded and real-valued measurable functions on $\mathsf{U}$ and $\textrm{Lip}(\mathsf{U})$ are the Lipschitz, measurable functions on $\mathsf{U}$ , and for $+\infty>h_{0}>\cdots>h_{L}>0$ we would like to compute

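$$\mathbb{E}_{\pi_{h_{L}}}[\varphi(\theta,X_{0:n})]=\sum_{l=0}^{L}\Big\{\mathbb{E}_{\pi_{h_{l}}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h_{l-1}}}[\varphi(\theta,X_{0:n})]\Big\} \tag{7}$$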

where $\mathbb{E}_{\pi_{h_{-1}}}[\cdot]=0$ . We will use the MLMC approach.

Consider only a single pair $\mathbb{E}_{\pi_{h}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h^{\prime}}}[\varphi(\theta,X_{0:n})]$, $h<h^{\prime}$. It is well known that if one can sample from a dependent coupling of $(\pi_{h},\pi_{h^{\prime}})$, such as the maximal coupling, then Monte Carlo estimation of such a difference can be performed at a lower cost than i.i.d. sampling from the independent coupling of $(\pi_{h},\pi_{h^{\prime}})$ [8, 9]. The main issue is that such couplings are typically not available, even up to a non-negative and unbiased estimator. We consider the scenario where one samples from a sensible, approximate coupling and corrects via importance sampling.

3 Method and Analysis

3.1 Method

We are to approximate the identity (7). For the summands $l=1,\dots,L$, our procedure will be to run $L$ independent instances of the method described below, one for each pair of consecutive levels. The case $l=0$ simply uses (e.g.) PMCMC to approximate $\mathbb{E}_{\pi_{h_{0}}}[\varphi(\theta,X_{0:n})]$; we refer the reader to [1] for details on PMCMC, and a simple description is given below. We only consider a pair $\mathbb{E}_{\pi_{h}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h^{\prime}}}[\varphi(\theta,X_{0:n})]$, $h<h^{\prime}$. The methodology and analysis in this context of one pair will suffice to justify our approach, as we will explain below.

Let $z=(x,x^{\prime})\in\mathsf{X}\times\mathsf{X}=\mathsf{Z}$ and $Q_{\theta,h,h^{\prime}}(z,\bar{z})$ be any coupling (other than the independent one) of $(f_{\theta,h}(x,\bar{x}),f_{\theta,h^{\prime}}(x^{\prime},\bar{x}^{\prime}))$ . For instance, in the context of an Euler discretization a description can be found in [15] (see also appendix B). Let $G_{p,\theta}(z)=\max\{g_{\theta}(x,y_{p}),g_{\theta}(x^{\prime},y_{p})\}$ (note that alternative choices of $G_{p,\theta}$ are possible). We propose to sample from the probability density on $\mathsf{V}=\Theta\times\mathsf{X}^{2n+2}$ (write the associated $\sigma-$ algebra as $\mathcal{V}$ )

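$$\pi_{h,h^{\prime}}(\theta,z_{0:n})\propto\pi_{\theta}(\theta)\nu_{\theta}(z_{0})\prod_{p=1}^{n}G_{p,\theta}(z_{p})Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p}).$$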

Then for $\varphi\in\mathcal{B}_{b}(\mathsf{U})\cap\textrm{Lip}(\mathsf{U})$ :

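$$\mathbb{E}_{\pi_{h}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h^{\prime}}}[\varphi(\theta,X_{0:n})]=\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n})H_{1,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{1,\theta}(\theta,Z_{0:n})]}-\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n}^{\prime})H_{2,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{2,\theta}(\theta,Z_{0:n})]} \tag{8}$$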

where

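$$H_{1,\theta}(\theta,z_{0:n})=\prod_{p=1}^{n}\frac{g_{\theta}(x_{p},y_{p})}{G_{p,\theta}(z_{p})},\qquad H_{2,\theta}(\theta,z_{0:n})=\prod_{p=1}^{n}\frac{g_{\theta}(x_{p}^{\prime},y_{p})}{G_{p,\theta}(z_{p})}.$$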

We note that our choice of $G_{p,\theta}(z)$ ensures that $H_{1,\theta}$ and $H_{2,\theta}$ are uniformly upper-bounded by 1 and hence that the variance w.r.t. any probability is independent of $n$ .
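One standard way to build a non-independent coupling $Q_{\theta,h,h^{\prime}}$ for Euler discretizations is to drive both levels with the same Brownian increments. The sketch below assumes a scalar state and $h^{\prime}=2h$, and all names are hypothetical; it illustrates the idea and is not necessarily the exact construction of [15] or appendix B.

```python
import math
import random

def coupled_euler_unit_step(x_fine, x_coarse, drift, diffusion, h, rng):
    """Move the pair z = (x, x') over one unit of time at step sizes h and 2h.

    Both paths use the same Brownian increments, so each marginal is a plain
    Euler-Maruyama move while the two components stay close to each other.
    """
    n_coarse = round(1.0 / (2.0 * h))
    for _ in range(n_coarse):
        dw1 = rng.gauss(0.0, math.sqrt(h))
        dw2 = rng.gauss(0.0, math.sqrt(h))
        # Two fine steps of length h.
        x_fine += drift(x_fine) * h + diffusion(x_fine) * dw1
        x_fine += drift(x_fine) * h + diffusion(x_fine) * dw2
        # One coarse step of length 2h driven by the summed increment.
        x_coarse += drift(x_coarse) * 2.0 * h + diffusion(x_coarse) * (dw1 + dw2)
    return x_fine, x_coarse

# Hypothetical example: Ornstein-Uhlenbeck dynamics, both components started at 1.
rng = random.Random(0)
pairs = [
    coupled_euler_unit_step(1.0, 1.0, lambda x: -x, lambda x: 0.5, 2.0 ** -5, rng)
    for _ in range(1000)
]
mean_sq_diff = sum((xf - xc) ** 2 for xf, xc in pairs) / len(pairs)
fine_mean = sum(xf for xf, _ in pairs) / len(pairs)
```

The small value of `mean_sq_diff` relative to the marginal variance is what makes the level differences cheap to estimate.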

3.1.1 Particle MCMC

Let $(\mathsf{W},\mathcal{W})$ be a measurable space such that $\mathsf{V}\subseteq\mathsf{W}$ . Let $K:\mathsf{W}\times\mathcal{W}\rightarrow[0,1]$ be any ergodic Markov kernel of invariant measure $\eta$ such that one can consistently estimate expectations w.r.t. $\pi_{h,h^{\prime}}$ . For instance, if for every $A\in\mathcal{V}$

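$$\int_{A\times(\mathsf{W}\setminus\mathsf{V})}\eta(dw)=\int_{A}\pi_{h,h^{\prime}}(\theta,z_{0:n})\,d(\theta,z_{0:n}).$$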

Our construction allows a particle MCMC approach to be adopted, which does not take exactly the form of the displayed equation, but nonetheless allows one to infer $\pi_{h,h^{\prime}}$. We focus on one particle MCMC method for completeness, but we reiterate that one can use the analysis here for more advanced versions of the algorithm, or indeed, any MCMC of the form above.

We will now describe the particle marginal Metropolis-Hastings (PMMH) algorithm. Let $M\geq 1$ and $\theta$ be fixed, and introduce random variables $a_{0:n-1}\in\{1,\dots,M\}^{n}$ , which will denote the indices of the selected particles upon resampling at the given steps. One can run a particle filter [5] to approximate

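$$\pi_{h,h^{\prime}}(z_{0:n}\mid\theta)\propto\nu_{\theta}(z_{0})\prod_{p=1}^{n}G_{p,\theta}(z_{p})Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p})$$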

by sampling from the following joint, on the space $\{1,\dots,M\}^{n}\times\mathsf{Z}^{M(n+1)}$

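$$p(a_{0:n-1}^{1:M},z_{0:n}^{1:M}|\theta)=\Big(\prod_{i=1}^{M}\nu_{\theta}(z_{0}^{i})\Big)\prod_{p=1}^{n}\prod_{i=1}^{M}\Big(\frac{G_{p-1,\theta}(z_{p-1}^{a_{p-1}^{i}})}{\sum_{j=1}^{M}G_{p-1,\theta}(z_{p-1}^{j})}Q_{\theta,h,h^{\prime}}(z_{p-1}^{a_{p-1}^{i}},z_{p}^{i})\Big), \tag{9}$$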

where $G_{0,\theta}:=1$ . Note that better algorithms can be constructed, but we just present the most simple approach. We remark that

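$$p^{M}_{h,h^{\prime}}(y_{0:n}|\theta)=\prod_{p=1}^{n}\Big(\frac{1}{M}\sum_{j=1}^{M}G_{p,\theta}(z_{p}^{j})\Big) \tag{10}$$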

is an unbiased estimator of $p_{h,h^{\prime}}(y_{0:n}|\theta)=\int_{\mathsf{Z}^{n+1}}\nu_{\theta}(z_{0})\prod_{p=1}^{n}G_{p,\theta}(z_{p})Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p})dz_{0:n}$ ; see [5].
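The particle filter and its normalizing-constant estimate can be sketched as below. This is a minimal bootstrap filter with multinomial resampling at every step, matching the "most simple approach" mentioned above; the callables `sample_init`, `sample_transition` and `potential` are hypothetical stand-ins for $\nu_{\theta}$, $Q_{\theta,h,h^{\prime}}$ and $G_{p,\theta}$, and the estimate is returned on the log scale for numerical stability.

```python
import math
import random

def particle_filter_loglik(sample_init, sample_transition, potential, n, M, rng):
    """Return log of prod_{p=1..n} (1/M) * sum_j G_p(z_p^j) from a bootstrap filter."""
    particles = [sample_init(rng) for _ in range(M)]
    weights = [1.0] * M                      # G_0 := 1, so the first resampling is uniform
    log_lik = 0.0
    for p in range(1, n + 1):
        # Resample ancestors proportionally to G_{p-1}, then propagate through Q.
        ancestors = rng.choices(range(M), weights=weights, k=M)
        particles = [sample_transition(particles[a], rng) for a in ancestors]
        weights = [potential(p, z) for z in particles]
        log_lik += math.log(sum(weights) / M)
    return log_lik

# Toy check with a constant potential: the estimate is then exactly 0.5**n.
rng = random.Random(1)
ll = particle_filter_loglik(
    sample_init=lambda r: r.gauss(0.0, 1.0),
    sample_transition=lambda z, r: 0.9 * z + r.gauss(0.0, 1.0),
    potential=lambda p, z: 0.5,
    n=3,
    M=50,
    rng=rng,
)
```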

The PMMH algorithm works as follows. The superscripts for $(\theta,k)$ are the iteration (time) counter of the MCMC.

1. Initialize: Sample $\theta^{0}$ from the prior and then sample $(a_{0:n-1}^{1:M},z_{0:n}^{1:M})$ from $p(a_{0:n-1}^{1:M},z_{0:n}^{1:M}|\theta^{0})$ as in (9), and store $p^{M}_{h,h^{\prime}}(y_{0:n}|\theta^{0})$ as in (10). Select a path $z_{0:n}^{j}$, constructed by drawing $z_{n}^{j}$ with probability proportional to $G_{n,\theta^{0}}(z_{n}^{j})$, and setting $(z^{j^{\prime}}_{p-1}|z^{j^{\prime}}_{p})=z_{p-1}^{a_{p-1}^{j^{\prime}}}$; set $k^{0}$ as the index of the selected path. Set $i=1$.

2. Iterate: Sample $\theta^{\prime}|\theta^{i-1}$ according to a proposal with conditional density $q(\theta^{\prime}|\theta^{i-1})$, then sample from $p(a_{0:n-1}^{1:M},z_{0:n}^{1:M}|\theta^{\prime})$ as in (9). Select a path $z_{0:n}^{j}$ with probability proportional to $G_{n,\theta^{\prime}}(z_{n}^{j})$, constructed as described above; set $k^{\prime}$ as the index of the selected path. Set $\theta^{i}=\theta^{\prime}$, $k^{i}=k^{\prime}$ with probability:

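$$1\wedge\frac{p^{M}_{h,h^{\prime}}(y_{0:n}|\theta^{\prime})}{p^{M}_{h,h^{\prime}}(y_{0:n}|\theta^{i-1})}\,\frac{\pi_{\theta}(\theta^{\prime})q(\theta^{i-1}|\theta^{\prime})}{\pi_{\theta}(\theta^{i-1})q(\theta^{\prime}|\theta^{i-1})}$$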

otherwise $\theta^{i}=\theta^{i-1}$ , $k^{i}=k^{i-1}$ . Set $i=i+1$ and return to the start of 2.
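The two steps above can be sketched as a short loop. This is a simplified illustration under stated assumptions: $\theta$ is scalar, the proposal is a symmetric Gaussian random walk (so the $q$ terms cancel), the chain starts at a given `theta0` rather than a prior draw, and the selected path index $k^{i}$ is not tracked. The callable `loglik_estimator` is hypothetical and stands for any routine returning the log of the estimate (10), for example a particle filter.

```python
import math
import random

def pmmh(log_prior, loglik_estimator, theta0, n_iter, step, rng):
    """Particle marginal Metropolis-Hastings over a scalar parameter."""
    theta = theta0
    ll = loglik_estimator(theta, rng)       # stored estimate for the current state
    chain = [theta]
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)   # symmetric proposal
        lp_prop = log_prior(prop)
        if lp_prop > -math.inf:
            ll_prop = loglik_estimator(prop, rng)   # fresh estimate at the proposal
            log_ratio = ll_prop + lp_prop - ll - log_prior(theta)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta, ll = prop, ll_prop           # accept; keep the stored estimate
        chain.append(theta)
    return chain

# Toy check with an exact Gaussian log-likelihood in place of a particle filter.
def log_prior(t):
    return 0.0 if -10.0 <= t <= 10.0 else -math.inf

def exact_loglik(t, rng):
    return -0.5 * (t - 2.0) ** 2 / 0.25

chain = pmmh(log_prior, exact_loglik, 0.0, 5000, 0.5, random.Random(0))
post_mean = sum(chain[1000:]) / len(chain[1000:])
```

Reusing the stored estimate for the current state, rather than recomputing it, is what keeps the chain targeting the intended posterior when the estimator is noisy.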

We denote by $K$ the PMMH kernel and by $(\mathsf{W},\mathcal{W})$ the measurable space on which it is defined. The invariant measure is denoted $\eta$. For the analysis, we assume the MCMC algorithm is started in stationarity.

Then one estimates (8) by

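$$\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{k^{i}})H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}{\frac{1}{N}\sum_{i=1}^{N}H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}-\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{\prime k^{i}})H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}{\frac{1}{N}\sum_{i=1}^{N}H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{k^{i}})}.$$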

This estimate is consistent in the limit as $N$ grows; see [1]. To simplify the notation we replace $k^{i}$ in the superscripts by $i$ from here on.
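Given the MCMC output, the estimate of the level difference is a difference of two self-normalized importance sampling averages. The sketch below assumes that the weights take the form $H_{1,\theta}=\prod_{p}g_{\theta}(x_{p},y_{p})/G_{p,\theta}(z_{p})$ and $H_{2,\theta}=\prod_{p}g_{\theta}(x_{p}^{\prime},y_{p})/G_{p,\theta}(z_{p})$ with $G_{p,\theta}$ the maximum of the two likelihood values, which is consistent with the bound by 1 noted earlier; the function names and the input numbers are hypothetical.

```python
def importance_weights(g_fine, g_coarse):
    """Return (H_1, H_2) for one trajectory.

    g_fine[p-1]   = g(x_p, y_p)   for the fine-level component,
    g_coarse[p-1] = g(x'_p, y_p)  for the coarse-level component.
    """
    h1 = h2 = 1.0
    for gf, gc in zip(g_fine, g_coarse):
        big_g = max(gf, gc)          # G_p(z_p) = max{g(x_p, y_p), g(x'_p, y_p)}
        h1 *= gf / big_g
        h2 *= gc / big_g
    return h1, h2

def difference_estimate(phi_fine, phi_coarse, h1, h2):
    """Self-normalized estimate of E_{pi_h}[phi] - E_{pi_h'}[phi] from N samples."""
    fine = sum(p * w for p, w in zip(phi_fine, h1)) / sum(h1)
    coarse = sum(p * w for p, w in zip(phi_coarse, h2)) / sum(h2)
    return fine - coarse

# Toy numbers, not taken from any experiment.
w1, w2 = importance_weights([0.2, 0.5], [0.4, 0.25])
est = difference_estimate([1.0, 3.0], [1.0, 2.0], [0.5, 1.0], [1.0, 1.0])
```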

3.2 Multilevel Considerations

As described for MLMC in the introduction, we will approximate the expectation using the telescopic sum identity given in (1). We will establish error estimates for

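$$\sum_{l=0}^{L}\bar{E}_{l}^{N_{l}}(\varphi),\qquad\bar{E}_{l}^{N_{l}}(\varphi)=E_{l}^{N_{l}}(\varphi)-E_{l}(\varphi), \tag{11}$$

where

$$E_{l}^{N_{l}}(\varphi)=\frac{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\varphi(\theta^{i},x_{0:n}^{i})H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}-\frac{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}\varphi(\theta^{i},x_{0:n}^{\prime i})H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})} \tag{12}$$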

is a consistent estimator of $E_{l}(\varphi):=\mathbb{E}_{\pi_{h_{l}}}[\varphi(\theta,X_{0:n})]-\mathbb{E}_{\pi_{h_{l-1}}}[\varphi(\theta,X_{0:n})]$. Therefore (11) is a consistent estimator of $\mathbb{E}_{\pi_{h_{L}}}[\varphi(\theta,X_{0:n})]$ and the MSE (2) can be bounded, up to a constant, by the sum of the squared error of (11) and Bias$(L,\varphi)^{2}$, where the bias (3) is $\mathcal{O}(h_{L})$, for example, when using Euler-Maruyama.

Using $\mathbb{E}$ to denote the expectation w.r.t. the law associated to our algorithm, assuming the Markov chain is started in stationarity, our objective is therefore to investigate

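$$\mathbb{E}\Big[\Big(\sum_{l=0}^{L}\bar{E}_{l}^{N_{l}}(\varphi)\Big)^{2}\Big]=\sum_{l=0}^{L}\mathbb{E}\big[\bar{E}_{l}^{N_{l}}(\varphi)^{2}\big]$$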

so as to optimally allocate $N_{0},\dots,N_{L}$ as described in the introduction. Thus we must investigate terms such as $\mathbb{E}[\bar{E}_{l}^{N_{l}}(\varphi)^{2}]$ for a given $l$ .

3.3 Analysis

Below, $\mathcal{P}(\mathsf{W})$ is the collection of probability measures on $(\mathsf{W},\mathcal{W})$.

(A1)

For every $y\in\mathsf{Y}$ there exist $0<\underline{C}<\overline{C}<+\infty$ such that for every $x\in\mathsf{X}$ , $\theta\in\Theta$ ,

$$\underline{C}\leq g_{\theta}(x,y)\leq\overline{C}.$$

For every $y\in\mathsf{Y}$ , $g_{\theta}(x,y)$ is globally Lipschitz on $\mathsf{X}\times\Theta$ .

(A2)

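For any $0\leq k\leq n$, $q\in\{1,2\}$ there exists a $\beta>0$ such that for any $\varphi\in\mathcal{B}_{b}(\Theta\times\mathsf{X}^{k+1})\cap\textrm{Lip}(\Theta\times\mathsf{X}^{k+1})$ there exists a $C<+\infty$ such that

$$\Big(\int_{\Theta\times\mathsf{X}^{2k+2}}|\varphi(\theta,x_{0:k})-\varphi(\theta,x_{0:k}^{\prime})|^{q}\prod_{p=1}^{k}Q_{\theta,h,h^{\prime}}(z_{p-1},z_{p})\,\pi_{\theta}(\theta)\nu_{\theta}(z_{0})\,d\theta\,dz_{0:k}\Big)^{3-q}\leq C(h^{\prime})^{\beta}.$$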

(A3)

Suppose that for any $n>0$ there exist a $\xi\in(0,1)$ and $\nu\in\mathcal{P}(\mathsf{W})$ such that for each $w\in\mathsf{W}$ , $\varphi\in\mathcal{B}_{b}(\mathsf{W})\cap\textrm{Lip}(\mathsf{W})$ , $h,h^{\prime}$ :

$$\int_{\mathsf{W}}\varphi(w^{\prime})K(w,dw^{\prime})\geq\xi\int_{\mathsf{W}}\varphi(w)\nu(dw).$$

$K$ is $\eta$ -reversible, that is, $\int_{w\in B}\eta(dw)K(w,A)=\int_{w\in A}\eta(dw)K(w,B)$ for any $A,B\in\mathcal{W}$ .

We note that (A1) can be verified for some state-space models (especially if $\mathsf{Y}$ and $\Theta$ are compact) and (A3) can be verified for a PMCMC kernel if $\Theta,\mathsf{X}$ are compact; indeed, the constants would all be independent of $n$ under appropriate settings of the algorithm.

Theorem 3.1.

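Assume (A1)-(A3). Then for any $n>0$, there exists a $\beta>0$ such that for any $\varphi\in\mathcal{B}_{b}(\Theta\times\mathsf{X}^{n+1})\cap\textrm{Lip}(\Theta\times\mathsf{X}^{n+1})$ there exists a $C<+\infty$ such that

$$\begin{aligned}\mathbb{E}\Bigg[\Bigg(&\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{i})H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N}\sum_{i=1}^{N}H_{1,\theta^{i}}(\theta^{i},z_{0:n}^{i})}-\frac{\frac{1}{N}\sum_{i=1}^{N}\varphi(\theta^{i},x_{0:n}^{\prime i})H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}{\frac{1}{N}\sum_{i=1}^{N}H_{2,\theta^{i}}(\theta^{i},z_{0:n}^{i})}\\&-\Bigg(\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n})H_{1,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{1,\theta}(\theta,Z_{0:n})]}-\frac{\mathbb{E}_{\pi_{h,h^{\prime}}}[\varphi(\theta,X_{0:n}^{\prime})H_{2,\theta}(\theta,Z_{0:n})]}{\mathbb{E}_{\pi_{h,h^{\prime}}}[H_{2,\theta}(\theta,Z_{0:n})]}\Bigg)\Bigg)^{2}\Bigg]\leq\frac{C(h^{\prime})^{\beta}}{N}.\end{aligned}$$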

Proof.

The result follows by using Lemma C.3 of [14], the $C_{2}$-inequality, the boundedness of certain quantities and Proposition A.1. The proof is omitted as it is similar to the calculations in [14]. ∎

3.4 A Return to Multilevel Considerations

Returning to Section 3.2, we assume that $h_{l}=2^{-l}$ and introduce the further assumption

Assumption 3.1.

The cost to simulate $E_{l}^{N_{l}}$ in (12) is controlled by $\mathsf{C}(E_{l}^{N_{l}})\leq CN_{l}h_{l}^{-\gamma}$ , and the bias is controlled by

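$$\big|\mathbb{E}_{\pi_{h_{L}}}(\varphi(\theta,X_{0:n}))-\mathbb{E}_{\pi}(\varphi(\theta,X_{0:n}))\big|\leq Ch_{L}^{\alpha},$$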

for $\gamma,\alpha,C>0$ .

Following assumption (A(A2)), $\alpha=\beta/2$ satisfies the above, but it may be larger, e.g. for Euler-Maruyama in which $\alpha=\beta$ .

Given $\epsilon>0$, in order to ensure the MSE is $\mathcal{O}(\epsilon^{2})$, the term (3) must be $\mathcal{O}(\epsilon^{2})$. Following from Assumption (A2), it suffices to let $L\propto 2|\log(\epsilon)|/\beta$ so that $h_{L}^{\beta/2}=\mathcal{O}(\epsilon)$.

Following from Theorem 3.1,

[TABLE]

and note that the constant $C$ may depend upon the time parameter $n$ , which has been suppressed from the notation; we return to this point below.

Suppose we minimize COST $=\sum_{l=0}^{L}h_{l}^{-\gamma}N_{l}$ subject to $\sum_{l=0}^{L}\frac{h_{l}^{\beta}}{N_{l}}=\mathcal{O}(\epsilon^{2})$ as a function of $N_{0},\dots,N_{L}$. This is exactly the problem considered in [8] for $\gamma=1$ and later in [3] for $\gamma\neq 1$, and yields that

[TABLE]

where $K_{L}=\sum_{l=1}^{L}h_{l}^{(\beta-\gamma)/2}$ (see also [14, 6]). This gives a cost of $\mathcal{O}(\epsilon^{-2}K_{L}^{2})$ per time step. Hence the following corollary is immediate.

Corollary 3.1 (ML Cost).

Given (A1)-(A3) and Assumption 3.1, for any $n>0$ and any $\varphi\in\mathcal{B}_{b}(\Theta\times\mathsf{X}^{n+1})\cap\textrm{Lip}(\Theta\times\mathsf{X}^{n+1})$, $(L,\{N_{l}\}_{l=1}^{L})$ can be chosen such that the estimator $\sum_{l=1}^{L}E_{l}^{N_{l}}(\varphi)$, with $E_{l}^{N_{l}}$ given in (12), satisfies

[TABLE]

for some $C>0$ , for a total cost controlled by

[TABLE]

In contrast, for the same scenario, the computational cost of PMCMC is $\mathcal{O}(\epsilon^{-2-\gamma/\alpha})$ per time step, which is asymptotically greater than the method developed here.
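The allocation described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes $h_{l}=2^{-l}$, sets every unspecified constant $C$ to 1, includes level $l=0$ in the sums, and uses the standard choice $N_{l}\propto\epsilon^{-2}K_{L}h_{l}^{(\beta+\gamma)/2}$, which makes $\sum_{l}h_{l}^{\beta}/N_{l}=\epsilon^{2}$ and the cost $\epsilon^{-2}K_{L}^{2}$. The function names `mlmc_allocation` and `single_level_cost` are hypothetical.

```python
import math


def mlmc_allocation(eps, beta, gamma, alpha=None):
    """Levels and per-level sample sizes for a target error eps (sketch).

    Assumes h_l = 2**(-l), a variance bound h_l**beta / N_l at level l,
    a per-sample cost h_l**(-gamma) and a bias h_L**alpha, with all
    proportionality constants set to 1 (an assumption for illustration).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if alpha is None:
        alpha = beta / 2.0  # the conservative choice implied by (A2)
    # Smallest L with h_L**alpha <= eps, i.e. 2**(-L * alpha) <= eps.
    L = max(1, math.ceil(-math.log2(eps) / alpha))
    h = [2.0 ** (-l) for l in range(L + 1)]
    K_L = sum(h_l ** ((beta - gamma) / 2.0) for h_l in h)
    # N_l proportional to eps**-2 * K_L * h_l**((beta + gamma) / 2), rounded up.
    N = [math.ceil(eps ** -2 * K_L * h_l ** ((beta + gamma) / 2.0)) for h_l in h]
    variance = sum(h_l ** beta / n for h_l, n in zip(h, N))
    cost = sum(n * h_l ** (-gamma) for h_l, n in zip(h, N))
    return L, N, variance, cost


def single_level_cost(eps, gamma, alpha):
    """Cost of order eps**(-2 - gamma / alpha) quoted for plain PMCMC."""
    return eps ** (-2.0 - gamma / alpha)
```

For instance, `mlmc_allocation(0.01, beta=2.0, gamma=1.0, alpha=1.0)` selects $L=7$ and returns sample sizes that decrease with the level. Because the constants are set to 1, the returned costs indicate scaling only and are not comparable to measured run times.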

It is remarked that all of our constants depend upon the time parameter (the number of data points) and that this dependence has been ignored, owing to the technical complexity of the approach. We expect that the constants can be made time-uniform, and hence we conjecture that the results hold true uniformly in time. Then $N_{l}$ can be chosen as above, and for Euler-Maruyama ($\beta=\gamma=1$ [10]) the cost for a given $n$ will be $\mathcal{O}(n^{2}|\log(\epsilon)|^{2}\epsilon^{-2})$, with similar results for $\beta\neq 1$, according to (15). This is because one needs to take $M=\mathcal{O}(n)$ for the particle filter in PMMH [1], and the cost to obtain a single particle filter trajectory is $\mathcal{O}(n)$. A verification of this is left for future work.

4 Numerical Simulations

4.1 Ornstein-Uhlenbeck process

First, we consider the following Ornstein-Uhlenbeck process,

[TABLE]

where $\mathcal{N}(m,\tau^{2})$ denotes the Normal distribution with mean $m$ and variance $\tau^{2}$ . Further, the parameters $(\theta,\sigma)$ are unknown and are given the following priors,

[TABLE]

where $\mathcal{G}(a,b)$ denotes the Gamma distribution with shape $a$ and scale $b$ . The remaining parameters are defined as constants, $x_{0}=0$ , $\mu=0$ , $\delta=0.5$ , and $\tau^{2}=0.2$ . A data set with 100 observations is simulated with $\theta=1$ and $\sigma=0.5$ .
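A data set of this kind could be generated along the following lines. This is a sketch under stated assumptions: the model is taken to be $dX_{t}=\theta(\mu-X_{t})dt+\sigma dW_{t}$ with observations $Y_{k}\sim\mathcal{N}(X_{k\delta},\tau^{2})$, the hidden path is simulated by Euler-Maruyama at level `level` (step $2^{-\texttt{level}}$), and the function name, the default level and the seed are hypothetical choices not taken from the text.

```python
import math
import random


def simulate_ou_observations(n_obs=100, theta=1.0, sigma=0.5, mu=0.0,
                             x0=0.0, delta=0.5, tau2=0.2, level=6, seed=0):
    """Simulate noisy observations of an Euler-discretized OU process.

    Assumed model (an assumption, not copied from the text):
        dX_t = theta * (mu - X_t) dt + sigma dW_t,
        Y_k | X_{k * delta} ~ N(X_{k * delta}, tau2).
    """
    rng = random.Random(seed)
    h = 2.0 ** (-level)
    steps = round(delta / h)  # Euler steps between consecutive observations
    x, ys = x0, []
    for _ in range(n_obs):
        for _ in range(steps):
            x += theta * (mu - x) * h + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
        ys.append(x + math.sqrt(tau2) * rng.gauss(0.0, 1.0))
    return ys
```

Calling `simulate_ou_observations()` with the defaults returns 100 observations generated with the constants listed above; fixing `seed` makes the data reproducible across runs.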

4.2 Langevin SDE

Consider the following Langevin SDE,

[TABLE]

where $\pi(x)$ denotes the probability density function of a Student's $t$-distribution with $\theta$ degrees of freedom. The parameters of interest are $(\theta,\sigma)$, and these are given the priors

[TABLE]

The constants are $x_{0}=0$ and $\delta=1$ . A data set with 1,000 observations is simulated with $\theta=10$ , $\sigma=1$ , and $\tau^{2}=1$ .
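As a worked step, suppose the Langevin SDE above has the usual drift $\tfrac{1}{2}\nabla\log\pi(x)$ (this form is an assumption, since the displayed equation is not reproduced here). The gradient then follows from the unnormalized Student's $t$ density with $\theta$ degrees of freedom:

$$
\pi(x)\propto\Big(1+\frac{x^{2}}{\theta}\Big)^{-(\theta+1)/2},
\qquad
\nabla\log\pi(x)=-\frac{\theta+1}{2}\cdot\frac{2x/\theta}{1+x^{2}/\theta}=-\frac{(\theta+1)\,x}{\theta+x^{2}}.
$$

Under that assumption the drift is bounded in $x$ and behaves like $-(\theta+1)/(2x)$ for large $|x|$, so the mean reversion weakens in the tails, in contrast with the linear drift of the Ornstein-Uhlenbeck example.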

4.3 Simulation settings

The simulations proceed as follows. Let $h=2^{-l}$ be the accuracy parameter. At each level $l$, we fix the number of particles in the PMCMC kernel at $M=\mathcal{O}(n)$, and set the number of PMCMC samples used for estimation according to the multilevel analysis. Let $N_{l}^{L}$ denote the number of samples at level $l$ within a simulation that targets $L$-level error, $L=1,\dots$. The value of $N_{0}^{1}$ is determined empirically, with the variance estimated from 100 samplers. For comparison, a single-level PMCMC sampler is also considered for each $L$. Its number of samples $N^{L}$ is determined empirically by running 100 simulations simultaneously; these chains are run until the estimated error of the 100 estimates matches that of the multilevel sampler. In all situations, a fixed burn-in period of 10,000 iterations is used. This is reasonable given the fast decorrelation of the chains, as illustrated by the estimated autocorrelation of the single-level PMCMC sampler for $L=8$ in Figure 1. The autocorrelation functions look similar for all $l$ for the multilevel sampler.

4.4 Results

We consider the choice $M=\mathcal{O}(n)$. The main cost versus error results are shown in Figure 2, and the estimated cost rates are listed in Table 1. It is shown in the appendix that for Euler discretization the method satisfies the assumptions (A1)-(A3) with $\beta=2$ in (A2), since the diffusion term $b_{\theta}$ is constant in $x$ [10]. Furthermore, Assumption 3.1 holds with $\gamma=\alpha=1$. Therefore, the theoretical results of Theorem 3.1 and Corollary 3.1 predict the rate $\mathcal{O}(\epsilon^{-2})$. Standard PMMH will incur a cost of $\mathcal{O}(\epsilon^{-3})$. The numerical results confirm this.
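A rate such as those in Table 1 can be estimated by a least-squares fit on a log-log scale. The sketch below is only one plausible way to do this, since the fitting procedure is not described in the text; the function name `fit_loglog_slope` is hypothetical, and the usage example relies on synthetic numbers rather than the paper's results.

```python
import math


def fit_loglog_slope(xs, ys):
    """Least-squares slope of log(ys) regressed on log(xs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


# Synthetic illustration: MSE = eps**2 with the two predicted cost rates.
eps_grid = [2.0 ** (-k) for k in range(3, 9)]
mse = [e ** 2 for e in eps_grid]
ml_rate = fit_loglog_slope(mse, [e ** -2 for e in eps_grid])    # cost ~ MSE**-1
pmmh_rate = fit_loglog_slope(mse, [e ** -3 for e in eps_grid])  # cost ~ MSE**-1.5
```

On these exact power laws the fitted slopes recover the exponents $-1$ and $-1.5$ of cost against MSE, which correspond to the predicted $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-3})$ costs.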

Acknowledgements

AJ & YZ were supported by an AcRF tier 2 grant: R-155-000-161-112. AJ is affiliated with the Risk Management Institute, the Center for Quantitative Finance and the OR & Analytics cluster at NUS. KK & AJ acknowledge CREST, JST for additionally supporting the research. KJHL was supported by ORNL LDRD Strategic Hire grant 32112580.

Appendix A Technical Results

A Markov kernel $K$ can be viewed as a linear operator $(Kf)(w)=\int K(w,dw^{*})f(w^{*})$ for $f:\mathsf{W}\rightarrow\mathbb{R}$ on a Hilbert space

[TABLE]

with an inner product $\langle f,g\rangle=\int f(w)g(w)\eta(dw)$ and norm $\|f\|_{2}=\sqrt{\langle f,f\rangle}$ . Let $\|K\|_{2}=\sup_{f\in L^{2}_{0}(\eta),f\neq 0}\|Kf\|_{2}/\|f\|_{2}$ be the operator norm.

By the Döblin condition in (A3), we have the total variation distance bound $\|K(w,\cdot)-\eta\|_{\mathrm{TV}}=\sup_{A\in\mathcal{W}}|K(w,A)-\eta(A)|\leq 1-\xi\ (\forall w\in\mathsf{W})$ for some $\xi\in(0,1)$. Since $K$ is a Metropolis-Hastings kernel, it is $\eta$-reversible. Therefore, the total variation bound implies the $L^{2}$-spectral gap

[TABLE]

by Theorem 2.1 of [18].

For $\mu$ a finite measure on a measurable space $(\mathsf{E},\mathcal{E})$ and $\varphi\in\mathcal{B}_{b}(\mathsf{E})$

[TABLE]

Defining $v^{i}=(\theta^{i},z_{0:n}^{i})$ as the relevant variables of $w^{i}$ from the MCMC kernel, and defining

[TABLE]

we are interested in estimates of the form:

[TABLE]

Proposition A.1.

Assume (A1)-(A3). Suppose that $\{W^{i}\}_{i}$ is a Markov chain with the Markov kernel $K$, and $W^{1}\sim\eta$. Then for any $n>0$, there exists a $\beta>0$ such that for any $\varphi\in\mathcal{B}_{b}(\Theta\times\mathsf{X}^{n+1})\cap\textrm{Lip}(\Theta\times\mathsf{X}^{n+1})$ there exists a $C<+\infty$ such that

[TABLE]

where $V^{i}=(\theta^{i},Z_{0:n}^{i})$ denotes the relevant variables of $W^{i}$.

Proof.

Denote the map $w^{i}\mapsto v^{i}$ by $\psi$ . Then

[TABLE]

for $f(w)=\tilde{\varphi}_{h}\circ\psi(w)-\eta(\tilde{\varphi}_{h}\circ\psi)=\tilde{\varphi}_{h}(v)-\pi_{h,h^{\prime}}(\tilde{\varphi}_{h})$ . By simple algebra,

[TABLE]

On the other hand, by Lemma A.1,

[TABLE]

Thus, the claim follows. ∎

Lemma A.1.

Assume (A1)-(A2). Then for any $n>0$, $q\in\{1,2\}$ there exists a $\beta>0$ such that for any $\varphi\in\mathcal{B}_{b}(\Theta\times\mathsf{X}^{n+1})\cap\textrm{Lip}(\Theta\times\mathsf{X}^{n+1})$ there exists a $C<+\infty$ such that

[TABLE]

Proof.

We prove the result for $q=1$ , the case $q=2$ being almost the same. The result is proved by induction on $n$ . Set $n=1$ , then

[TABLE]

As $G_{1,\theta}(z)$ is uniformly (in $\theta,z$ ) bounded below, the denominator on the R.H.S. is uniformly lower bounded by a constant that is independent of $h,h^{\prime}$ . The numerator on the R.H.S. is

[TABLE]

Application of (A2) hence yields

[TABLE]

Assuming the result for $k-1$ , $k>1$ , by the above argument we only have to consider

[TABLE]

The R.H.S. can be upper-bounded by

[TABLE]

The first term can be treated by the induction hypothesis and the second term via (A2), which completes the proof. ∎

Appendix B Coupling Euler Approximations

Consider $(x,x^{\prime})\in\mathsf{X}^{2}$, the current positions of the two discretized diffusions. Let $h,h^{\prime}$ be the discretization levels, with $0<h<h^{\prime}$, and for simplicity set $h^{\prime}=2h$. Associated to the discretization level $h$ (resp. $h^{\prime}$), one must sample $k=\delta/h$ (resp. $k^{\prime}=\delta/h^{\prime}$) points to obtain the sampled position of the diffusion at the next observation time. Set $X(0)=X^{\prime}(0)\sim f_{\theta}(x)dx$; then one can sample the fine discretization, for $m\in\{0,\dots,k-1\}$, as

[TABLE]

where $\xi(m)\stackrel{\textrm{i.i.d.}}{\sim}\mathcal{N}(0,I_{d})$ ($I_{d}$ is the $d\times d$ identity matrix). For the coarse discretization, using the same simulated $\xi(0),\dots,\xi(k-1)$, we set for $m\in\{0,\dots,k^{\prime}-1\}$

[TABLE]
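The coupling just described can be sketched for a scalar diffusion as follows. This is a minimal illustration under assumptions: the coarse path is advanced with step $2h$ using the sum of two consecutive fine Gaussian increments, which is the standard way of reusing $\xi(0),\dots,\xi(k-1)$, and the names `coupled_euler`, `drift` and `diffusion` are hypothetical stand-ins for the model coefficients.

```python
import math
import random


def coupled_euler(x, x_prime, drift, diffusion, h, delta, rng):
    """Advance a fine (step h) and a coarse (step 2h) Euler path over time delta.

    Both paths are driven by the same standard normal draws xi(0..k-1);
    each coarse step uses the sum of two consecutive fine increments.
    """
    k = round(delta / h)
    if k % 2 != 0:
        raise ValueError("delta / h must be an even integer")
    xi = [rng.gauss(0.0, 1.0) for _ in range(k)]
    for m in range(k):  # fine path: k steps of size h
        x += drift(x) * h + diffusion(x) * math.sqrt(h) * xi[m]
    for m in range(k // 2):  # coarse path: k / 2 steps of size 2h
        dw = math.sqrt(h) * (xi[2 * m] + xi[2 * m + 1])  # N(0, 2h) increment
        x_prime += drift(x_prime) * 2.0 * h + diffusion(x_prime) * dw
    return x, x_prime
```

With zero drift and a constant diffusion coefficient the two end points coincide up to rounding, which is a quick sanity check that the Brownian increments are shared correctly between the two levels.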

Now, we want to check conditions (A2) and (A3) under Assumption 2.1 (i, ii), (A1) and the following assumption.

Assumption B.1.

$\Theta$ is a compact subset of $\mathbb{R}^{d_{\theta}}$, and $\pi_{\theta}:\Theta\rightarrow\mathbb{R}_{+}$ and $q(\theta^{*}|\theta):\Theta^{2}\rightarrow\mathbb{R}_{+}$ are continuous and strictly positive.

By assumption, $Q_{\theta,h,h^{\prime}}(z,z^{*})$ is the density of $Z^{*}=(X(k),X^{\prime}(k^{\prime}))$ given $Z=(X(0),X^{\prime}(0))$. Then, under Assumption 2.1 (i, ii), the condition (A2) is satisfied with $\beta=1$ for any $q=1,2$, since this is the $L^{q}$ bound of the Euler-Maruyama scheme (in fact, for constant diffusion coefficient $b_{\theta}$ it coincides with the Milstein method and $\beta=2$) [10].

Next, we want to check the condition (A3). The proposal density $\psi$ on $\mathsf{W}=\Theta\times\{1,\ldots,M\}^{n}\times Z^{M(n+1)}\times\{1,\ldots,M\}$ of PMMH is

[TABLE]

where $w=(\theta,a_{0:(n-1)}^{1:M},z_{1:n}^{1:M},k)$ , $w^{*}=(\theta^{*},a_{0:(n-1)}^{*,1:M},z_{1:n}^{*,1:M},l)$ , and $p(a_{0:n-1}^{*,1:M},z_{0:n}^{*,1:M}|\theta^{*})$ is defined in (9). The transition kernel $K$ is

[TABLE]

where the acceptance probability $\alpha(w,w^{*})$ is

[TABLE]

and the rejection probability $R(w)$ is

[TABLE]

By (A1) together with Assumption B.1, $C_{1}=\inf_{w\in\mathsf{W}}\alpha(w,w^{*})>0$, and

[TABLE]

for a constant $C_{2}=\min_{\theta,\theta^{*}}q(\theta^{*}|\theta)>0$ with a probability density $\psi(w^{*})$ . Thus, we have

[TABLE]

In particular, the condition (A3) is satisfied.

Bibliography

[1] Andrieu, C., Doucet, A. & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods (with discussion). J. R. Statist. Soc. Ser. B, 72, 269–342.
[2] Cappé, O., Ryden, T. & Moulines, É. (2005). Inference in Hidden Markov Models. Springer: New York.
[3] Cliffe, K. A., Giles, M. B., Scheichl, R. & Teckentrup, A. L. (2011). Multilevel Monte Carlo methods and applications to elliptic PDEs with random coefficients. Comp. Visual. Sci., 14, 3–15.
[4] Deligiannidis, G., Doucet, A. & Pitt, M. K. (2015). The correlated pseudo-marginal method. arXiv preprint arXiv:1511.0492.
[5] Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer: New York.
[6] Del Moral, P., Jasra, A., Law, K. & Zhou, Y. (2016). Multilevel Sequential Monte Carlo Samplers for Normalizing Constants. arXiv preprint arXiv:1603.01136.
[7] Doucet, A. & Johansen, A. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In Handbook of Nonlinear Filtering (eds. D. Crisan & B. Rozovsky), Oxford University Press: Oxford.
[8] Giles, M. B. (2008). Multilevel Monte Carlo path simulation. Op. Res., 56, 607–617.
