Markov chain Monte Carlo importance samplers for Bayesian models with   intractable likelihoods

Jordan Franks

arXiv:1904.05886·stat.CO·April 15, 2019

Markov chain Monte Carlo importance samplers for Bayesian models with intractable likelihoods

Jordan Franks

PDF

Open Access

TL;DR

This paper develops and analyzes Markov chain Monte Carlo importance sampling methods for Bayesian models with intractable likelihoods, providing convergence theorems, efficiency comparisons, and applications to complex stochastic models.

Contribution

It introduces a novel MCMC-IS framework with convergence results, handles pseudo-marginal and reweighted estimators, and applies these methods to diffusion processes and ABC, enhancing inference accuracy.

Findings

01

Convergence and CLT established for MCMC-IS estimators.

02

Efficiency comparisons with pseudo-marginal and ABC methods.

03

Successful application to parameter inference in stochastic differential equations.

Abstract

We consider the efficient use of an approximation within Markov chain Monte Carlo (MCMC), with subsequent importance sampling (IS) correction of the Markov chain inexact output, leading to asymptotically exact inference. We detail convergence and central limit theorems for the resulting MCMC-IS estimators. We also consider the case where the approximate Markov chain is pseudo-marginal, requiring unbiased estimators for its approximate marginal target. Convergence results with asymptotic variance formulae are shown for this case, and for the case where the IS weights based on unbiased estimators are only calculated for distinct output samples of the so-called `jump' chain, which, with a suitable reweighting, allows for improved efficiency. As the IS type weights may assume negative values, extended classes of unbiased estimators may be used for the IS type correction, such as those…

Tables1

Symbol	Page	Meaning
$(𝖸, 𝒴, 𝒫)$	1.1	statistical model
$𝒫$	1.1	family of probability distributions
$p^{(θ)}$	1.1	data (probability) distribution
$L (θ)$	1.1	likelihood of parameter $θ$
$𝒫_{ℓ}$	1.4	$ℓ$ -approximate family of data (probability) distributions
$𝒫_{\infty}$	1.4	ideal family of data (probability) distributions
$(M_{p}, G_{p})$	2.1	Feynman-Kac model with transition $M_{p}$ and potential $G_{p}$
$p^{(θ, ℓ)}$	1.4	data (probability) distribution in $𝒫_{ℓ}$
	3	$ℓ$ -smoother for Feynman-Kac model $(M_{p}^{(θ, ℓ)}, G_{p}^{(θ)})$
$p_{u}^{(θ, ℓ)}$	4	unnormalised $ℓ$ -smoother on latent states for model $(M_{p}^{(θ, ℓ)}, G_{p}^{(θ)})$
$π^{(ℓ)}$	5	joint $ℓ$ -posterior probability on parameters and latent states
$π_{m}^{(ℓ)}$	2.2	marginal $ℓ$ -posterior probability on parameters
$p_{u}$	6	unnormalised smoother for model $(M_{p}, G_{p})$
${\hat{p}}_{u}$	9	unbiased estimator for $p_{u}$ for model $(M_{p}, G_{p})$
$\hat{L} (θ)$	3	unbiased estimator for $L (θ)$
$σ_{K}^{2}$	3.3	asymptotic variance for Markov kernel $K$
$ℰ_{K}$	3.3	Dirichlet form for Markov kernel $K$
$Π^{(0)}$	3.4	stationary probability of $0$ -model Markov chain
$Π^{(\infty)}$	3.4	stationary probability of $\infty$ -model Markov chain
$w$	3.4	importance sampling weight from $Π^{(0)}$ to $Π^{(\infty)}$
$σ_{I S}^{2}$	26	asymptotic variance of MCMC-IS estimator
$Δ^{(θ, ℓ)}$	ii	delta particle filter unbiased estimator
${(p_{ℓ})}_{ℓ \in ℕ}$	4.4	probability mass function on $ℓ = 1, 2, 3, \dots$
${\tilde{Δ}}^{(θ, ℓ)} (ϕ)$	33	unbiased estimator for $p_{u}^{(θ, ℓ)} (ϕ)$
$𝒞 (m)$	4.6	total computational cost for $m$ iterations
$ℳ (κ)$	4.6	realised length of the chain with budget $κ$
$p^{(θ, 1 / ϵ)}$	5	data (probability) distribution, with ABC tolerance $ϵ$
		(or precision $1 / ϵ$ )
$L^{(1 / ϵ)}$	5	approximate Bayesian computation (ABC) likelihood
		with ABC tolerance $ϵ$ (or precision $1 / ϵ$ )

Equations216

P = {p^{(θ)}}_{θ \in T},

P = {p^{(θ)}}_{θ \in T},

L (θ) : = p^{(θ)} (y),

L (θ) : = p^{(θ)} (y),

π (θ) : = p (θ ∣ y) = \frac{L ( θ ) pr ( θ )}{p ( y )},

π (θ) : = p (θ ∣ y) = \frac{L ( θ ) pr ( θ )}{p ( y )},

p (y) : = \int L (θ) pr (θ) d θ .

p (y) : = \int L (θ) pr (θ) d θ .

P_{ℓ} = {p^{(θ, ℓ)}}_{θ \in T},

P_{ℓ} = {p^{(θ, ℓ)}}_{θ \in T},

\dots

\dots

\bar{M}_{p}^{(\theta,\infty)}(x_{0:p-1},\mathrm{d}x_{0:p}^{\prime})\mathrel{\mathop{:}}=\mathbf{1}\big{\{}x_{0:p-1}^{\prime}=x_{0:p-1}\big{\}}M_{p}^{(\theta,\infty)}(x_{0:p-1},\mathrm{d}x_{p}^{\prime}),

\bar{M}_{p}^{(\theta,\infty)}(x_{0:p-1},\mathrm{d}x_{0:p}^{\prime})\mathrel{\mathop{:}}=\mathbf{1}\big{\{}x_{0:p-1}^{\prime}=x_{0:p-1}\big{\}}M_{p}^{(\theta,\infty)}(x_{0:p-1},\mathrm{d}x_{p}^{\prime}),

G_{p}^{(θ)} (x_{0 : p}) : = g_{p}^{(θ)} (y_{p} ∣ x_{0 : p}) .

G_{p}^{(θ)} (x_{0 : p}) : = g_{p}^{(θ)} (y_{p} ∣ x_{0 : p}) .

μ (∣ ϕ ∣^{s})^{\frac{1}{s}} < \infty, where μ (ϕ) : = \int ϕ (x) μ (d x) .

μ (∣ ϕ ∣^{s})^{\frac{1}{s}} < \infty, where μ (ϕ) : = \int ϕ (x) μ (d x) .

p^{(θ, ℓ)} (d x_{0 : n}) = \frac{p _{u}^{(θ, ℓ)} ( d x _{0 : n} )}{p _{u}^{(θ, ℓ)} ( 1 )},

p^{(θ, ℓ)} (d x_{0 : n}) = \frac{p _{u}^{(θ, ℓ)} ( d x _{0 : n} )}{p _{u}^{(θ, ℓ)} ( 1 )},

p^{(\theta,\ell)}_{u}(\mathrm{d}x_{0:n})=\bigg{(}\prod_{p=0}^{n}G^{(\theta)}_{p}(x_{0:p})\bigg{)}\eta_{0}^{(\theta)}(\mathrm{d}x_{0})\prod_{p=1}^{n}M_{p}^{(\theta,\ell)}(x_{0:p-1},\mathrm{d}x_{p}).

p^{(\theta,\ell)}_{u}(\mathrm{d}x_{0:n})=\bigg{(}\prod_{p=0}^{n}G^{(\theta)}_{p}(x_{0:p})\bigg{)}\eta_{0}^{(\theta)}(\mathrm{d}x_{0})\prod_{p=1}^{n}M_{p}^{(\theta,\ell)}(x_{0:p-1},\mathrm{d}x_{p}).

π^{(ℓ)} (d θ, d x_{0 : n}) \propto pr (d θ) p_{u}^{(θ, ℓ)} (d x_{0 : n}) .

π^{(ℓ)} (d θ, d x_{0 : n}) \propto pr (d θ) p_{u}^{(θ, ℓ)} (d x_{0 : n}) .

π_{m}^{(ℓ)} (d θ) = \int_{X^{n + 1}} π^{(ℓ)} (d θ, d x_{0 : n}) = \frac{pr ( d θ ) L ^{(ℓ)} ( θ )}{\int pr ( d θ ) L ^{(ℓ)} ( θ )} .

π_{m}^{(ℓ)} (d θ) = \int_{X^{n + 1}} π^{(ℓ)} (d θ, d x_{0 : n}) = \frac{pr ( d θ ) L ^{(ℓ)} ( θ )}{\int pr ( d θ ) L ^{(ℓ)} ( θ )} .

p_{u}(\mathrm{d}x_{0:n})=\bigg{(}\prod_{p=0}^{n}G_{p}(x_{0:p})\bigg{)}\eta_{0}(\mathrm{d}x_{0})\prod_{p=1}^{n}M_{p}(x_{0:p-1},\mathrm{d}x_{p}),

p_{u}(\mathrm{d}x_{0:n})=\bigg{(}\prod_{p=0}^{n}G_{p}(x_{0:p})\bigg{)}\eta_{0}(\mathrm{d}x_{0})\prod_{p=1}^{n}M_{p}(x_{0:p-1},\mathrm{d}x_{p}),

M_{p} (X_{0 : p - 1}, d x_{p}^{'})

M_{p} (X_{0 : p - 1}, d x_{p}^{'})

G_{p} (X_{0 : p})

\mathbb{E}\Big{[}\sum_{k=1}^{N}\mathbf{1}\big{\{}A_{p}^{(k)}=i\big{\}}\Big{]}=N\frac{V_{p}^{(i)}}{V_{p}^{*}},

\mathbb{E}\Big{[}\sum_{k=1}^{N}\mathbf{1}\big{\{}A_{p}^{(k)}=i\big{\}}\Big{]}=N\frac{V_{p}^{(i)}}{V_{p}^{*}},

\overset{p}{^}_{u} (ϕ) : = i = 1 \sum N V_{n}^{(i)} ϕ (X^{(i)}),

\overset{p}{^}_{u} (ϕ) : = i = 1 \sum N V_{n}^{(i)} ϕ (X^{(i)}),

\mathbb{E}\big{[}\hat{p}_{u}(\phi)\big{]}=p_{u}(\phi).

\mathbb{E}\big{[}\hat{p}_{u}(\phi)\big{]}=p_{u}(\phi).

\bigg{(}\prod_{p=0}^{n}\tilde{G}_{p}(x_{0:p})\bigg{)}\tilde{\eta}_{0}(\mathrm{d}x_{0})\prod_{p=1}^{n}\tilde{M}_{p}(x_{0:p-1},\mathrm{d}x_{p})=\bigg{(}\prod_{p=0}^{n}G_{p}(x_{0:p})\bigg{)}\eta_{0}(\mathrm{d}x_{0})\prod_{p=1}^{n}M_{p}(x_{0:p-1},\mathrm{d}x_{p}).

\bigg{(}\prod_{p=0}^{n}\tilde{G}_{p}(x_{0:p})\bigg{)}\tilde{\eta}_{0}(\mathrm{d}x_{0})\prod_{p=1}^{n}\tilde{M}_{p}(x_{0:p-1},\mathrm{d}x_{p})=\bigg{(}\prod_{p=0}^{n}G_{p}(x_{0:p})\bigg{)}\eta_{0}(\mathrm{d}x_{0})\prod_{p=1}^{n}M_{p}(x_{0:p-1},\mathrm{d}x_{p}).

\tilde{G}_{p} (x_{0 : p}) : = \frac{G _{p} ( x _{0 : p} ) M _{p} ( x _{0 : p - 1} , d x _{p} )}{M ~ _{p} ( x _{0 : p - 1} , d x _{p} )} (x_{0 : p})

\tilde{G}_{p} (x_{0 : p}) : = \frac{G _{p} ( x _{0 : p} ) M _{p} ( x _{0 : p - 1} , d x _{p} )}{M ~ _{p} ( x _{0 : p - 1} , d x _{p} )} (x_{0 : p})

\frac{\sum _{k = 1}^{m} p ^ _{u, k} ( ϕ )}{\sum _{k = 1}^{m} p ^ _{u, k} ( 1 )} m \to \infty p^{(θ)} (ϕ), almost surely .

\frac{\sum _{k = 1}^{m} p ^ _{u, k} ( ϕ )}{\sum _{k = 1}^{m} p ^ _{u, k} ( 1 )} m \to \infty p^{(θ)} (ϕ), almost surely .

\min\bigg{\{}1,\frac{\mathrm{pr}(\Theta^{\prime})\big{(}\sum_{i=1}^{N}V^{\prime(i)}\big{)}q(\Theta_{k-1}|\Theta^{\prime})}{\mathrm{pr}(\Theta_{k-1})\big{(}\sum_{i=1}^{N}V_{k-1}^{(i)}\big{)}q(\Theta^{\prime}|\Theta_{k-1})}\bigg{\}}.

\min\bigg{\{}1,\frac{\mathrm{pr}(\Theta^{\prime})\big{(}\sum_{i=1}^{N}V^{\prime(i)}\big{)}q(\Theta_{k-1}|\Theta^{\prime})}{\mathrm{pr}(\Theta_{k-1})\big{(}\sum_{i=1}^{N}V_{k-1}^{(i)}\big{)}q(\Theta^{\prime}|\Theta_{k-1})}\bigg{\}}.

E_{m}^{P M} (f) : = \frac{1}{m} k = 1 \sum m i = 1 \sum N \frac{V _{k}^{(i)} f ( Θ _{k} , X _{k}^{(i)} )}{\sum _{j = 1}^{N} V _{k}^{(j)}} m \to \infty π^{(\infty)} (f), a.s.

E_{m}^{P M} (f) : = \frac{1}{m} k = 1 \sum m i = 1 \sum N \frac{V _{k}^{(i)} f ( Θ _{k} , X _{k}^{(i)} )}{\sum _{j = 1}^{N} V _{k}^{(j)}} m \to \infty π^{(\infty)} (f), a.s.

\hat{L}^{(\infty)} (θ) : = \overset{p}{^}_{u}^{(θ, \infty)} (1) = i = 1 \sum N V^{(i)}

\hat{L}^{(\infty)} (θ) : = \overset{p}{^}_{u}^{(θ, \infty)} (1) = i = 1 \sum N V^{(i)}

\min\bigg{\{}1,\frac{\mathrm{pr}(\Theta^{\prime})\big{(}\hat{L}^{(0)}(\Theta^{\prime})+\epsilon\big{)}q(\Theta_{k-1}|\Theta^{\prime})}{\mathrm{pr}(\Theta_{k-1})\big{(}\hat{L}^{(0)}(\Theta_{k-1})+\epsilon\big{)}q(\Theta^{\prime}|\Theta_{k-1})}\bigg{\}}.

\min\bigg{\{}1,\frac{\mathrm{pr}(\Theta^{\prime})\big{(}\hat{L}^{(0)}(\Theta^{\prime})+\epsilon\big{)}q(\Theta_{k-1}|\Theta^{\prime})}{\mathrm{pr}(\Theta_{k-1})\big{(}\hat{L}^{(0)}(\Theta_{k-1})+\epsilon\big{)}q(\Theta^{\prime}|\Theta_{k-1})}\bigg{\}}.

\min\bigg{\{}1,\frac{\big{(}\sum_{i=1}^{N}V^{\prime(i)}\big{)}/\big{(}\hat{L}^{(0)}(\Theta^{\prime})+\epsilon\big{)}}{\big{(}\sum_{i=1}^{N}V_{k-1}^{(i)}\big{)}/\big{(}\hat{L}^{(0)}(\Theta_{k-1})+\epsilon\big{)}}\bigg{\}}.

\min\bigg{\{}1,\frac{\big{(}\sum_{i=1}^{N}V^{\prime(i)}\big{)}/\big{(}\hat{L}^{(0)}(\Theta^{\prime})+\epsilon\big{)}}{\big{(}\sum_{i=1}^{N}V_{k-1}^{(i)}\big{)}/\big{(}\hat{L}^{(0)}(\Theta_{k-1})+\epsilon\big{)}}\bigg{\}}.

\hat{L}^{(\infty)}(\theta)>0\implies\big{(}\hat{L}^{(0)}(\theta)+\epsilon\big{)}>0

\hat{L}^{(\infty)}(\theta)>0\implies\big{(}\hat{L}^{(0)}(\theta)+\epsilon\big{)}>0

E_{m}^{I S} (f) : = \frac{\sum _{k = 1}^{m} ξ _{k} ( f ^{(Θ_{k})} )}{\sum _{k = 1}^{m} ξ _{k} ( 1 )} .

E_{m}^{I S} (f) : = \frac{\sum _{k = 1}^{m} ξ _{k} ( f ^{(Θ_{k})} )}{\sum _{k = 1}^{m} ξ _{k} ( 1 )} .

E_{m}^{I S} (f) m \to \infty π^{(\infty)} (f), almost surely .

E_{m}^{I S} (f) m \to \infty π^{(\infty)} (f), almost surely .

π_{m}^{(0)} (θ) \propto pr (θ) L^{(0)} (θ) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMarkov Chains and Monte Carlo Methods · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference

Full text

\nobibliography

\incgraph

[documentpaper, overlay=] [width=height=]jyu_cover_a4_1.pdf

\incgraph

[documentpaper, overlay=] [width=height=]jyu_cover_a4_2.pdf

Abstract

Markov chain Monte Carlo (MCMC) is an approach to parameter inference in Bayesian models that is based on computing ergodic averages formed from a Markov chain targeting the Bayesian posterior probability. We consider the efficient use of an approximation within the Markov chain, with subsequent importance sampling (IS) correction of the Markov chain inexact output, leading to asymptotically exact inference. We detail convergence and central limit theorems for the resulting MCMC-IS estimators. We also consider the case where the approximate Markov chain is pseudo-marginal, requiring unbiased estimators for its approximate marginal target. Convergence results with asymptotic variance formulae are shown for this case, and for the case where the IS weights based on unbiased estimators are only calculated for distinct output samples of the so-called ‘jump’ chain, which, with a suitable reweighting, allows for improved efficiency. As the IS type weights may assume negative values, extended classes of unbiased estimators may be used for the IS type correction, such as those obtained from randomised multilevel Monte Carlo. Using Euler approximations and coupling of particle filters, we apply the resulting estimator using randomised weights to the problem of parameter inference for partially observed Itô diffusions. Convergence of the estimator is verified to hold under regularity assumptions which do not require that the diffusion can be simulated exactly. In the context of approximate Bayesian computation (ABC), we suggest an adaptive MCMC approach to deal with the selection of a suitably large tolerance, with IS correction possible to finer tolerance, and with provided approximate confidence intervals. A prominent question is the efficiency of MCMC-IS compared to standard direct MCMC, such as pseudo-marginal, delayed acceptance, and ABC-MCMC. We provide a comparison criterion which generalises the covariance ordering to the IS setting. We give an asymptotic variance bound relating MCMC-IS with the latter chains, as long as the ratio of the true likelihood to the approximate likelihood is bounded. We also perform various experiments in the state space model and ABC context, which confirm the validity and competitiveness of the suggested MCMC-IS estimators in practice.

1 Introduction
1.1 Likelihoods
1.2 Bayesian inference
1.3 Challenges for inference
1.4 Approximate families
1.5 Overview
2 Bayesian inference for state space models on path space
2.1 Discretely-observed continuous-time path-dependent process
2.2 Model probabilities
2.3 Particle filter
2.4 Particle marginal Metropolis-Hastings
3 Accelerations based on an approximation
3.1 Delayed acceptance and importance sampling
3.2 The question of relative efficiency
3.3 Peskun and covariance orderings of asymptotic variances
3.4 Peskun type ordering for importance sampling correction
3.5 Comparison results
4 Bayesian inference for state space models with diffusion dynamics
4.1 Euler approximations
4.2 Multilevel Monte Carlo
4.3 Coupling of Feynman-Kac models
4.4 Debiasing techniques
4.5 Joint inference using importance sampling type correction
4.6 Computational efficiency and allocations
5 Inference via approximate Bayesian computation
5.1 Choosing the tolerance in ABC-MCMC
5.2 Approximate confidence intervals
5.3 Adaptive ABC-MCMC with post-correction
5.4 Convergence of the tolerance adaptive ABC-MCMC
6 Discussion and directions for future work
7 Summary of articles
7.1 sec
7.2 sec
7.3 sec
7.4 sec

List of symbols

Foreword

The modern reliance on probability theory to model the universe and various aspects of life reveals on the one hand our tremendous lack of knowledge and ability to understand and hence predict the workings of the universe with Newtonian precision. On the other hand, the success of probability theory reveals the hidden order of the universe, as well as the significant deductive reasoning capacities of humankind, where from the disorder of incomplete knowledge arises the order of probabilistic laws. Statistics allows us to ascertain to what extent our deductive reasoning is justified by real observation. Statistics acts as the intermediary, allowing dialogue to proceed between our perceived knowledge of the (mechanistic and probabilistic) laws of the universe and of the universe as she presents herself to us in actual fact.

Of utmost importance for the development of statistics has been the increasing computational ability in the computer age [cf. 30]. As the speed of computers increases, so does the potential complexity of problems increase which statistical methods can handle with precision. Considerable interest therefore lies in the development of computational methods which are efficient and able to perform the demanding computational tasks of modern statistics. It is the scientific and humanistic hope for this thesis, that the work will serve to the advancement of human knowledge, and that it will be solely useful to the commendable pursuits of humankind.

As the fields of probability and statistics are intellectually challenging, any small progress in this field is dependent upon a stable, friendly, and stimulating working and living environment. First of all, I would like to thank Dr. Matti Vihola, for being a wonderful adviser, scientist, and person. In the beginning, I knew very little about computational statistics and Monte Carlo methods, but due greatly to Matti’s tremendous help and patience, my knowledge and skills in mathematics and statistics has grown considerably. This thesis would not have been possible without his help. The enclosed introductory text has also benefited greatly from his insightful remarks. As his first sole doctoral student, I have one of the early claims to be able to call him mathematico-statistical father.

As for a stable, friendly, and stimulating working and living environment, I would like to thank the University of Jyväskylä and its employees, for being able to pursue my doctoral studies here. The last three years have been enjoyable as a place to work, study, and live. Financial support is gratefully acknowledged from the Academy of Finland (‘Exact approximate Monte Carlo methods for complex Bayesian inference,’ grants 284513 and 312605, PI: Dr. Matti Vihola).

Sincere thanks to the reviewers, Prof. Marko Laine (Finnish Meteorological Institute) and Prof. Krzysztof Łatuszyński (University of Warwick).

There are many other individual persons whom I should thank for being a help these last few years. As I drew up an account of all the people whom I would like to thank, it became ever-expanding, touching every aspect and time of my life. I simply could not do proper justice to those who have helped me, and I would run the danger of leaving somebody unintentionally out. I therefore would simply like to thank the many precious people who have been a positive impact on me, without going into all the details here. They and there deeds are simply too many to be entrusted to these few pages.

I think the saying is true, and hope it is true: when someone has stayed somewhere long enough, the place becomes forever a part of the person. I wish to thank the many people in Jyväskylä whom I have enjoyed getting to know during these last few years. Language has not always been an insurmountable barrier. I will miss you, and I will miss Jyväskylä—the snow, the sauna, the summer, the lakes, the festivals, the food, the people—all of which make Finland a special place to live.

I mention regards to the researchers everywhere with whom I have had the privilege to meet. Statistics, like other scientific disciplines such as pure mathematics, involves many devotees interested in a common subject with undesirable distractions kept to a minimum. When immersed in a scientific subject, where validity is judged by logic and observation rather than might or necessity, when one is able to escape the day-to-day absorption of the human condition, then one is able to view the world from a new perspective. One sees like the astronaut, for whom, after seeing the Earth as the single terrestrial mass, life will never be the same.

Jordan Franks

Jyväskylä, Finland

April 4, 2019

List of included articles

The thesis consists of an introductory text (Sections 1-7) and the following articles listed in chronological order of preprint appearance.

1. Introduction

Bayesian inference often requires the use of computational simulation methods known as Monte Carlo methods. Monte Carlo methods use probabilistic sampling mechanisms and ensemble averages to estimate expectations, such as those taken with respect to the posterior probability of a Bayesian model. Therefore, in practice on a computer, Monte Carlo methods can be computationally intensive.

A further inferential challenge arises when the likelihood function of the Bayesian model is intractable. In some important settings, it is possible to obtain an unbiased estimator for the likelihood. One such setting is the state space model, where sequential Monte Carlo supplies the unbiased estimator. In settings where unbiased estimators are not possible, approximate Bayesian computation (ABC) may be used, assuming forward generation of the model is possible and a choice of tolerance size has been made. Though only an approximation to the original Bayesian model, the ABC model comes equipped with a straightforward unbiased estimator for its ABC likelihood.

In these two example settings, a Markov chain can be run, allowing for Markov dependence in the samples, as well as the use of the unbiased estimators for the (ABC) likelihood as part of a pseudo-marginal algorithm. As a result, the samples of the Markov chain are drawn asymptotically from the (ABC) posterior, and inference is based on averaging the samples obtained. This computational approach to Bayesian inference is known as Markov chain Monte Carlo (MCMC).

This thesis is concerned with a slightly different approach, namely, where the Markov chain targets an approximate marginal of the (ABC) posterior. The subsequent importance sampling (IS) type correction is performed by a reweighting of the inexact sample output, using the unbiased estimators, which yields asymptotically exact inference for the (ABC) posterior. The use of an approximation for the Markov chain target can be computationally efficient, as can be the parallel calculation of the IS weights on a parallel computer. Some of the resulting MCMC-IS estimators are well-known, but in practice have been used only rarely, in comparison to direct MCMC. In addition, the MCMC-IS approach is shown to offer additional flexibility compared to direct MCMC.

The rest of this Section 1 is laid out as follows. We present some important notions, such as that of a statistical model, likelihood function, and Bayesian model. We briefly describe the general goal of (likelihood-based) parameter inference in statistics, as well as some of the challenges of computation which the thesis seeks to address, specifically inference aided by use of an approximation. Section 1.5 concludes with an outline of the remainder of the text.

1.1. Likelihoods

A statistical model $(\mathsf{Y},\mathcal{Y},\mathscr{P})$ is composed of an observational space $\mathsf{Y}$ , together with its $\sigma$ -algebra of subsets $\mathcal{Y}$ , and a set $\mathscr{P}$ of probability distributions on $\mathsf{Y}$ [cf. 34]. We assume the family $\mathscr{P}$ is parametrised by a vector of model parameters $\theta\in\mathsf{T}$ , with $\mathsf{T}\subset\mathbb{R}^{n_{\theta}}$ for some $n_{\theta}\geq 1$ . That is,

[TABLE]

where $p^{(\theta)}(\mathrm{d}y)$ is a probability on $\mathsf{Y}$ , sometimes called the data distribution. The probability $p^{(\theta)}$ corresponds to a modeling of the dependency relationship of the observation $y$ , considered as a random variable, on the model parameter $\theta$ .

We assume for simplicity in this introduction that $p^{(\theta)}(\mathrm{d}y)$ has a density, also denoted $p^{(\theta)}(y)$ , with respect to a $\sigma$ -finite reference measure on $\mathsf{Y}$ . Fixing the observation $y\in\mathsf{Y}$ , we define the function

[TABLE]

which is known as the likelihood. One type of likelihood-based inference for $\theta$ is to answer which values of $\theta$ maximise $L(\theta)$ . This method of inference is known as maximum likelihood estimation (MLE) in a statistical model with observation [cf. 14]. In other words, MLE seeks to answer, which probability distribution on $\mathsf{Y}$ in $\mathscr{P}$ would most readily give rise to the observation.

1.2. Bayesian inference

In practice, MLE is highly dependent on the candidate set $\mathscr{P}$ of probabilities to consider. The set $\mathscr{P}$ could be parametrised by arbitrarily high dimensions of parameters, and is the result of the statistician’s modeling of the dependence of the observation $y$ on the model parameter $\theta$ . Going further, in light of this arbitrary construction of the set $\mathscr{P}$ , the statistician is arguably111The frequentist approach differs from the Bayesian approach considered here [cf. 30]. not out of bounds to specify which $\theta$ values are to be considered more probable and with more weight, based on prior knowledge or hypothesis.

This specification, for a statistical model with known observation, leads to the Bayesian model [cf. 33]. The Bayesian model consists of an assignment of a prior probability $\mathrm{pr}(\mathrm{d}\theta)$ to the model parameters, with density also denoted $\mathrm{pr}(\theta)$ . Inference for the Bayesian model then consists of quantification of the posterior probability

[TABLE]

where the last equality, giving the posterior in terms of the likelihood and prior, is the practically useful formula of Bayes, and $p(y)$ is the model evidence, defined by

[TABLE]

1.3. Challenges for inference

In statistical models of practical interest, the likelihood $L(\theta)$ is often intractable, meaning that it can not be evaluated pointwise. However, in many settings which we consider, we will see that $L(\theta)$ can be estimated unbiasedly, meaning it is possible to generate a random variable $\hat{L}_{\theta}$ such that $\mathbb{E}[\hat{L}_{\theta}]=L(\theta)$ . However, construction of a reliable unbiased estimator may be neither directly available, nor efficient.

The posterior $\pi(\theta)$ of the Bayesian model is in general intractable, and can not even be estimated unbiasedly. This is often the case even if the likelihood is tractable, because of the normalisation by the model evidence in (1), which is usually computationally intensive to calculate. In case of intractable likelihood in the Bayesian model, posterior inference is even more of a challenge, and one must usually rely on ergodic averages from Markov chain Monte Carlo (MCMC). Such averages are generally asymptotically exact (i.e. consistent) as the number of iterations of the MCMC algorithm increases, in the sense of a law of large numbers. However, MCMC can be computationally expensive to run. It can take hours, days, weeks, or longer, in order to ensure a ‘reliable’ MCMC estimate, where the level of reliability can be theoretically difficult to justify.

1.4. Approximate families

We will see that the use of approximations can help facilitate tractable, efficient and user-friendly inference. Let $\mathscr{P}_{\infty}$ denote a set of (ideal) model probability distributions on $\mathsf{Y}$ . In many cases in practice, it may be desirable to work with a surrogate family of probability distributions $\mathscr{P}_{0}$ . Going further, it may be desirable to work with a family

[TABLE]

of data (probability) distributions, with $\ell\in[0,\infty]$ used to indicate families of increasingly ‘better’ approximations. For example, inference using $\mathscr{P}_{\infty}$ may be too difficult to achieve or too costly, in which case using an approximate family $\mathscr{P}_{\ell}$ may be possible instead.

It is conceivably possible to incorporate $\mathscr{P}_{\ell}$ in a Bayesian inference method, which may lessen the computational cost of the algorithm, while in the end performing inference for $\mathscr{P}_{\infty}$ . The aim of this thesis is to show general strategies in different settings where this is possible.

1.5. Overview

We now outline the remainder of this text222As for the intended audience, in order to keep the text of moderate size we must suppose some notions from analysis [cf. 80], probability [cf. 34], simulation [cf. 61], and statistics [cf. 33]. We try to strike a balance, to make the text of interest both to those knowledgeable and less knowledgeable in the subject matter of the thesis.. The text seeks to serve as an introduction and summary for the thesis papers listed on page List of included articles. Methodological aspects are stressed for this introduction to the articles, as are some of the supporting theoretical results. Most of the details are left to the articles. For this introductory text, we do not give algorithms and results in full generality and for all cases. Rather we focus on a few important cases. For example, we consider only a few specific Markov chains, rather than general Harris ergodic chains for the IS correction, and we focus on the use of unbiased estimators from particle filters333also known as sequential Monte Carlo in state space models, rather than from general importance sampling distributions in latent variable models. Some more generality is provided in the original articles listed on page List of included articles.

We begin in Section 2 with a specific problem of intractable likelihoods for statistical models, namely, that of the state space model, and review how interacting particle systems known as particle filters [44, 94] can lead to unbiased estimators of the $\infty$ -likelihood (the likelihood corresponding to the family $\mathscr{P}_{\infty}$ ), as long as the dynamics of the state space model can be simulated. We also detail an MCMC known as the particle marginal Metropolis-Hastings (PMMH) [1] (see also [5, 9, 59, 69]), which allows for $\infty$ -inference for the corresponding Bayesian model posterior, when unbiased $\infty$ -likelihood estimators are available.

In Section 3, as in [1] we consider two different MCMCs, which are intended for acceleration of PMMH, and which are based on use of an approximate family $\mathscr{P}_{0}$ and unbiased estimators for the $\infty$ -likelihood. These are the delayed acceptance (DA) MCMC [cf. 59, 8, 16, 17, 41, 61] and the MCMC-IS [cf. 24, 37, 38, 48, 73, 1], both of which allow for unbiased estimators of the [math]-likelihood and $\infty$ -likelihood, which can be useful when deterministic approximations are not available444The references [41] for DA and [1] for MCMC-IS are most relevant in the unbiased estimator context for intractable likelihoods.. Based on an extension of the covariance ordering [67] to the IS setting, with differing reversible stationary probabilities and with unbiased estimators, we seek to compare the algorithms in terms of statistical efficiency, as in [2].

Section 4 is concerned with a discretely and partially observed Itô diffusion, where unbiased $\infty$ -likelihood estimates can not be directly obtained by the particle filter, because the dynamics of the diffusion can not be simulated. Instead, approximate families $\mathscr{P}_{\ell}$ based on Euler approximation [cf. 56] are used, along with multilevel [49, 36], randomisation [64, 77] and particle filter coupling [53] techniques, leading to an unbiased estimator for the $\infty$ -likelihood and to the Bayesian $\infty$ -posterior by using an MCMC-IS with randomised weights, as in [3].

Section 5 is concerned with Bayesian models with intractable likelihoods, where an unbiased estimator of the likelihood is not readily available, but where it is at least possible to generate artificial observations from $p^{(\theta)}(\mathrm{d}y^{\prime})$ . The approach of approximate Bayesian computation [cf. 92] is to use families $\mathscr{P}_{1/\epsilon}$ of approximations to $\mathscr{P}_{\infty}$ , where $\mathscr{P}_{1/\epsilon}$ is indexed by $\epsilon>0$ , the ‘tolerance,’ which is difficult to choose. We detail an approach based on an adaptive MCMC, as well as MCMC-IS [98], with approximate confidence intervals for post-correction to finer tolerance, as in [4].

We close with a brief discussion of ideas for future work in Section 6 and provide expanded individual summaries for the thesis papers in Section 7.

2. Bayesian inference for state space models on path space

We introduce a well-known class of models based on latent variables on a state space $(\mathsf{X},\mathcal{X})$ and conditionally independent observations on $(\mathsf{Y},\mathcal{Y})$ which is sufficiently general and motivates a main application area of Articles [1], [2] and [3] based on unbiased estimators and approximate families $\mathscr{P}_{\ell}$ .

2.1. Discretely-observed continuous-time path-dependent process

To motivate this general class of models, we consider an example of continuous-time latent process. Suppose there is a process $(X_{t}^{\prime})_{t\geq 0}$ of latent or hidden states $X_{t}^{\prime}\in\mathsf{X}$ , where $X_{t}^{\prime}$ depends on $(X_{s}^{\prime})_{0\leq s<t}$ and the model parameter $\theta$ . Also, suppose $(Y_{t}^{\prime})_{t\geq 0}$ is another process (of observations), where $Y_{t}^{\prime}$ depends on $(X_{s}^{\prime})_{0\leq s\leq t}$ and $\theta$ . We make the realistic assumption555since the continuum can not easily be recorded by electronic or other physical means that only finitely many observations $\{Y_{t_{p}}^{\prime}\}_{p=0}^{n}$ are gathered at observation times $\{t_{p}\}_{p=0}^{n}$ .

Let us set $Y_{p}\mathrel{\mathop{:}}=Y_{t_{p}}^{\prime}$ and $X_{p}\mathrel{\mathop{:}}=X_{t_{p}}^{\prime}$ . Let us define $X_{0:p}\mathrel{\mathop{:}}=(X_{0},\ldots,X_{p})$ , and for fixed parameter value $\theta$ , consider the following dependency structure involving conditionally independent observations:

[TABLE]

Here, the arrows denote a dependency relationship, described in the following, where the initial state $X_{0}\sim\eta_{0}^{(\theta)}$ is drawn from an initial distribution $\eta_{0}^{(\theta)}$ . The dynamics between states (on path space) $X_{0:p-1}$ and $X_{0:p}$ is defined by a Markov probability kernel $\bar{M}_{p}^{(\theta,\infty)}$ (on path space), where

[TABLE]

where $M_{p}^{(\theta,\infty)}$ is a Markov probability kernel from $\mathsf{X}^{p}$ to $\mathsf{X}$ induced by the dynamics of the path-dependent continuous-time process. The observations $Y_{p}$ are obtained via $Y_{p}\sim g_{p}^{(\theta)}(\,\cdot\,|X_{0:p})$ , where $g_{p}^{(\theta)}$ is the observational density.

Let us set as shorthand $M_{0}^{(\theta,\infty)}(x_{0:-1},\mathrm{d}x_{0})\mathrel{\mathop{:}}=\eta_{0}(\mathrm{d}x_{0})$ and

[TABLE]

The model described above in terms of the pair $(M_{p}^{(\theta,\infty)},G_{p}^{(\theta)})_{p=0}^{n}$ is known as a path-dependent state space model (SSM)666As SSM is also known as a hidden Markov model [cf. 14], especially in the engineering disciplines., or, more succinctly, as a Feynman-Kac model [cf. 18].

Simulation methods for the $\infty$ -Feynman-Kac model are impossible if the dynamics $M_{p}^{(\theta,\ell)}$ can not be simulated exactly. Besides some (important) exceptions, this is in general the case for continuous-time latent processes [cf. 19, Sect. 1.3]. However, we will see when we consider Itô diffusions in Section 4, that often one can obtain Euler type approximations of the original process, with precision denoted ‘ $\ell$ ’, leading to approximate dynamics $M_{p}^{(\theta,\ell)}$ between observation times [cf. 19, 56]. Using the same observational densities as for the exact model, we obtain a Feynman-Kac model $(M_{p}^{(\theta,\ell)},G_{p}^{(\theta)})_{p=0}^{n}$ derived from the Euler type approximation of the dynamics.

2.2. Model probabilities

We now describe some of the probabilities associated to a Feynman-Kac model $(M_{p}^{(\theta,\ell)},G_{p}^{(\theta)})_{p=0}^{n}$ .

First, we define a bit of standard notation from analysis. If $\mu$ is a probability measure and $s\geq 1$ , we denote by $L^{s}(\mu)$ the Banach space of real-valued functions $\phi$ , modulo equivalence in norm, with finite norm

[TABLE]

Consider now the conditional $\ell$ -model probability on the latents, or $\ell$ -smoother,

[TABLE]

where777In the notation $p_{u}^{(\theta,\ell)}(1)$ , we view $1$ as the function $x_{0:n}\mapsto 1$ , and the integral $p_{u}^{(\theta,\ell)}(\phi)=\int\phi(x_{0:n})p_{u}^{(\theta,\ell)}(\mathrm{d}x_{0:n})$ as in (2) for $\phi:\mathsf{X}^{n+1}\to\mathbb{R}$ .

[TABLE]

Then $p^{(\theta,\ell)}$ represents the probability to observe the latent states given the observations according to the Feynman-Kac model $(M_{p}^{(\theta,\ell)},G_{p}^{(\theta)})_{p=0}^{n}$ . In terms of a statistical model888really on $(\mathsf{X}\times\mathsf{Y},\mathcal{X}\otimes\mathcal{Y})$ , but we view $y_{0:n}\in\mathsf{Y}^{n+1}$ as fixed and therefore disregarded in the notation on $\mathsf{X}^{n+1}$ , we have $\mathscr{P}_{\ell}=\{p^{(\theta,\ell)}\}_{\theta\in\mathsf{T}}$ with $p^{(\theta,\ell)}(\mathrm{d}x_{0:n})$ defined in (3), for the Feynman-Kac model.

The joint $\ell$ -posterior probability for the Bayesian model over model parameters and latent states is then

[TABLE]

Writing the marginal $\ell$ -likelihood as $L^{(\ell)}(\theta)\mathrel{\mathop{:}}=p_{u}^{(\theta,\ell)}(1)$ and considering the marginal $\ell$ -posterior on $\theta$ , we obtain a more familiar formula to (1) given in the beginning in Section 1.2, namely,

[TABLE]

The main topic of this thesis is incorporation of $\ell$ -approximation within a $\infty$ -inference method, to obtain efficient and user-friendly inference with respect to $\pi^{(\infty)}$ and $\pi_{m}^{(\infty)}$ .

2.3. Particle filter

Ignoring the $\theta$ and $\ell$ labels, we have seen that a Feynman-Kac model (with time horizon $n$ ) is defined through a pair $(M_{p},G_{p})_{p=0}^{n}$ , where,

(i)

$M_{p}(x_{0:p-1},\mathrm{d}x_{p})$ is a Markov ‘transition’ kernel for $p=1,\ldots,n$ , and $M_{0}(x_{0:-1},\mathrm{d}x_{0})\mathrel{\mathop{:}}=\eta_{0}(\mathrm{d}x_{0})$ is a probability measure, and 2. (ii)

$G_{p}(x_{0:p})$ is a nonnegative ‘potential’ function for $0\leq p\leq n$ .

The particle filter (PF) (Algorithm 1) was popularised in [e.g. 94, 44], and allows for unbiased estimation [cf. 18, 28] of

[TABLE]

for (traditional) SSMs that are not path-dependent, that is,

[TABLE]

However, straightforward generalisation also allows for unbiased estimators in the path-dependent setting of Feynman-Kac models, at least when the dynamics can be simulated [cf. 18]. In addition, as is well-known, the general resampling scheme in PF (Algorithm 1) for ancestor random variables $\{A_{p}^{(i)}\}_{i=1}^{N}$ do lead to unbiased estimators, since the equality

[TABLE]

is assumed satisfied for all $p\in\{0{:}n\}$ in PF (Algorithm 1) [cf. 1, Prop. $20$ ]. Such resampling schemes include the popular multinomial, stratified, residual, and systematic resampling [cf. 14, 25, 28].

The unbiased estimator, from the output $(V_{n}^{(i)},\mathbf{X}^{(i)})_{i=1}^{n}$ of PF (Algorithm 1) run for Feynman-Kac model $(M_{p},G_{p})_{p=0}^{n}$ , is obtained by setting

[TABLE]

for $\phi\in L^{1}(p_{u})$ , which satisfies

[TABLE]

An important point is that particle approximations $\hat{p}_{u}(\mathrm{d}x_{0:n})$ , for $p_{u}(\mathrm{d}x_{0:n})$ through the PF for the model $(M_{p},G_{p})_{p=0}^{n}$ , are not unique [cf. 18, Sect. 2.4.2]. One standard way to obtain a different particle approximation is merely changing the Feynman-Kac model to $(\tilde{M}_{p},\tilde{G}_{p})_{p=0}^{n}$ , but in such a way that

[TABLE]

holds, and running the PF (Algorithm 1) for the new Feynman-Kac model. From (6) and (10), it follows that the unbiased estimator from the PF run for $(\tilde{M}_{p},\tilde{G}_{p})_{p=0}^{n}$ delivers the same unbiased estimation for $p_{u}(\mathrm{d}x_{0:n})$ corresponding to the model $(M_{p},G_{p})_{p=0}^{n}$ . As an example for $(\tilde{M}_{p},\tilde{G}_{p})_{p=0}^{n}$ , consider

[TABLE]

in the sense of a Radon-Nikodým derivative [cf. 90], which always exists if $M_{p}$ and $\tilde{M_{p}}$ admit densities and a support condition holds.

This non-uniqueness opens the door to consider more efficient PF implementations for a particular model and filtering/smoothing problem [cf. 18, 28, 45, 75]. The question of the optimal choice of $(M_{p}^{\star},G_{p}^{\star})_{p=0}^{n}$ for the smoothing problem (i.e. unbiased estimation of $p_{u}(\mathrm{d}x_{0:n})$ ) has been considered in [45]. As the optimal choice is usually not implementable, [45] suggest an adaptive iterative algorithm, based on approximating families of mixtures of normals, in order to approximately find $M_{p}^{\star}$ and $G_{p}^{\star}$ (see also e.g. [50] for a related method). Deterministic approximations, such as Laplace approximations [cf. 83], can also be used as a substitute for the optimal transition $M_{p}^{\star}$ [1] (see also [60]). We emphasise that all the above mentioned approaches to the optimal choice problem achieve unbiased estimation of $p_{u}(\mathrm{d}x_{0:n})$ , as they use appropriately weighted potentials so that (11) holds.

Latent inference with respect to $p(\mathrm{d}x_{0:n})$ is possible through the PF, at least when the dynamics $M_{p}$ can be simulated, by using a ratio estimator targeting $p_{u}(\phi)/p_{u}(1)$ . That is, if $\{\hat{p}_{u,k}\}_{k=1}^{m}$ with $m\geq 1$ are formed999Traditionally in particle filtering [cf. 28], latent inference (12) is done with $m=1$ , possibly with a final resampling to form uniformly weighted particles, but final resampling leads to higher variance of the resulting estimator and is unnecessary here. as in (9) from independent runs of PF (Algorithm 1) for $(M_{p},G_{p})_{p=0}^{n}$ , then

[TABLE]

We remark that the above estimator (12), as mentioned for example in [15, Eq. 1], is an IS analogue of the ‘particle independent Metropolis-Hastings’ (PIMH) [1] chain for latent smoothing. The algorithm based on (12) is completely parallelisable and does not depend on mixing of a chain, and is therefore relatively resilient in the number of particles $N$ . Straightforward consistent estimators to construct confidence intervals are also available [cf. 1, Prop. 23].

2.4. Particle marginal Metropolis-Hastings

The main task for which we are interested is joint $\infty$ -inference with respect to $\pi^{(\infty)}(\mathrm{d}\theta,\mathrm{d}x_{0:n})$ . So far, we only have shown how to perform $\infty$ -inference for $p_{u}^{(\theta,\infty)}(\mathrm{d}x_{0:n})$ and $p^{(\theta,\infty)}(\mathrm{d}x_{0:n})$ , with $\theta$ fixed. Surprisingly [cf. 5, 9, 59, 69], joint inference is possible, using an MCMC known as the particle marginal Metropolis-Hastings (PMMH) [1].

Assuming $q(\theta^{\prime}|\theta)>0$ for all $\theta,\,\theta^{\prime}\in\mathsf{T}$ in the PMMH chain (Algorithm 2), or a similarly mild condition ensuring Harris ergodicity of the chain [cf. 66], the estimator formed from PMMH is strongly consistent: for $f\in L^{1}(\pi^{(\infty)})$ ,

[TABLE]

We remark about ‘Metropolis-Hastings type’ MCMC. The PMMH [1] is used in state space models using PFs, pseudo-marginal MCMC [5, 9, 59, 69] is the general term for the chain used in latent variable models with unbiased estimators, and Metropolis-Hastings MCMC [65, 48] is used in Bayesian models with tractable likelihoods. In fact, it is possible to view these ‘Metropolis-Hastings type’ MCMCs each as a substantiation of the other: one direction follows by viewing the pseudo-marginal MCMC and PMMH as full-dimensional Metropolis-Hastings kernels on an extended state space, while the other direction follows by trivialisation [cf. 5].

3. Accelerations based on an approximation

The Metropolis-Hastings MCMC has served as the backbone of the MCMC revolution for half of the last century [23], while pseudo-marginal MCMC and the PMMH have been quite popular and extensively used in the current century (see [2, Sect. $1.2$ ] for a review). Because of the importance of these MCMCs, there has been considerable interest in their possible acceleration. We focus on acceleration of the PMMH in the following.

Usually by far the most computationally intensive part of the PMMH is running the PF (Algorithm 1), for $(M_{p}^{(\theta,\infty)},G_{p}^{(\theta)})_{p=0}^{n}$ with output $(V^{(i)},\mathbf{X}^{(i)})_{i=1}^{N}$ , to obtain the unbiased estimator

[TABLE]

of the likelihood $L^{(\infty)}(\theta)$ . The idea of acceleration based on approximation is to substitute a computationally cheaper (non-negative unbiased estimator $\hat{L}^{(0)}(\theta)$ of an) approximation $L^{(0)}(\theta)$ for the $\infty$ -likelihood, instead of using $\hat{L}^{(\infty)}(\theta)$ . One would also like to maintain (strong) consistency of the resulting estimator for the $\infty$ -posterior.

3.1. Delayed acceptance and importance sampling

One such popular acceleration algorithm is delayed acceptance (DA) (Algorithm 3) [cf. 59, 8, 16, 17, 41, 61], with $\epsilon\geq 0$ .

We require that almost surely the support condition

[TABLE]

holds, so that the resulting weight $\hat{L}^{(\infty)}(\theta)/\big{(}\hat{L}^{(0)}(\theta)+\epsilon\big{)}$ in Algorithm 3(ii) is guaranteed well-defined. This can be simply achieved always by choosing a regularisation constant111111This will be done in Algorithm 6 given later, and is linked to ‘defensive importance sampling’ [51]. $\epsilon>0$ , leading to asymptotically exact $\infty$ -inference. We note that step (i) in DA (Algorithm 3) effectively acts as a screening stage: only ‘good’ proposals proceed to step (ii), where the expensive $\infty$ -model PF must be run. The resulting DA estimator for the $\infty$ -posterior is the same as that of PMMH, given in (14).

As an alternative to PMMH/DA, we consider MCMC-IS (Algorithm 4) [cf. 24, 37, 38, 48, 73, 1]. Here, for $f:\mathsf{T}\times\mathsf{X}^{n+1}\rightarrow\mathbb{R}$ we have set $f^{(\theta)}(x_{0:n})=f(\theta,x_{0:n})$ .

Assuming the Phase 1 chain is Harris ergodic (e.g. $q(\theta,\theta^{\prime})>0$ for all $\theta,\theta^{\prime}\in\mathsf{T}$ ) and the support condition (17) holds, like the PMMH/DA estimator, the MCMC-IS estimator is strongly consistent [1, Thm. 3]: for $f\in L^{1}(\pi^{(\infty)})$ ,

[TABLE]

Phase 1 of MCMC-IS (Algorithm 1) implements a PMMH (Algorithm 2) targeting marginally

[TABLE]

Phase 2 consists of independent calls of PF (Algorithm 1), and is therefore completely parallelisable, unlike DA (Algorithm 3). This allows for the possibility of substantial additional speedup on a parallel computer [cf. 58].

We remark about an acceleration technique known as ‘early rejection’ [93] for Metropolis-Hastings, that can sometimes be employed if the likelihood takes a special form, described below.121212A similar idea of early cancellation as ‘early rejection’ has been used previously in the exact simulation literature, under the name of ‘retrospective simulation’ [10]. The acceleration technique also applies to DA step (i) and MCMC-IS Phase 1, if $\hat{L}^{(0)}(\theta)=L^{(0)}(\theta)$ almost surely and $\epsilon=0$ . The form required in [93] is that the [math]-likelihood $L^{(0)}(\theta)$ can be written, for example, as

[TABLE]

with $l_{j}^{(\theta,0)}(y_{j})\geq 0$ . In this case, because the likelihood only gets smaller with more components of the product computed, the calculation of the components can be ended and the proposal rejected early in acceptance probability (15) for DA and MCMC-IS, as soon as the partially computed acceptance probability in (15) becomes smaller than the uniformly generated random variable [cf. 93, Sect. 4]. The ‘early rejection’ trick requires a special form for the likelihood, however, and therefore is not always applicable.

3.2. The question of relative efficiency

The delayed acceptance and importance sampling correction are two acceleration alternatives to the standard PMMH, both of which use the same approximation and algorithmic ingredients. The question of choice of alternative methods has been remarked before [17] in the simpler setting of Metropolis-Hastings, without unbiased estimators. Article [1] introduces the IS correction in the general case of unbiased estimators in both Phase 1 and Phase 2, and seeks to compare MCMC-IS with DA in the general setting.

A numerical comparison of the methods is done in [1], where the MCMC-IS approach was found to work slightly better than DA in experiments in SSMs, even without parallelisation. As an example of a computationally intensive experiment, a stochastic volatility model was considered with observation consisting of real data from daily financial index returns spanning two decades. Laplace approximations were used to approximate the [math]-likelihood, and were used as well in the IS correction, namely, for the approximation to the optimal choice131313as discussed in Section 2.3 of Feynman-Kac model for the smoothing problem for $p_{u}^{(\theta,\infty)}(\mathrm{d}x_{0:n})$ . With all methods making intelligent use of the Laplace approximations, MCMC-IS performed significantly better than PMMH or DA in the experiment.

In additional to the experiments, many additional potential enhancements were suggested in [1] which would improve the computational efficiency of MCMC-IS in practice, relative to DA acceleration of PMMH, even further. For example, the Phase 2 IS weights do not need to be calculated during the burn-in phase141414Additionally, the debiasing tricks [cf. 39] may be effectively and efficiently used. and for thinned out samples of the chain151515Thinning [cf. 72] denotes the procedure, in which only every $k^{th}$ sample of the Markov chain is kept, with say $k=10$ , in order to decrease the auto-correlation of samples., nor for repeated samples of the chain if the jump chain161616the chain formed formed from the original chain, consisting only of the accepted states of the original chain [cf. 27, 1] is used. As well, as previously mentioned, Phase 2 admits a straightforward parallelisation for calculation of the more expensive IS weights, which significantly increases the scalability and efficiency of MCMC-IS.

3.3. Peskun and covariance orderings of asymptotic variances

An estimator $E_{m}(f)$ is said to satisfy a central limit theorem (CLT), if

[TABLE]

In this case, we call $\sigma^{2}(f)$ the asymptotic variance of the estimator.

Without taking into account computational factors previously mentioned (which generally support the use of MCMC-IS; see also Section 4.6), and considering just the statistical efficiency of the estimators in terms of the asymptotic variance, it was found in [2] through artificially constructed toy examples that either MCMC-IS or PMMH/DA may do arbitrary better than the other. Moreover, the examples seemed to indicate that MCMC-IS might do better in cases of practical interest, with multi-modal targets, a phenomenon remarked previously about MCMC-IS and Metropolis-Hastings [e.g. 37]. Proving that the IS acceleration is usually ‘better’ than DA is of course a separate matter, which can not be done based on experiments or examples alone.

We first introduce some notation and terminology. A Markov kernel $K$ on $(\mathsf{X},\mathcal{X})$ is said to be reversible with respect to a probability $\mu$ , if for all $A,B\in\mathcal{X}$ ,

[TABLE]

We also define the Dirichlet form

[TABLE]

for $g\in L^{2}(\mu)$ , where $\left\langle g_{1},g_{2}\right\rangle_{\mu}\mathrel{\mathop{:}}=\int g_{1}(x)g_{2}(x)\mu(\mathrm{d}x)$ , $Kg(x)\mathrel{\mathop{:}}=\int K(x,\mathrm{d}x^{\prime})g(x^{\prime})$ and $(1g)(x)=g(x)$ .

The famous Peskun ordering [74, 95] says that if

[TABLE]

where $K$ and $L$ are two Markov kernels, both reversible with respect to a probability $\mu$ , then

[TABLE]

where $\sigma_{K}^{2}(f)$ and $\sigma_{L}^{2}(f)$ denote the asymptotic variances of the $K$ and $L$ chains, respectively.

Consider next a popular Peskun ‘type’ comparison result for asymptotic variances of reversible chains, known as the covariance ordering171717Though not mentioned by name, it was shown already in [95, Proof of Lem. 3] that the Peskun ordering is equivalent with the ‘covariance’ ordering. [67]: if $K$ and $L$ are two Markov kernels, both reversible with respect to a probability $\mu$ , and if

[TABLE]

then

[TABLE]

Compared to the Peskun ordering, the covariance ordering can be more useful in practice, as the criterion can distinguish better between chains on general state spaces. For example, some chains vanish along the diagonal, in which case (19) may be useless, but (21) may still be able to distinguish between these chains [cf. 67, 68].

As a simple application of the covariance ordering, let us consider the case of PMMH and DA, which are both reversible with respect to the same invariant measure (see [8] or Section 3.5). Using the identity

[TABLE]

which holds for any $\mu$ -reversible kernel $L$ , and that the product of the acceptance probabilities (15) and (16) in DA (Algorithm 3) is less than or equal to the acceptance probability (13) in PMMH (Algorithm 2), it can be shown [cf. 8] that the covariance ordering implies

[TABLE]

3.4. Peskun type ordering for importance sampling correction

Article [2] is concerned with extending the covariance ordering to chains $K$ and $L$ reversible with respect to probabilities $\Pi^{(0)}$ and $\Pi^{(\infty)}$ , where $\Pi^{(0)}$ and $\Pi^{(\infty)}$ may be different.

Suppose then that $K$ and $L$ are Harris ergodic chains on a space $(\mathsf{X},\mathcal{X})$ , where $K$ is $\Pi^{(0)}$ -reversible and $L$ is $\Pi^{(\infty)}$ -reversible. Suppose further that the Radon-Nikodým derivative181818This is the function $w$ satisfying $\Pi^{(0)}(wg)=\Pi^{(\infty)}(g)$ for all $g\in L^{1}(\Pi^{(\infty)})$ .

[TABLE]

exists. Let $\underline{c},\,\bar{c}\geq 0$ be constants such that

[TABLE]

for all $x\in\mathsf{X}$ and $g\in L^{2}(\Pi^{(0)})$ . Then [2, Thm. 2], for all $f\in L^{2}(\Pi^{(\infty)})$ with $\bar{f}\mathrel{\mathop{:}}=f-\Pi^{(\infty)}(f)$ , we have

[TABLE]

If $\Pi^{(0)}=\Pi^{(\infty)}$ , then it is direct to see that (23) simplifies to the covariance ordering (22) given earlier. Versions of the orderings (23-24) also hold for when the marginal weight is bounded in a latent variable setting [2, Thm. 5], and for self-normalised estimators using jump chain representation and unbiased estimators [2, Thm. 12] to compare with pseudo-marginal type MCMC. We discuss a particular implication of these orderings in the next section, namely MCMC-IS (algorithm 4) compared to PMMH (Algorithm 2) and DA (Algorithm 3).

3.5. Comparison results

We are now ready to compare MCMC-IS (Algorithm 4) with PMMH (Algorithm 2) and DA (Algorithm 3) in terms of the asymptotic variance. For simplicity, we assume deterministic approximation for the [math]-likelihood, that is, $\hat{L}^{(0)}(\theta)=L^{(0)}(\theta)$ almost surely.191919For the general case for $\hat{L}^{(0)}(\theta)$ , see [2, Thm. 14]. We note that the MCMC-IS chain is $\Pi^{(0)}$ -reversible, while the PMMH and DA chains are both $\Pi^{(\infty)}$ -reversible, with probabilities defined in the following.

Article [2] shows how a comparison can be made when the (marginal) weight between the approximate and exact model posteriors $w$ (or $\dot{w}$ ) is bounded (the weights $w$ and $\dot{w}$ are defined below). This follows from the extension of the covariance ordering to the IS context with unbiased estimators, mentioned earlier.

We first need to define some notation. Let $Q_{\theta}^{(\infty)}(\mathrm{d}{\mathbf{x}}^{(1:N)},\mathrm{d}v^{(1:N)})$ denote the law of the output $({\mathbf{X}}^{(1:N)},V^{(1:N)})$ of the PF (Algorithm 1) for the model $(M_{p}^{(\theta,\infty)},G_{p}^{(\theta)})_{p=0}^{n}$ . The full invariant probability of the PMMH (Algorithm 2) is then given by

[TABLE]

where $c_{\infty}$ is a normalising constant. The full invariant probability of the IS corrected chain (Algorithm 4) is

[TABLE]

where $c_{0}$ is a normalising constant. We set for a function $f:\mathsf{T}\times\mathsf{X}^{n+1}\to\mathbb{R}$ ,

[TABLE]

Assuming

[TABLE]

almost surely, for some $\epsilon\geq 0$ , the weights

[TABLE]

correspond to the Radon-Nikodým derivatives between the approximate and exact model full and marginal posteriors.

Let us now describe a CLT for MCMC-IS. As before, for a function $f\in L^{2}(\pi^{(\infty)})$ , we set $\bar{f}\mathrel{\mathop{:}}=f-\pi^{(\infty)}(f)$ . By [1, Thm. 7(i)], for $f\in L^{2}(\pi^{(\infty)})$ the MCMC-IS estimator (18) satisfies a CLT, with a formula for the MCMC-IS asymptotic variance given by

[TABLE]

assuming $\sigma_{IS}^{2}(f)<\infty$ , support condition (17) holds, and the marginal chain $(\Theta_{k})_{k\geq 1}$ of MCMC-IS (Algorithm 4) is Harris ergodic202020E.g. $q(\theta,\theta^{\prime})>0$ for all $\theta,\theta^{\prime}\in\mathsf{T}$ . and aperiodic212121See for example [66] for this and other definitions.. Here, $\sigma_{IS,1}^{2}(f)$ is the asymptotic variance of the marginal chain $(\Theta)_{k\geq 1}$ , that is,

[TABLE]

and

[TABLE]

with

[TABLE]

Note the decomposition of the MCMC-IS asymptotic variance (26) into marginal MCMC and IS correction components, which may be helpful in questions of tuning and allocation of computational resources. A similar decomposition is not expected to hold for the DA asymptotic variance.

We now state the comparison results between MCMC-IS and PMMH/DA. For functions $f\in L^{2}(\pi^{(\infty)})$ , such that the CLT and conditions given above for MCMC-IS hold, and assuming the PMMH and DA chains are Harris ergodic, we have the following comparison result [2, Thm. 14], with $\sigma_{L}^{2}$ equal to $\sigma_{PM}^{2}$ or $\sigma_{DA}^{2}$ :

[TABLE]

Note that $\sup\dot{w}\leq\sup w$ . We have, moreover, if also $\sup w<\infty$ ,

[TABLE]

The results show that the asymptotic variance of MCMC-IS and PMMH/DA can be related up to additive and multiplicative constants, which can be informative by (27) in practical cases where the marginal weight $\dot{w}$ is bounded, where $\dot{w}$ relates the ratio of likelihoods. We note that (29) is usually not helpful since a positive lower bound on $w$ is usually not possible, while (27) and (28) do not require a positive lower bound, and are therefore more generally applicable, providing theoretical guarantees for MCMC-IS in terms of PMMH/DA. Another nice facet of (27-29) is that the function $f\in L^{2}(\Pi^{(\infty)})$ is allowed to be a function on $\mathsf{T}\times\mathsf{X}^{n+1}$ , not only on $\mathsf{T}$ .

Also shown in [2] is the not too surprising fact that geometric ergodicity of the MCMC-IS augmented chain is inherited by its marginal chain. This relates the fact that the convergence and mixing of the MCMC-IS chain is not affected by the noise in the Phase 2 unbiased estimators, unlike PMMH and DA, which are very dependent on the noise, and are not geometrically ergodic if the unbiased estimator is unbounded [cf. 5, 2]. Of course, the asymptotic variance (26) of the MCMC-IS estimator (18) depends on the noise, but it seems it is not as harmful in the output estimator compared to in the acceptance ratios (13) and (16) of PMMH and DA, respectively. Besides convergence and mixing, geometric ergodicity is also likely helpful for example in estimation of the asymptotic variance [cf. 32], as well as in verifying convergence of adaptive MCMC schemes [cf. 6], at least based on the existing theory.

There is room for further theoretical development. For example, quantification of the error of MCMC-IS and of the asymptotic variance, could be investigated along the lines of [32, 79]. Also, in terms of non-asymptotic error bounds, results for MCMC [e.g. 100, 63, 81] could likely be extended to MCMC-IS.

4. Bayesian inference for state space models with diffusion dynamics

The PF (Algorithm 1) for the Feynman-Kac model $(M_{p}^{(\theta,\infty)},G_{p}^{(\theta)})$ requires that the samples can be drawn from the Markov transition kernels $M_{p}^{(\theta,\infty)}$ . However, as discussed at the end of Section 2.1, in many settings important for real applications, the assumption that the dynamics can be simulated does not hold.

We consider the case where the model $(M_{p}^{(\theta,\infty)},G^{(\theta)})$ stems from a discretely and partially observed Itô diffusion process. Suppose $(X_{t}^{\prime})_{t\geq 0}$ solves an Itô stochastic differential equation of the form

[TABLE]

where $\{W_{t}\}_{t\geq 0}$ is a standard Brownian motion. As in Section 2.1, we assume that there is some observational process $(Y_{t}^{\prime})_{t\geq 0}$ , and that observations $\{Y_{t_{p}}^{\prime}\}_{p=0}^{n}$ are obtained at discrete times $\{t_{p}\}_{p=0}^{n}$ . With $X_{p}\mathrel{\mathop{:}}=X_{t_{p}}^{\prime}$ and $Y_{p}\mathrel{\mathop{:}}=Y_{t_{p}}^{\prime}$ , and with $G_{p}^{(\theta)}(x_{p})\mathrel{\mathop{:}}=g_{p}^{(\theta)}(y_{p}|x_{p})$ , we obtain a model $(M_{p}^{(\theta,\infty)},G_{p}^{(\theta)})_{p=0}^{n}$ which additionally satisfies the SSM conditions (7-8).

In some, essentially one-dimensional diffusion settings, where the Lamperti transformation [cf. 70] can be applied, $\infty$ -inference is possible for $p^{(\theta,\infty)}$ [10, 11, 31] and $\pi^{(\infty)}$ [87, 97]. Article [3] attempts to extend to more settings $\infty$ -inference for $p^{(\theta,\infty)}$ and $\pi^{(\infty)}$ in a computationally feasible way. The approach of [3] is based on Euler approximations of the dynamics [cf. 56], multilevel Monte Carlo (MLMC) [49, 36], a particle filter coupling [53], debiasing tricks for MLMC [64, 77], and an IS type correction [1]. We introduce each of these in turn in the following.

4.1. Euler approximations

The Euler approximation amounts to defining a discretisation size $h_{\ell}\propto 2^{-\ell}$ for $\ell\in\mathbb{N}\cup\{0\}$ , and replacing the dynamics of the latent process $(X_{t}^{\prime})_{t\geq 0}$ with a discrete-time Markov chain,

[TABLE]

Here, $(W_{t})_{t\geq 0}$ is a standard Brownian motion, so that $W_{t+h_{\ell}}-W_{t}\sim\mathrm{N}(0,h_{\ell})$ is independent of $X_{u}^{\prime}$ , $u\leq t$ .

The approximate dynamics corresponds to an approximate transition $M_{p}^{(\theta,\ell)}$ , which, together with the conditionally independent observations, results in a model $(M_{p}^{(\theta,\ell)},G_{p}^{(\theta)})$ satisfying the SSM conditions (7-8), with $\ell$ -smoother $p^{(\theta,\ell)}(\mathrm{d}x_{0:n})$ given in (3) in Section 2.2, and with joint $\ell$ -posterior $\pi^{(\ell)}(\mathrm{d}\theta,\mathrm{d}x_{0:n})$ given in (5).

Joint $\ell$ -inference for $\pi^{(\ell)}$ is possible using PMMH (Algorithm 2) [1], which has been quite popular in the setting of diffusions [cf. 42]. Another $\ell$ -inference method [53], which uses PMMH together with a ‘multilevel’ decomposition, is discussed in the following section. We reiterate that, in distinction to these methods, the goal in [3] is to develop a $\infty$ -inference method (which is also computationally efficient).

4.2. Multilevel Monte Carlo

The idea of MLMC is based on telescoping sums, where each summand is coupled in such a way that leads to lower variance of the resulting estimator [49, 36]. The multilevel decomposition used in [53], for $\ell_{F}$ -inference in partially observed diffusions, is based on the telescoping sum in terms of expectations of normalised probabilities,

[TABLE]

with $\ell_{F}\geq 1$ ideally taken quite large. PMMH chains are run at level $\ell$ and at level $\ell-1$ in each summand, and are coupled to each other using the ‘approximate coupling’ described below.

In [3], such a telescoping sum is used not to target an expectation (and where the normalising constants must be simultaneously estimated in each summand), but rather an integral taken with respect to an unnormalised $\ell_{F}$ -smoother,

[TABLE]

with $\ell_{F}\geq 1$ ideally taken quite large. The quality of the approximation as measured by the variance depends on the coupling used for each increment

[TABLE]

The algorithm used in [3] to unbiasedly estimate this difference is given in Algorithm 5, which we refer to as the ‘delta PF’ ( $\Delta$ PF). The coupling used is the ‘approximate coupling’ of [53]. This coupling is based on a change of measure of the Feynman-Kac model on a joint path space, which, together with an importance sampling correction of the particle filter output, leads to the $\Delta$ PF.

4.3. Coupling of Feynman-Kac models

Suppose $(M_{p}^{(\theta,\ell)},G_{p}^{(\theta)})_{p=0}^{n}$ and $(M_{p}^{(\theta,\ell-1)},G_{p}^{(\theta)})_{p=0}^{n}$ are two Feynman-Kac models. We describe a coupling of them as follows. For some fixed $\ell\geq 1$ , $\check{M}_{p}^{(\theta,\ell)}(\check{x}_{0:p-1},\mathrm{d}\check{x}_{p})$ is assumed to be a coupling of the $\ell$ and $\ell-1$ level transitions, that is,

[TABLE]

for $A\in\mathcal{B}(\mathsf{X})$ and with the notation $\check{x}_{0:p}=(x_{0:p}^{(\ell)},x_{0:p}^{(\ell-1)})$ denoting an element in the space $\mathsf{X}^{2(p+1)}$ , and we set

[TABLE]

The dynamics $\check{M}_{p}^{(\theta,\ell)}$ is typically obtained in the diffusion context by using a common Brownian path for mesh discretisation levels $\ell$ and $\ell-1$ . Other choices for $\check{G}_{p}^{(\theta)}$ are possible then the choice (31) used in [3]. The important point is that $\check{G}_{p}^{(\theta)}(\check{x}_{0:p})>0$ whenever $G_{p}^{(\theta)}(x_{0:p}^{(\ell)})>0$ or $G_{p}^{(\theta)}(x_{0:p}^{(\ell-1)})>0$ . This ensures that the estimator $\Delta^{(\theta,\ell)}(\phi)$ from the $\Delta$ PF (Algorithm 5) is unbiased [3, Prop. 3]: for bounded $\phi:\mathsf{X}^{n+1}\rightarrow\mathbb{R}$ ,

[TABLE]

We can then estimate $p_{u}^{(\theta,\ell_{F})}$ unbiasedly using MLMC. Namely,

[TABLE]

where

[TABLE]

with $\{\hat{p}_{u,i}^{(\theta,0)}(\phi)\}_{i=1}^{m_{0}}$ independently run versions of the estimator $\hat{p}_{u}^{(\theta,0)}(\phi)=\sum_{i=1}^{N}V^{(i)}\phi(\mathbf{X}^{(i)})$ from the output $(V^{(1:N)},\mathbf{X}^{(1:N)})$ of the PF (Algorithm 1) run for the model $(M_{p}^{(\theta,0)},G_{p}^{(\theta)})_{p=0}^{n}$ , and where

[TABLE]

with $\{\Delta_{i}^{(\theta,\ell)}(\phi)\}_{i=1}^{m_{\ell}}$ estimators formed from independent runs of the $\Delta$ PF (Algorithm 5) run for the model $(\check{M}_{p}^{(\theta,\ell)},\check{G}_{p}^{(\theta)})_{p=0}^{n}$ .

The approach based on (32) allows for efficient MLMC estimation of the unnormalised $\ell_{F}$ -smoother $p_{u}^{(\theta,l_{F})}$ , over the latent states. If we were content with joint $\ell_{F}$ -inference, then we could apply already the IS type correction of MCMC as in Algorithm 4, with regularised ‘likelihood’ estimate

[TABLE]

in the acceptance ratio (15), with $\epsilon\geq 0$ , and with IS weights

[TABLE]

which are allowed to take negative values [cf. 1]. This would provide an efficient MLMC alternative method to the PMMH or the algorithm in [53] for inference with respect to $\pi^{(l_{F})}$ . Instead, we wish to go one (infinite!) step further, and target $\pi^{(\infty)}$ .

4.4. Debiasing techniques

Debiased MLMC [64, 77, 96] is based on randomising the running level used in deterministic MLMC (with a reweighting), as follows.

We assume that $(p_{\ell})_{\ell\geq 1}$ is a probability mass function (p.m.f.) on $\mathbb{N}$ satisfying $p_{\ell}>0$ for all $\ell\geq 1$ . We also assume that

[TABLE]

for all bounded $\phi:\mathsf{X}^{n+1}\rightarrow\mathbb{R}$ , which is not too difficult to verify in our setting under certain boundedness assumptions, because of the known convergence properties of the Euler approximation [cf. 56]. With $L\sim(p_{\ell})$ , the single-term debiased MLMC estimator of [77] in our case is given by $p_{L}^{-1}\Delta^{(\theta,L)}(\phi)$ , which satisfies

[TABLE]

Adding an independent ‘zeroth level’ estimate $\hat{p}_{u}^{(\theta,0)}(\phi)\mathrel{\mathop{:}}=\sum_{i=1}^{N}V^{(i)}\phi(\mathbf{X}^{(i)})$ , formed from the output $(V^{(i)},\mathbf{X}^{(i)})_{i=1}^{N}$ of PF (Algorithm 1) run for the model $(M_{p}^{(\theta,0)},G_{p}^{(\theta)})_{p=0}^{n}$ , we set

[TABLE]

to obtain that

[TABLE]

By using a self-normalised estimator to take care of the normalising constant, this already allows for consistent inference over the latents. That is, as in (12) of Section 2.3, if $\{\tilde{\Delta}_{k}^{(\theta)}\}_{k=1}^{m}$ for $m\geq 1$ are independently run to form estimator functionals of the form (33), then

[TABLE]

as $m\to\infty$ [3, Prop. 7].

4.5. Joint inference using importance sampling type correction

Recall that our original goal was joint $\infty$ -inference (for $\pi^{(\infty)}$ ). To do this, we will use Algorithm 6, which is similar to Algorithm 4, but which uses a multilevel IS type correction based on the randomised $\Delta$ PF output. Consistency was also detailed in [1] for IS type correction involving negative weights as in Algorithm 6, which can occur frequently in the multilevel context which we consider here.

The likelihood support condition (17) mentioned for Algorithm 4 can be achieved by using $\epsilon>0$ .222222It is closely linked to ‘defensive importance sampling’ [51], but its optimal choice in terms of efficiency is not known. We also need finiteness of the variance of the randomised $\Delta$ PF, $p_{L}^{-1}\Delta^{(\theta,L)}$ , in order to guarantee that the debiased MLMC works correctly [cf. 77], and that the MCMC-IS (Algorithm 6) can have finite asymptotic variance [cf. 3, Prof. 13]. That is, we need to show that

[TABLE]

is finite, uniformly in $\theta\in\mathsf{T}$ . This requires showing that the variance of $\Delta^{(\theta,\ell)}(\phi)$ decays at a sufficient rate relative to $p_{\ell}$ as $\ell$ increases.

Under some standard (stringent) assumptions used elsewhere in the literature, the results of the technical analysis are formulated in [3, Cor. $9$ ]. In the case of standard Euler approximation, the result says that

[TABLE]

where $C>0$ is a constant which does not depend on $N_{\ell}\geq 1$ , $\ell\geq 1$ , or $\theta\in\mathsf{T}$ , where $N_{\ell}$ particles are used in the $\Delta$ PF run at level $\ell$ . Hence, with $N_{\ell}=N$ constant, by taking $p_{\ell}\propto 2^{-r\ell}$ , with $r<1$ , (34) will be finite. More generally, if $N_{\ell}\propto 2^{\rho\ell}$ with $\rho\in[0,1]$ , then we see that we can take $p_{\ell}\propto 2^{-r\ell}$ with $r<1+\rho$ , so that (34) will be finite.

The assumptions needed to prove the bound (35) in [3] are on the diffusion [e.g. 56], in terms of uniform ellipticity and globally Lipschitz diffusion terms, as well as on the Feynman-Kac model [e.g. 18], in terms of globally Lipschitz potentials and transitions and lower and upper bounded potentials. The results of the analysis are based on a global error martingale decomposition [cf. 18] in terms of the local sampling error of the particle filter run for the coupled Feynman-Kac model, and on an analysis of the $\Delta$ PF in the diffusion context.

4.6. Computational efficiency and allocations

We have seen that under some assumptions, the finiteness of the variance of the randomised $\Delta$ PF can be verified for any $N_{\ell}\geq 1$ and for sufficiently heavy-tailed $(p_{\ell})$ [3, Cor. 9]. However, the use of a heavy-tailed p.m.f. $(p_{\ell})$ can lead to excessive use of computational resources, and we must therefore try to use thinner-tailed p.m.f.s $(p_{\ell})$ and optimal number of particles $N_{\ell}$ at level $\ell$ in order to minimise the inverse relative efficiency (IRE) [40] which measures the computational cost.

Let $(\Theta_{k})_{k\geq 1}$ be the marginal Markov chain of Algorithm 6, and $L_{k}\sim(p_{\ell})$ for $k\geq 1$ . With terminology similar to [40], who consider the i.i.d.232323independent and identically distributed case for $(\tau_{k})_{k\geq 1}$ , we assume that the total computational cost to run Algorithm 6 for $m$ iterations is

[TABLE]

where $(\tau_{k})_{k\geq 1}$ are conditionally independent positive random variables given $(\Theta_{k},L_{k})_{k\geq 1}$ , where $\tau_{k}$ depends only on $\Theta_{k}$ and $L_{k}$ . Given some budget $\kappa\in\mathbb{R}_{\geq 0}$ , the realised length of the chain is

[TABLE]

Then, if for some number $\tau>0$ ,

[TABLE]

and if the MCMC-IS estimator satisfies a CLT with asymptotic variance $\sigma^{2}(f)$ , then [40, 3]

[TABLE]

and $\tau\sigma^{2}(f)$ is the IRE. We thus extend the discussion of [40] to non-i.i.d. $(\tau_{k})_{k\geq 1}$ .

Using this computational efficiency framework, similar to [77] who consider the i.i.d. case in traditional MLMC, it is possible to consider the matter of computational complexity and optimal allocation of resources in Algorithm 6. Suppose a CLT holds for the MCMC-IS estimator of Algorithm 6 with finite asymptotic variance [cf. 3, Prop. 13]. Let $\epsilon>0$ and $\delta\in(0,1)$ be given. In order to have

[TABLE]

by the Chebyshev inequality and using the standard $m^{-1}$ mean squared error convergence rate for MCMC, we need that $m$ is of order $\epsilon^{-2}$ , denoted $m=O(\epsilon^{-2})$ .242424That is, with $m=m(\epsilon)$ , we have $m(\epsilon)/\epsilon^{-2}\to C$ as $\epsilon\to 0$ , some $C>0$ . The question is then how we can minimise the computational complexity given by $\mathscr{C}(m)$ when $m=O(\epsilon^{-2})$ , by adjusting $p_{\ell}$ and $N_{\ell}$ , while keeping the variance (34) finite.252525Besides for the debiasing [77] to work, the asymptotic variance [see 3, Prop. 13] of the MCMC-IS estimator of Algorithm 6 has a part from the marginal MCMC, as well as from the IS type correction. The latter is finite if the variance (34) is finite. Assuming

[TABLE]

where $C$ does not depend on $\theta$ or $\ell$ , then it is shown in [3, Prop. 24] that for all $q>2$ , $\eta>1$ , the computational cost

[TABLE]

can be obtained for sufficiently small $\epsilon$ , if $p_{\ell}$ and $N_{\ell}$ are chosen to be

[TABLE]

for $\rho\in[0,1]$ . This choice for $p_{\ell}$ and $N_{\ell}$ ensures that the variance (34) is finite, and suggests262626by disregarding the factor $\ell[\log_{2}(\ell+1)]^{\eta}$ in (37) the choice

[TABLE]

for $\rho\in[0,1]$ . The computational cost (36) is the same as that of [77, Prop. 4] for the single-term estimator in traditional, randomised MLMC. It is also very close to

[TABLE]

(recall $q>2$ ), which is the well-known computational complexity order [36] in the traditional, deterministic MLMC.

The result (37) shows that in case of Euler approximation, there is in fact a parametrisation of recommended choices for particle number $N_{\ell}$ and p.m.f. $(p_{\ell})$ , all of which share the same order of computational complexity to obtain a given precision, under certain assumptions such as previously explained for $\tau_{k}$ . Then (37) (or the simplified suggestion (38)) should lead to a proper usage of computational resources, in order to keep both the asymptotic variance and the total cost jointly small, and therefore the IRE small. The order of computational complexity is the same along the parametrisation in terms of $\rho\in[0,1]$ , but it is still unknown whether a certain choice of $\rho$ will usually lead to the best choice for $N_{\ell}$ and $(p_{\ell})$ . In an experiment in [3] concerning a geometric Brownian motion, the choice $\rho=0$ performed better than the choice $\rho=1$ in the allocation (37). We leave, for now, the optimal choice of $\rho$ for future research and experiment.

5. Inference via approximate Bayesian computation

We assume a Bayesian model as in Section 1.2, with fixed observation denoted $y^{*}\in\mathsf{Y}$ , prior $\mathrm{pr}(\theta)$ , and likelihood $L(\theta)=p^{(\theta)}(y^{*})$ , which is assumed to be intractable. Although the data distribution $p^{(\theta)}(\,\cdot\,)$ can not be evaluated, we assume that it is possible to sample data $y\sim p^{(\theta)}(\,\cdot\,)$ from it. Let $d(y,y^{\prime})$ be a pseudo-metric272727 That is, for all $y_{1},y_{2},y_{3}\in\mathsf{Y}$ , it holds $d(y_{1},y_{2})\geq 0$ , $d(y_{1},y_{2})=d(y_{2},y_{1})$ , and $d(y_{1},y_{3})\leq d(y_{1},y_{2})+d(y_{2},y_{3})$ . For example, $d(y_{1},y_{2})=\left\lVert s(y_{1})-s(y_{2})\right\rVert$ where $s:\mathsf{Y}\to\mathbb{R}^{n_{y}}$ is some (summary) statistic [cf. 76]. on $\mathsf{Y}^{2}$ . With tolerance $\epsilon>0$ , we then define

[TABLE]

Approximate Bayesian computation (ABC) (see [92] for a review) is based on using the family $\mathscr{P}_{1/\epsilon}\mathrel{\mathop{:}}=\{p^{(\theta,1/\epsilon)}\}_{\theta\in\mathsf{T}}$ of approximate probabilities, where

[TABLE]

with ABC likelihood,

[TABLE]

Then $\mathscr{P}_{1/\epsilon}$ become families of increasingly ‘better’ approximations as $\epsilon$ goes to [math]. However, it is important to keep in mind that it is only approximate even in the limit, since in general,

[TABLE]

The ABC posterior is then given by

[TABLE]

A method of inference for the ABC posterior which we consider is the ABC-MCMC (Algorithm 7), as suggested by [62], which may also be viewed as a pseudo-marginal MCMC [5], with

[TABLE]

5.1. Choosing the tolerance in ABC-MCMC

The choice of tolerance $\epsilon$ is a difficult question in ABC-MCMC [cf. 91]. Namely, a large choice of $\epsilon$ leads to large bias, but to computational inefficiency if $\epsilon$ is small. To see this, note that if $\epsilon$ is small, then a proposed state is hardly ever accepted, since $\mathbf{1}\big{\{}d(Y_{k}^{\prime},y^{*})\leq\epsilon\big{\}}$ is usually [math]. If $\epsilon$ is large, then $L^{(1/\epsilon)}(\theta)\approx 1$ is nearly constant in $\theta$ and so ABC-MCMC is essentially targeting the prior model, which is uninformative for Bayesian posterior inference.

Article [4] attempts to deal with the issue of tolerance choice in ABC-MCMC, by using an inflated and adaptively tuned tolerance parameter in order to maximise efficiency of the MCMC, and then to use a post-correction, importance sampling step, to remove bias [98] as well as to quantify uncertainty with proposed approximate confidence intervals.

The tolerance adaptive ABC-MCMC (Algorithm 8), which is run during burn-in for some number of iterations $n_{b}$ , is an adaptive MCMC [cf. 6] targeting a user-specified overall acceptance probability $\alpha^{*}\in(0,1)$ . In experiments in [4], a value of $\alpha^{*}=10\%$ was used, which ensures sufficient mixing and number of different samples from the MCMC. We provide convergence theorems in [4] for the adaptive algorithm under two sets of assumptions. The simpler set of assumptions essentially requires that the proposal $q(\theta^{\prime}|\theta)>0$ is uniformly bounded away from zero, and $\epsilon_{k}$ is bounded away from zero for all $k\geq 1$ almost surely. The former assumption on $q$ is removed in the more general set of assumptions. Removing the assumption on $\epsilon_{k}$ might be possible, based on projections [cf. 4].

5.2. Approximate confidence intervals

An approximate estimator for the asymptotic variance of the post-corrected ABC-MCMC has been suggested in [4, Alg. 6], which can be used for the construction of (approximate) confidence intervals.

Suppose that $\hat{\tau}_{\epsilon_{0}}(f)$ is an estimate for the integrated auto-correlation time for ABC-MCMC( $\epsilon_{0}$ ),

[TABLE]

perhaps using a windowed sample auto-correlation estimator [cf. 47]. Also define the following estimator for the function variance,

[TABLE]

The approximate confidence interval then takes the form

[TABLE]

where $\beta>0$ corresponds to the standard normal quantile.

We remark that there is some theoretical backing for the approximate confidence interval, based on an exact formula for the integrated auto-correlation time of the post-corrected chain [4, Thm. 7]. The relevance of the approximate confidence intervals is also verified in some experiments in [4].

5.3. Adaptive ABC-MCMC with post-correction

The approach of [4] then takes the form of Algorithm 9. In regards to the adaptive ABC-MCMC, also the proposal covariance matrix $q$ is best updated as in [46, 3]. The estimator $E_{\epsilon_{0},\epsilon}(f)$ can be calculated effortlessly for all $\epsilon\in(0,\epsilon_{0}]$ by sorting beforehand the samples $\Theta_{k}$ according to their corresponding distances $T_{k}$ .

In experiments in [4], for example in a Lotka-Volterra model involving two reagents and three reactions [cf. 13], it was found that Algorithm 9 delivers a robust approach to inference in ABC models. In particular, the post-processing estimators were found to be competitive with direct ABC-MCMC with pre-tuned tolerance and starting value, the approximate confidence interval provided good coverage, and the adaptive ABC-MCMC allowed for essentially arbitrary initial choice of tolerance and starting value from the prior.

Compared to direct ABC-MCMC( $\epsilon$ ), the approach based on slightly inflated tolerance and post-correction, was shown to be competitive in experiments in [4]. An upper bound for the asymptotic variance of the ABC-MCMC( $\epsilon_{0}$ ) with post-correction to $\epsilon$ , in terms of that of a direct ABC-MCMC( $\epsilon$ ), is given in [4, Thm. 8]. It is a direct application of the Peskun type ordering for importance sampling [2] stated previously in (23), where the upper bound guarantee becomes an equality as $\lvert\epsilon_{0}-\epsilon\rvert\to 0$ .

5.4. Convergence of the tolerance adaptive ABC-MCMC

We briefly discuss the general approach to the convergence proofs of the tolerance adaptive ABC-MCMC. To obtain a setup fitting within the framework of stochastic approximation [cf. 3, 4], we write the tolerance adaptation update in Algorithm 8(iv) as

[TABLE]

where $H_{\epsilon}(\theta,\theta^{\prime},y^{\prime})\mathrel{\mathop{:}}=\alpha^{*}-\alpha_{\epsilon}(\theta,\theta^{\prime},y^{\prime}),$ with $\alpha_{\epsilon}$ defined in (39), with ’mean field’

[TABLE]

and centred ‘noise’ sequence $\eta_{k+1}\mathrel{\mathop{:}}=H_{\epsilon_{k}}(\Theta_{k},\Theta_{k}^{\prime},Y_{k}^{\prime})-h(\epsilon_{k}).$ In this common framework for stochastic approximation algorithms, we can apply [4, Theorem 2.3], which implies that the key lemma for the proof of convergence of the tolerance adaptive ABC-MCMC (Algorithm 8) essentially reduces to showing that the noise sequence $\eta_{k}$ is asymptotically controlled,

[TABLE]

[4, Lemma 20]. This relies on various ancillary results, such as monotonicity of the map $\epsilon\mapsto h(\epsilon)$ , continuity and contraction properties of the Markov kernels, and a generalisation of the ‘proposal augmentation’ from Metropolis-Hastings chains [85, 82] to ‘proposal-rejection’ chains. Here, we call a kernel $K$ a ‘proposal-rejection’ kernel if it is reversible and can be written as

[TABLE]

where $\alpha(\theta,\theta^{\prime})\in[0,1]$ is a measurable function on $\mathsf{T}^{2}$ . By marginalising away the auxiliary variable in ABC-MCMC( $\epsilon$ ) (7), we obtain such a ‘proposal-rejection’ kernel, with

[TABLE]

which is clearly not a Metropolis-Hastings kernel any longer.

Non-standard theoretical challenges of the tolerance adaptive ABC-MCMC (Algorithm 8) are that the invariant measure $\pi^{(1/\epsilon_{k})}$ is changing at each iteration, and that the chain is technically a pseudo-marginal. Regarding this latter point, however, we do have simplification to independent refreshments of the auxiliary variable $y^{\prime}$ , because of the use of a simple cut-off function $\mathbf{1}\big{\{}d(\,\cdot\,,y^{*})\leq\epsilon\big{\}}$ in the acceptance ratio. As mentioned in Section 5.1, essentially, the convergence theorems for the adaptation are formulated in a simpler setting of uniform ergodicity, as well as for simultaneously geometrically ergodic ‘proposal-rejection’ chains, obtained by only considering the marginal chain $(\Theta_{k})_{k\geq 1}$ of the original chain $(\Theta_{k},Y_{k})_{k\geq 1}$ , on possibly unbounded state space domains.

6. Discussion and directions for future work

In this thesis, various old and new Monte Carlo estimators are presented. A defining feature of the estimators suggested is that they involve an IS type correction of samples drawn according to an intermediate approximate distribution. Basic convergence properties of the suggested estimators are established, and efficiency of these algorithms is studied and related to standard direct methods used hitherto commonly in practice.

There is still much interesting work that could be done in regards to the use of these estimators in different settings and with different approximations. Experimental results have been promising, and suggest further comparisons could be made, for example, of PIMH and its IS analogue (12). In the parameter inference setting, there have been many MCMC implementations making use of an approximation by applying delayed acceptance [see 2, Section 7.2], but very few using MCMC-IS (see [73] for one other non-academic example). One of the main goals of [1] is to bring attention to MCMC-IS, that it represents a viable approach, which enjoys flexibility in implementation and theoretical backing.

In the filtering and smoothing context, the approach for optimal selection of Feynman-Kac model for the smoothing problem [45] based on deterministic approximations, as used in [1] and further developed in [60] for an extended class of models, could be further developed. These approximations could also be based on various other non-linear filters and smoothers [cf. 84].

There are various directly applicable innovations which could be incorporated into MCMC-IS, and we mention a few. Quasi-Monte Carlo may be helpful in MCMC-IS, whether in the MCMC [cf. 86] or in the PF [35]. Work on exact simulation [12] techniques for diffusions [10, 11, 31] (see also the recent preprint [97]) and jump-diffusions [43] using continuous-time IS techniques is showing progress, and suggests parameter inference methods for partially observed versions could be developed, at least in the one-dimensional setting, using the MCMC-IS framework, with IS correction based on a PF using exact simulation dynamics, or based on other types of randomised weights, which may freely assume negative values in the IS correction.

It would be of interest to adapt the tuning guidelines [29] (see also [89]) for the PF when used in the PMMH, to the case when used within MCMC-IS. The formula (26) for the MCMC-IS asymptotic variance, which decomposes into marginal chain and IS correction parts, could also be useful in this regard. More generally, beyond PMMH, it would be beneficial to use better scaling MCMCs within MCMC-IS, for example, particle Gibbs [1], which is known to scale very well with backward sampling [cf. 57]. Additional annealing steps may be useful, as part of the Metropolis-Hastings with asymmetric acceptance ratio (MHAAR) approach [cf. 2]

In [2], more practical examples could be given showing bounds of likelihood ratios and usefulness of the results in practice. Further comparisons could be made, for example, with annealed IS [71] correction versus multi-stage DA [8]. Two different extensions of traditional DA correction were introduced in [2], and it would be interesting to study the stability properties of these new DA corrections, for example, along the lines of [5, 88]. Other more sophisticated reversible chains as in [2] with IS correction could be considered and compared. The effect of debiasing tricks [39] could be compared between MCMC-IS and pseudo-marginal type MCMC, where the coupling time integral to the debiasing approach may be considerably less for MCMC-IS if Phase 1 is based on deterministic approximation and Phase 2 involves noisy unbiased estimators.

There are many settings where there is a multilevel type structure and the debiasing techniques can be applied. In the joint inference setting, the IS-debiasing method as presented in [3] allows for an efficient debiasing strategy for joint inference using Euler approximations. The results could be generalised to Itô diffusions with time-dependent path-dependent coefficients, and to general resampling schemes in the $\Delta$ PF besides multinomial resampling. It would be nice to apply the IS-debiasing strategy in various settings, for example, to jump-diffusions [cf. 22, 54]. The coupling [53] and multilevel approach to the (unnormalised) smoothing problem [3], with possible randomised MLMC correction [64, 77], could be applied, for example, to the problem of calculation of normalisation constants [cf. 20] important for model selection. The optimal choice of coupled dynamics and potential could be studied, where we remark that the coupled potentials may be made level dependent, which is an additional degree of freedom. It would also be of interest to study stability and limit theorems [15, 18, 26] of these coupled PFs [cf. 55] based on change of reference measure and IS reweighting for use in unnormalised multilevel estimators as in [3].

We are currently looking into optimal tuning of the regularisation constant in the approximate likelihood estimator within MCMC-IS, which is connected to ‘defensive importance sampling’ [51]. The question of efficiency and proper allocation of resources of the MCMC-IS carries over to the multilevel and PF setting, where additionally multilevel aspects play a rôle. The question of optimal scaling particles versus level in the sub-canonical regime associated to Euler approximations was not entirely conclusive in [3]. It would be interesting to study this phenomenon in more depth. This may entail adapting the non-canonical CLT of [99] in the diffusion setting to the partially observed diffusion setting where number of particles and particle approximation variances are additional factors.

Applied in the ABC context in [98], the post-correction (or trimming) over a range of tolerances is a methodological approach applicable in other Monte Carlo settings where IS can be applied at small additional cost, for example, in the MLMC context, with the sum of multilevel increments computed sequentially over an increasing range of the fine tolerances, with corresponding plots. In such settings, it may also be possible to derive analogous approximate confidence intervals for the resulting estimators as in [4]. The tolerance adaptive ABC-MCMC in [4] was based on targeting a user-specified overall acceptance probability, and we chose a target close to the rule of thumb from the more general random walk PM literature [89]. It may be interesting to adapt the assumptions of [29, 89] to ones more resembling the ABC context, in order to find a perhaps different ‘rule of thumb’ for ABC-MCMC. The tolerance adaptation was also found to benefit the covariance adaptation during the burn-in, likely due to the improved mixing in the initial stages of the algorithm. It would be of interest to study this phenomenon and the interplay of different optimisation criteria in more depth, following, for example, the theoretical developments of adaptive MCMC as in [3, 4, 6].

The ‘proposal-rejection’ chains (42), which were considered for DA correction [2] and ‘proposal augmentation’ [4], are generalisations of Metropolis-Hastings chains, which include DA [8], PM [5], MHAAR [2] and marginalised ABC-MCMC [4]. Although ‘proposal-rejection’ chains technically include pseudo-marginal chains, we lay here particular emphasis on the possible course of study of (simpler) chains on the marginal (parameter) space, without auxiliary variable extensions like in pseudo-marginal MCMC. Many results worked out for Metropolis-Hastings can likely be extended to the marginal-space ‘proposal-rejection’ setting. For example, the waste-recyclers of [21, 82, 85], originally for Metropolis-Hastings, could be extended to ‘proposal-rejection’ chains. Some convergence analysis has been done for pseudo-marginal Metropolis-Hastings chains [cf. 5, 7] and some of this type of analysis could possibly be adapted to marginal-space ‘proposal-rejection’ chains. Following the line of argument of [52, 78], who show geometric ergodicity of symmetric random walk Metropolis-Hastings essentially if the target has exponential or lighter tails and a certain contour condition holds, it would be interesting to work out conditions for a similar type of result for the more general sub-class of marginal-space ‘proposal-rejection’ chains.

7. Summary of articles

7.1. \fortocArticle [A]: Importance sampling type estimators…\excepttocArticle [A]

Convergence properties are established for Markov chain Monte Carlo (MCMC) algorithms using an additional importance sampling (IS) type correction of approximate sample output of the Markov chain. Included is the interesting case where the approximate chain itself stems from a pseudo-marginal chain. The asymptotic variance from the proven central limit theorems is shown to decouple over the approximate marginal chain and the IS correction, which can be useful for questions of optimal allocation of computational resources. Particular strengths of the approach are highlighted, such as the efficient use of a jump chain, thinning, and straightforward parallelisation. Abstract properties of the augmented Markov chains corresponding to the MCMC-IS method are established. Experiments in state space models compare the MCMC-IS method with existing popular direct methods, and show the viability of the MCMC-IS approach in the state space models context.

7.2. \fortocArticle [B]: Importance sampling correction versus…\excepttocArticle [B]

The asymptotic variance of the MCMC-IS is compared to that of the direct MCMC methods. This is based on an extension of the existing covariance comparison result for direct chains to the context of comparison of one MCMC-IS to one direct chain. The extension also allows for use of unbiased estimators in the MCMC-IS Phase 1 and 2, as well as the use of a jump chain. Provided examples show that there can be no strict ordering between MCMC-IS and direct MCMC, as either may perform arbitrarily better than the other. Theoretical results are provided, which show upper and lower bounds for the MCMC-IS asymptotic approach in relation to an analogous direct MCMC method. The upper bound is satisfied in practice when approximations are reasonably accurate, and provides guarantees for the MCMC-IS asymptotic variance in terms of direct pseudo-marginal and delayed acceptance analogues. In the latent variable setting, this is the case in the sense of finite supremum norm of the ratio of likelihoods. Ergodicity and mixing of the MCMC-IS is shown to be less affected by noise of the Phase 2 unbiased estimators compared to pseudomarginal direct MCMC. The results help justify the viability of the MCMC-IS approach as a competing method to a direct approach.

7.3. \fortocArticle [C]: Unbiased inference for hidden Markov model…\excepttocArticle [C]

The question of joint inference for a challenging class of state space models is considered, where the underlying process is a diffusion process arising as a solution to a stochastic differential equation, which can not be simulated exactly. Noisy non-linear observations are obtained at some discrete points in time. Bayesian inference is performed using the IS debiasing approach, where, namely, an IS type correction, based on debiased multilevel Monte Carlo, a particle filter coupling, and Euler approximations, is used for an approximate MCMC targeting a coarse-model approximate distribution. Convergence of the method to the exact posterior is verified under standard conditions on the state space model and Euler type approximations found in the literature. From asymptotic efficiency and cost considerations, suggested allocations for computational resources are given, which help ensure efficient use of the algorithm.

7.4. \fortocArticle [D]: On the use of ABC-MCMC…\excepttocArticle [D]

The use of a slightly inflated tolerance is suggested in the context of approximate Bayesian computation (ABC) MCMC, along with subsequent post-correction based on trimming or IS correction of the sample output, over a (continuous) range of decreasing tolerances. Approximate confidence intervals for the resulting estimators are provided, which enjoy theoretical backing as well as good coverage in the experiments considered. An adaptive ABC-MCMC is also proposed, which finds a suitable (inflated) tolerance based on acceptance rate as the proxy. Convergence theorems for the adaptation under simple and more general conditions are provided. The tolerance adaptation worked well when used together with proposal covariance adaptation, in experiments which confirmed the suitability of the method based on adaptive ABC-MCMC and post-correction.

References

[1]

C. Andrieu, A. Doucet, and R. Holenstein.

Particle Markov chain Monte Carlo methods.

J. R. Stat. Soc. Ser. B Stat. Methodol., 72(3):269–342, 2010.

(with discussion).

[2]

C. Andrieu, A. Doucet, S. Yıldırım, and N. Chopin.

On the utility of Metropolis-Hastings with asymmetric acceptance ratio.

Preprint arXiv:1803.09527, 2018.

[3]

C. Andrieu and É. Moulines.

On the ergodicity properties of some adaptive MCMC algorithms.

Ann. Appl. Probab., 16(3):1462–1505, 2006.

[4]

C. Andrieu, É. Moulines, and P. Priouret.

Stability of stochastic approximation under verifiable conditions.

SIAM J. Control Optim., 44(1):283–312, 2005.

[5]

C. Andrieu and G. O. Roberts.

The pseudo-marginal approach for efficient Monte Carlo computations.

Ann. Statist., 37(2):697–725, 2009.

[6]

C. Andrieu and J. Thoms.

A tutorial on adaptive MCMC.

Statist. Comput., 18(4):343–373, Dec. 2008.

[7]

C. Andrieu and M. Vihola.

Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms.

Ann. Appl. Probab., 25(2):1030–1077, 04 2015.

[8]

M. Banterle, C. Grazian, A. Lee, and C. P. Robert.

Accelerating Metropolis-Hastings algorithms by delayed acceptance.

Preprint arXiv:1503.00996v2, 2015.

[9]

M. A. Beaumont.

Estimation of population growth or decline in genetically monitored populations.

Genetics, 164:1139–1160, 2003.

[10]

A. Beskos, O. Papaspiliopoulos, and G. O. Roberts.

Retrospective exact simulation of diffusion sample paths with applications.

Bernoulli, pages 1077–1098, 2006.

[11]

A. Beskos, O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead.

Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes.

J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):333–382, 2006.

(with discussion).

[12]

A. Beskos and G. Roberts.

Exact simulation of diffusions.

15(4):2422–2444, 11 2005.

[13]

R. J. Boys, D. J. Wilkinson, and T. B. Kirkwood.

Bayesian inference for a discretely observed stochastic kinetic model.

18(2):125–135, 2008.

[14]

O. Cappé, E. Moulines, and T. Rydén.

Inference in Hidden Markov Models.

Springer, 2005.

[15]

N. Chopin.

Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference.

32(6):2385–2411, 2004.

[16]

J. A. Christen and C. Fox.

Markov chain Monte Carlo using an approximation.

J. Comput. Graph. Statist., 14(4), 2005.

[17]

T. Cui, Y. Marzouk, and K. Willcox.

Scalable posterior approximations for large-scale Bayesian inverse problems via likelihood-informed parameter and state reduction.

J. Comput. Phys., 315:363–387, 2016.

[18]

P. Del Moral.

Feynman-Kac Formulae.

Springer, 2004.

[19]

P. Del Moral.

Mean field simulation for Monte Carlo integration.

Chapman & Hall/CRC, 2013.

[20]

P. Del Moral, A. Jasra, K. J. H. Law, and Y. Zhou.

Multilevel sequential Monte Carlo samplers for normalizing constants.

ACM Trans. on Modeling and Computer Simulation (TOMACS), 27(3):20, 2017.

[21]

J.-F. Delmas and B. Jourdain.

Does waste recycling really improve the multi-proposal Metropolis–Hastings algorithm? an analysis based on control variates.

46(4):938–959, 2009.

[22]

S. Dereich.

Multilevel Monte Carlo algorithms for Lévy-driven SDEs with Gaussian correction.

21(1):283–311, 2011.

[23]

P. Diaconis.

The Markov chain Monte Carlo revolution.

Bull. Amer. Math. Soc., 46(2):179–205, 2009.

[24]

H. Doss.

Discussion: Markov chains for exploring posterior distributions.

Ann. Statist., 22(4):1728–1734, 1994.

[25]

R. Douc, O. Cappé, and E. Moulines.

Comparison of resampling schemes for particle filtering.

In Proc. Image and Signal Processing and Analysis, 2005, pages 64–69, 2005.

[26]

R. Douc and E. Moulines.

Limit theorems for weighted samples with applications to sequential Monte Carlo methods.

In ESAIM: Proceedings, volume 19, pages 101–107. EDP Sciences, 2007.

[27]

R. Douc, C. P. Robert, et al.

A vanilla Rao-Blackwellization of Metropolis-Hastings algorithms.

Ann. Statist., 39(1):261–277, 2011.

[28]

A. Doucet and A. Johansen.

A tutorial on particle filtering and smoothing: Fifteen years later.

In Handbook of nonlinear filtering, volume 12, pages 656–704. 2009.

[29]

A. Doucet, M. Pitt, G. Deligiannidis, and R. Kohn.

Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator.

Biometrika, 102(2):295–313, 2015.

[30]

B. Efron and T. Hastie.

Computer age statistical inference.

Cambridge, 2016.

[31]

P. Fearnhead, K. Łatuszyński, G. Roberts, and G. Sermaidis.

Continious-time importance sampling: Monte Carlo methods which avoid time-discretisation error.

Preprint arXiv:1712.06201, 2017.

[32]

J. M. Flegal and G. L. Jones.

Batch means and spectral variance estimators in Markov chain Monte Carlo.

Ann. Statist., 38(2):1034–1070, 2010.

[33]

A. Gelman, J. Carlin, H. Stern, and D. Rubin.

Bayesian Data Analysis.

Chapman & Hall/CRC, 1995.

[34]

H.-O. Georgii.

Stochastics: introduction to probability and statistics.

Walter de Gruyter, 2012.

[35]

M. Gerber and N. Chopin.

Sequential quasi-Monte Carlo.

J. R. Stat. Soc. Ser. B Stat. Methodol., 77(3):509–579, 2015.

[36]

M. B. Giles.

Multilevel Monte Carlo path simulation.

Oper. Res., 56(3):607–617, 2008.

[37]

W. Gilks and G. Roberts.

Strategies for improving MCMC.

In Markov chain Monte Carlo in practice, volume 6, pages 89–114. 1996.

[38]

P. W. Glynn and D. L. Iglehart.

Importance sampling for stochastic simulations.

Management Science, 35(11):1367–1392, 1989.

[39]

P. W. Glynn and C.-H. Rhee.

Exact estimation for Markov chain equilibrium expectations.

51(A):377–389, 2014.

[40]

P. W. Glynn and W. Whitt.

The asymptotic efficiency of simulation estimators.

Oper. Res., 40(3):505–520, 1992.

[41]

A. Golightly, D. Henderson, and C. Sherlock.

Delayed acceptance particle MCMC for exact inference in stochastic kinetic models.

Statist. Comput., 25, 2015.

[42]

A. Golightly and D. Wilkinson.

Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo.

Interface focus, 1(6):807–820, 2011.

[43]

F. Gonçalves, K. Łatuszyński, and G. Roberts.

Exact Monte Carlo likelihood-based inference for jump-diffusion processes.

Preprint arXiv:1707.00332, 2017.

[44]

N. J. Gordon, D. J. Salmond, and A. F. M. Smith.

Novel approach to nonlinear/non-Gaussian Bayesian state estimation.

IEE Proceedings-F, 140(2):107–113, 1993.

[45]

P. Guarniero, A. Johansen, and A. Lee.

The iterated auxiliary particle filter.

J. Amer. Statist. Assoc., 112(520):1636–1647, 2017.

[46]

H. Haario, E. Saksman, and J. Tamminen.

An adaptive Metropolis algorithm.

Bernoulli, 7(2):223–242, 2001.

[47]

F. Harris.

On the use of windows for harmonic analysis with the discrete Fourier transform.

Proc. IEEE, 66(1):51–83, 1978.

[48]

W. K. Hastings.

Monte Carlo sampling methods using Markov chains and their applications.

Biometrika, 57(1):97–109, Apr. 1970.

[49]

S. Heinrich.

Multilevel Monte Carlo methods.

In Large-scale scientific computing, pages 58–67. Springer, 2001.

[50]

J. Heng, A. Bishop, G. Deligiannidis, and A. Doucet.

Controlled sequential Monte Carlo.

Preprint arXiv:1708.08396v2, 2017.

[51]

T. Hesterberg.

Weighted average importance sampling and defensive mixture distributions.

Technometrics, 37(2):185–194, 1995.

[52]

S. F. Jarner and E. Hansen.

Geometric ergodicity of Metropolis algorithms.

Stochastic Process. Appl., 85(2):341–361, 2000.

[53]

A. Jasra, K. Kamatani, K. J. Law, and Y. Zhou.

Bayesian static parameter estimation for partially observed diffusions via multilevel Monte Carlo.

SIAM J. Sci. Comp., 40:A887–A902, 2018.

[54]

A. Jasra, K. J. Law, and P. P. Osei.

Multilevel particle filters for Lévy-driven stochastic differential equations.

Preprint arXiv:1804.04444, 2018.

[55]

A. Jasra and F. Yu.

Central limit theorems for coupled particle filters.

Preprint arXiv:1810.04900, 2018.

[56]

P. Kloeden and E. Platen.

Numerical solution of stochastic differential equations.

Springer, 3rd edition, 1999.

[57]

A. Lee, S. Singh, and M. Vihola.

Coupled conditional backward sampling particle filter.

(arXiv:1806.05852), 2018.

[58]

A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes.

On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods.

J. Comput. Graph. Statist., 19(4):769–789, 2010.

[59]

L. Lin, K. Liu, and J. Sloan.

A noisy Monte Carlo algorithm.

Phys. Rev. D, 61, 2000.

[60]

F. Lindsten, J. Helske, and M. Vihola.

Graphical model inference: Sequential Monte Carlo meets deterministic approximations.

In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8190–8200. Curran Associates, Inc., 2018.

[61]

J. S. Liu.

Monte Carlo strategies in scientific computing.

Springer-Verlag, New York, 2003.

[62]

P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré.

Markov chain Monte Carlo without likelihoods.

Proc. Natl. Acad. Sci. USA, 100(26):15324–15328, 2003.

[63]

P. Mathé and E. Novak.

Simple Monte Carlo and the Metropolis algorithm.

J. Complexity, 23(4-6):673–696, 2007.

[64]

D. McLeish.

A general method for debiasing a Monte Carlo estimator.

Monte Carlo Methods Appl., 17(4):301–315, 2011.

[65]

N. Metropolis, A. Rosenbluth, M. Rosenbluth, and A. Teller.

Equation of state calculations by fast computing machines.

Chem. Phys., 21(6), 1953.

[66]

S. Meyn and R. L. Tweedie.

Markov chains and stochastic stability.

Cambridge University Press, second edition, 2009.

[67]

A. Mira and C. Geyer.

Ordering Monte Carlo Markov Chains.

Technical report, School of Statistics, University of Minnesota, 1999.

[68]

A. Mira and F. Leisen.

Covariance ordering for discrete and continuous time Markov chains.

Statistica Sinica, pages 651–666, 2009.

[69]

J. Møller, A. Pettitt, R. Reeves, and K. Berthelsen.

An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants.

Biometrika, 93(2):451–458, 2006.

[70]

J. K. Møller and H. Madsen.

From state dependent diffusion to constant diffusion in stochastic differential equations by the Lamperti transform.

IMM-Technical Report-2010-16, 2010.

[71]

R. M. Neal.

Annealed importance sampling.

Statist. Comput., 11(2):125–139, 2001.

[72]

A. Owen.

Statistically efficient thinning of a Markov chain sampler.

J. Comput. Graph. Statist., (just-accepted), 2017.

[73]

P. Parpas, B. Ustun, M. Webster, and Q. K. Tran.

Importance sampling in stochastic programming: A Markov chain Monte Carlo approach.

INFORMS J. Comput., 27(2):358–377, 2015.

[74]

P. H. Peskun.

Optimum Monte-Carlo sampling using Markov chains.

Biometrika, 60(3):607–612, 1973.

[75]

M. Pitt and N. Shephard.

Filtering via simulation: auxiliary particle filters.

J. Amer. Statist. Assoc., 94(446):590–599, 1999.

[76]

D. Prangle.

Summary statistics.

In S. Sisson, Y. Fan, and M. Beaumont, editors, Handbook of Markov chain Monte Carlo, pages 125–152. Chapman & Hall/CRC, 2018.

[77]

C.-H. Rhee and P. W. Glynn.

Unbiased estimation with square root convergence for SDE models.

Oper. Res., 63(5):1026–1043, 2015.

[78]

G. Roberts and R. Tweedie.

Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms.

Biometrika, 83(1):95–110, 1996.

[79]

V. Roy, A. Tan, and J. M. Flegal.

Estimating standard errors for importance sampling estimators with multiple Markov chains.

Preprint arXiv:1509.06310, 2015.

[80]

W. Rudin.

Principles of mathematical analysis.

McGraw-Hill, 3rd edition, 1976.

[81]

D. Rudolf.

Explicit error bounds for Markov chain Monte Carlo.

Preprint arXiv:1108.3201, 2011.

[82]

D. Rudolf and B. Sprungk.

On a Metropolis-Hastings importance sampling estimator.

Preprint arXiv:1805.07174, 2018.

[83]

H. Rue, S. Martino, and N. Chopin.

Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.

J. R. Stat. Soc. Ser. B Stat. Methodol., 71(2):319–392, 2009.

(with discussion).

[84]

S. Särkkä.

Bayesian filtering and smoothing.

Cambridge University Press, 2013.

[85]

I. Schuster and I. Klebanov.

Markov chain importance sampling - a highly efficient estimator for MCMC.

Preprint arXiv:1805.07179, 2018.

[86]

T. Schwedes and B. Calderhead.

Quasi Markov chain Monte Carlo methods.

Preprint arXiv:1807.00070, 2018.

[87]

G. Sermaidis, O. Papaspiliopoulos, G. Roberts, A. Beskos, and P. Fearnhead.

Markov chain Monte Carlo for exact inference for diffusions.

Scand. J. Statist, 40(2):294–321, 2013.

[88]

C. Sherlock and A. Lee.

Variance bounding of delayed-acceptance kernels.

Preprint arXiv:1706.02142, 2017.

[89]

C. Sherlock, A. H. Thiery, G. O. Roberts, and J. S. Rosenthal.

On the efficiency of pseudo-marginal random walk Metropolis algorithms.

Ann. Statist., 43(1):238–275, 2015.

[90]

A. N. Shiryaev.

Probability.

Springer-Verlag, New York, second edition, 1996.

[91]

S. Sisson and Y. Fan.

ABC samplers.

In S. Sisson, Y. Fan, and M. Beaumont, editors, Handbook of Markov chain Monte Carlo. Chapman & Hall/CRC, 2018.

[92]

S. A. Sisson, Y. Fan, and M. Beaumont.

Handbook of approximate Bayesian computation.

Chapman & Hall/CRC, 2018.

[93]

A. Solonen, P. Ollinaho, M. Laine, H. Haario, J. Tamminen, and H. Järvinen.

Efficient MCMC for climate model parameter estimation: parallel adaptive chains and early rejection.

Bayesian Analysis, 7(3):715–736, 2012.

[94]

L. Stewart and P. McCarty.

Use of Bayesian belief networks to fuse continuous and discrete information for target recognition, tracking, and situation assessment.

In Signal Processing, Sensor Fusion, and Target Recognition, volume 1699, pages 177–186, 1992.

[95]

L. Tierney.

A note on Metropolis-Hastings kernels for general state spaces.

Ann. Appl. Probab., 8(1):1–9, 1998.

[96]

M. Vihola.

Unbiased estimators and multilevel Monte Carlo.

Oper. Res., 66(2):448–462, 2018.

[97]

Q. Wang, V. Rao, and Y. W. Teh.

An exact auxiliary variable Gibbs sampler for a class of diffusions.

Preprint arXiv:1903.10659, 2019.

[98]

D. Wegmann, C. Leuenberger, and L. Excoffier.

Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihoods.

Genetics, 182(4):1207–1218, 2009.

[99]

Z. Zheng, J. Blanchet, and P. Glynn.

Rates of convergence and CLTs for subcanonical debiased MLMC.

In A. Owen and P. Glynn, editors, Monte Carlo and quasi-Monte Carlo methods. Springer Proc. Math. Stat., 2018.

[100]

K. Łatuszyński, B. Miasojedow, and W. Niemiro.

Nonasymptotic bounds on the estimation error of MCMC algorithms.

Bernoulli, 19(5A):2033–2066, 11 2013.

Bibliography104

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint ar Xiv:1609.02541 v 5, 2016.
2[2] J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCM Cs in terms of the asymptotic variance. Preprint ar Xiv:1706.09873 v 4, 2017.
3[3] J. Franks, A. Jasra, K. J. H. Law and M. Vihola. Unbiased inference for discretely observed hidden Markov model diffusions. Preprint ar Xiv:1807.10259 v 4, 2018.
4[4] M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint ar Xiv:1902.00412, 2019.
5[1] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol. , 72(3):269–342, 2010. (with discussion).
6[2] C. Andrieu, A. Doucet, S. Yıldırım, and N. Chopin. On the utility of Metropolis-Hastings with asymmetric acceptance ratio. Preprint ar Xiv:1803.09527, 2018.
7[3] C. Andrieu and É. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab. , 16(3):1462–1505, 2006.
8[4] C. Andrieu, É. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. , 44(1):283–312, 2005.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Abstract

Contents

List of symbols

Foreword

List of included articles

1. Introduction

1.1. Likelihoods

1.2. Bayesian inference

1.3. Challenges for inference

1.4. Approximate families

1.5. Overview

2. Bayesian inference for state space models on path space

2.1. Discretely-observed continuous-time path-dependent process

2.2. Model probabilities

2.3. Particle filter

2.4. Particle marginal Metropolis-Hastings

3. Accelerations based on an approximation

3.1. Delayed acceptance and importance sampling

3.2. The question of relative efficiency

3.3. Peskun and covariance orderings of asymptotic variances

3.4. Peskun type ordering for importance sampling correction

3.5. Comparison results

4. Bayesian inference for state space models with diffusion dynamics

4.1. Euler approximations

4.2. Multilevel Monte Carlo

4.3. Coupling of Feynman-Kac models

4.4. Debiasing techniques

4.5. Joint inference using importance sampling type correction

4.6. Computational efficiency and allocations

5. Inference via approximate Bayesian computation

5.1. Choosing the tolerance in ABC-MCMC

5.2. Approximate confidence intervals

5.3. Adaptive ABC-MCMC with post-correction

5.4. Convergence of the tolerance adaptive ABC-MCMC

6. Discussion and directions for future work

7. Summary of articles

7.1. \fortocArticle [A]: Importance sampling type estimators…\excepttocArticle [A]

7.2. \fortocArticle [B]: Importance sampling correction versus…\excepttocArticle [B]

7.3. \fortocArticle [C]: Unbiased inference for hidden Markov model…\excepttocArticle [C]

7.4. \fortocArticle [D]: On the use of ABC-MCMC…\excepttocArticle [D]

References