Variance bounding of delayed-acceptance kernels

Chris Sherlock; Anthony Lee

arXiv:1706.02142·math.ST·November 12, 2021

TL;DR

This paper investigates conditions under which delayed-acceptance Metropolis-Hastings algorithms inherit variance bounding properties from their parent kernels, improving computational efficiency in Bayesian inference.

Contribution

It provides sufficient conditions for delayed-acceptance kernels to inherit variance bounding, enhancing understanding of their efficiency in computationally expensive Bayesian inference.

Findings

1. Delayed-acceptance kernels can inherit variance bounding from their parent kernels under certain conditions.

2. For non-local algorithms, such as the independence sampler, a bounded discrepancy between the approximate and true log densities ensures inheritance.

3. Sufficient conditions are given under which a pair of proposals preserves the variance bounding property.

Tables

Table 1: Whether or not the DA algorithm for $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ using $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$ is variance bounding as a function of $\gamma$ and $\beta$ and the specific DA algorithm. The final two columns indicate that $\hat{\pi}$ rather than $\pi$ is used to create the proposal.

Algorithm	MALA	RWM/TMALA	MALA ( $\nabla\log\hat{\pi}$ )	TMALA ( $\nabla\log\hat{\pi}$ )
$1\leq\gamma<\beta\leq 2$	$\times$ Ex 6	✓ Ex 5	✓ Ex 8	✓ Ex 7
$1\leq\beta<\gamma$	$\times$ Ex 6	$\times$ Ex 9	$\times$ Ex 8	$\times$ Ex 9

Equations

$$\mbox{{var}}(h,P):=\lim_{n\rightarrow\infty}n\,\mbox{Var}\left[\frac{1}{n}\sum_{i=1}^{n}h(X_{i})\right],$$

$$|P^{n}(x,\mathcal{A})-\pi(\mathcal{A})|\leq M(x)\rho^{n}$$

$$\mathsf{r}(x,y):=\frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}.$$

$$\mathsf{P}(x,\mbox{d}y):=q(x,y)\mbox{d}y\,\alpha(x,y)+[1-\overline{\alpha}(x)]\delta_{x}(\mbox{d}y).$$

$$\mathsf{r}_{1}(x,y):=\frac{\hat{\pi}(y)q(y,x)}{\hat{\pi}(x)q(x,y)}\quad\mbox{and}\quad\mathsf{r}_{2}(x,y):=\frac{\pi(y)/\hat{\pi}(y)}{\pi(x)/\hat{\pi}(x)}.$$

$$\tilde{\mathsf{P}}(x,\mbox{d}y):=q(x,y)\mbox{d}y\,\tilde{\alpha}(x,y)+[1-\overline{\widetilde{\alpha}}(x)]\delta_{x}(\mbox{d}y).$$

$$r_{1}:=\inf_{f\in L_{0}^{2}(\pi)}\frac{\langle f,Pf\rangle}{\langle f,f\rangle}\geq-1\quad\mbox{and}\quad r_{2}:=\sup_{f\in L_{0}^{2}(\pi)}\frac{\langle f,Pf\rangle}{\langle f,f\rangle}\leq 1,$$

$$\kappa(\mathcal{A}):=\frac{1}{\pi(\mathcal{A})}\int_{\mathcal{A}}P(x,\mathcal{A}^{c})\pi(\mbox{d}x).$$

$$\kappa:=\inf_{\mathcal{A}:0<\pi(\mathcal{A})\leq 1/2}\kappa(\mathcal{A}).$$

$$\begin{aligned}P\mbox{ is variance bounding}&\Leftrightarrow P\mbox{ has a positive conductance (Thrm 2.1 of Lawler and Sokal, 1988)}\\&\Leftarrow P\mbox{ has a spectral gap}\\&\Leftrightarrow P\mbox{ is geometrically ergodic (Thrm 2.1 of Roberts and Rosenthal, 1997).}\end{aligned}$$

$$q(x,y)=\mathsf{N}\left(y;x+\frac{1}{2}\lambda^{2}R(x),\lambda^{2}I\right),~\mbox{where}~R(x)=\frac{D\nabla\log\pi}{D\vee\left|\left|{\nabla\log\pi}\right|\right|},$$

$$\tilde{\mathsf{P}}\mbox{ is geometrically ergodic}\iff\tilde{\mathsf{P}}\mbox{ is variance bounding}.$$

$$\int_{\mathcal{D}(x)^{\complement}}q_{A}(x,y)\alpha_{A}(x,y)\mbox{d}y\leq\epsilon,$$

$$y\in\mathcal{D}(x)\Rightarrow q_{B}(x,y)\alpha_{B}(x,y)\geq\delta q_{A}(x,y)\alpha_{A}(x,y),$$

$$\exists~r<\infty~\mbox{such that for all}~x\in\mathcal{X},~\int_{B(x,r)^{c}}q(x,y)\mbox{d}y<\epsilon.$$

$$\Delta(x,y;q):=\log q(y,x)-\log q(x,y).$$

$$|\Delta(x,y;q_{B})-\Delta(x,y;q_{A})|\leq h(\left|\left|{y-x}\right|\right|),$$

$$\tilde{\alpha}(x,y)$$

$$\mbox{or}~\tilde{\alpha}(x,y)$$

$$\mathcal{M}_{m}(x)$$

$$\mathcal{C}(x)$$

$$\mathcal{M}_{m}(x)\cap\mathcal{D}(x)\subseteq\mathcal{C}(x),$$

$$\int_{\mathcal{D}(x)^{c}}q(x,y)\mbox{d}y\leq\epsilon.$$

$$\exists~r_{*}>0~\mbox{such that if}~\left|\left|{x}\right|\right|>r_{*}~\mbox{and}~\left|\left|{y}\right|\right|>r_{*}~\mbox{then}~\hat{\pi}(x)\leq\hat{\pi}(y)\Rightarrow\frac{\hat{\pi}(x)}{\pi(x)}\geq\frac{\hat{\pi}(y)}{\pi(y)}.$$

$$\{y\in\mathcal{X}:|\log\mathsf{r}_{2}(x,y)|>h(\left|\left|{y-x}\right|\right|)\}\subseteq\mathcal{C}(x).$$

$$\{y\in\mathcal{D}(x):|\Delta(x,y;q)|>h(\left|\left|{y-x}\right|\right|)\}\subseteq\mathcal{C}(x).$$

$$\mathsf{r}_{1b}(x,y):=\frac{\hat{\pi}(y)\hat{q}(y,x)}{\hat{\pi}(x)\hat{q}(x,y)}.$$

$$\tilde{\mathsf{P}}_{b}(x,\mbox{d}y):=\hat{q}(x,y)\mbox{d}y\,\tilde{\alpha}_{b}(x,y)+[1-\overline{\alpha}_{b}(x)]\delta_{x}(\mbox{d}y).$$

$$\mathsf{P}_{hyp}(x,\mbox{d}y):=\hat{q}(x,y)\mbox{d}y\,\alpha_{hyp}(x,y)+[1-\overline{\alpha}_{hyp}(x)]\delta_{x}(\mbox{d}y).$$

$$\pi(x)\propto\exp(-\left|\left|{x}\right|\right|^{\beta})\quad\mbox{and}\quad\hat{\pi}(x)\propto\exp(-\left|\left|{x}\right|\right|^{\gamma}/\kappa^{\gamma}).$$

Full text

Variance bounding of delayed-acceptance kernels

Chris Sherlock$^{1}$ and Anthony Lee$^{2}$

$^{1}$ Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK; ORCID: 0000-0002-2429-3157.

$^{2}$ School of Mathematics, Fry Building, University of Bristol, Bristol, BS8 1UG, UK; ORCID: 0000-0001-7765-0616.

Abstract

A delayed-acceptance version of a Metropolis–Hastings algorithm can be useful for Bayesian inference when it is computationally expensive to calculate the true posterior, but a computationally cheap approximation is available; the delayed-acceptance kernel targets the same posterior as its associated “parent” Metropolis-Hastings kernel. Although the asymptotic variance of the ergodic average of any functional of the delayed-acceptance chain cannot be less than that obtained using its parent, the average computational time per iteration can be much smaller and so for a given computational budget the delayed-acceptance kernel can be more efficient.

When the asymptotic variance of the ergodic averages of all $L^{2}$ functionals of the chain is finite, the kernel is said to be variance bounding. It has recently been noted that a delayed-acceptance kernel need not be variance bounding even when its parent is. We provide sufficient conditions for inheritance: for non-local algorithms, such as the independence sampler, the discrepancy between the log density of the approximation and that of the truth should be bounded; for local algorithms, two alternative sets of conditions are provided.

As a by-product of our initial, general result we also supply sufficient conditions on any pair of proposals such that, for any shared target distribution, if a Metropolis-Hastings kernel using one of the proposals is variance bounding then so is the Metropolis-Hastings kernel using the other proposal.

Keywords: Metropolis-Hastings; delayed-acceptance; variance bounding; conductance; geometric ergodicity.

AMS: Primary: 60J10; Secondary: 65C40; 47A10

Declarations: funding - none; conflicts of interest - none; availability of data and material - n/a; code availability - R code to produce the plots in Section 5 is available from https://chrisgsherlock.github.io/Research/publications.html.

1 Introduction

The Metropolis-Hastings (MH) algorithm is widely used to approximately compute expectations with respect to complicated high-dimensional posterior distributions (e.g. Gilks et al., 1996; Geyer, 2011). The algorithm requires that it be possible to evaluate point-wise the density of the distribution of interest (throughout this article, all densities are with respect to Lebesgue measure) up to an arbitrary constant of proportionality.

In many problems the target posterior density is computationally expensive to evaluate. When a computationally cheap approximation, or surrogate, is available, the delayed-acceptance Metropolis-Hastings (DAMH) algorithm (Liu, 2001; Christen and Fox, 2005; Higdon et al., 2011; also known as the two-stage algorithm, and a special case of the surrogate-transition method) leverages the surrogate to produce a new Markov chain that still targets the original distribution of interest. A first ‘screening’ stage substitutes the surrogate density for the true density in the standard formula for the MH acceptance probability; proposals which fail at this stage are discarded. Only proposals that pass the first stage are considered in the second ‘correction’ stage, where it is necessary to evaluate the true posterior density at the proposed value.

Delayed acceptance (DA) algorithms have been applied in a variety of settings with the approximate density obtained in a variety of different ways, for example: a coarsening of a numerical grid in Bayesian inverse problems (Christen and Fox, 2005; Moulton et al., 2008; Cui et al., 2011), subsampling from big data (Payne and Mallick, 2014; Banterle et al., 2019; Quiroz et al., 2018), a tractable approximation to a stochastic process (Smith, 2011; Golightly et al., 2015), or a direct, nearest-neighbour approximation to the truth using previous values (Sherlock et al., 2017).

For a Markov kernel, $P$ , with a stationary distribution of $\pi$ , and an associated chain $\{X_{t}\}_{t=1}^{\infty}$ , the asymptotic variance of any functional, $h$ , is defined to be

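$$\mbox{{var}}(h,P):=\lim_{n\rightarrow\infty}n\,\mbox{Var}\left[\frac{1}{n}\sum_{i=1}^{n}h(X_{i})\right],$$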

where $X_{1}\sim\pi$ . A lower asymptotic variance is thus associated, in practice, with a greater accuracy in estimating $\mathbb{E}_{{\pi}}\left[{h(X)}\right]$ using a realisation of length $n\gg 1$ from the distribution of the chain. In terms of the asymptotic variance of any functional of the chain, the DAMH kernel cannot be more efficient than the parent MH kernel; however the computational cost per iteration is, typically, reduced considerably. The almost-negligible computational cost of the screening stage also, typically, facilitates proposals that have a larger chance of being rejected than the MH proposal, but where the pay-off on acceptance is so much larger that the expected overall movement per unit of time increases. When efficiency is measured in terms of effective samples per second, gains of over an order of magnitude have been reported (e.g. Golightly et al., 2015).

A Markov kernel $P$ with a stationary distribution of $\pi$ is termed variance bounding if $\mbox{{var}}(h,P)<\infty$ for all $h\in L^{2}(\pi)$ , the Hilbert space of functions that are square-integrable with respect to $\pi$ . Equivalently there exists $K<\infty$ such that $\mbox{{var}}(h,P)\leq K\mbox{Var}_{\pi}[h]$ for all such $h$ . This property was named and studied in Roberts and Rosenthal (2008), where it was shown to be equivalent to the existence of a ‘usual’ central limit theorem (CLT); that is, a CLT where the limiting variance is the asymptotic variance.

Intuitively, the variance-bounding property embodies desirable behaviour for a chain started at equilibrium. In practice, the chain is not started at equilibrium, but asymptotically the bias that results from this is negligible compared with the variance. An alternative natural requirement is that the chain converge to equilibrium geometrically quickly (rather than, say, polynomially quickly). A Markov chain kernel, $P$ , with stationary distribution $\pi$ is geometrically ergodic (e.g. Roberts and Rosenthal, 1997, 2004; Meyn and Tweedie, 1993, Chapter 15) if there exist $\rho\in(0,1)$ and $M:\mathcal{X}\rightarrow[0,\infty)$ that is finite $\pi$ -almost everywhere, such that

$$|P^{n}(x,\mathcal{A})-\pi(\mathcal{A})|\leq M(x)\rho^{n}$$

for all $\mathcal{A}\in\mathcal{F},~x\in\mathcal{X}$ and $n\in\mathbb{N}$ , where $P^{n}$ denotes the $n$ -step transition kernel.

Although the motivations behind the definitions of variance bounding and geometric ergodicity, mixing at equilibrium and convergence to equilibrium, are quite different, for a large class of algorithms, including those studied in this article, these two properties are very closely linked as we will describe in Section 2.2. Indeed, for delayed-acceptance algorithms, under weak conditions the two properties are equivalent (see Proposition 1).

Theoretical properties of the efficiency of delayed-acceptance algorithms have been studied in Banterle et al. (2019), Sherlock et al. (2021) and Franks and Vihola (2020). The first contribution from Banterle et al. (2019) is an example delayed-acceptance algorithm which fails to inherit geometric ergodicity from its parent Metropolis-Hastings algorithm (see Example 1 in Section 2.3 of this article); a simple sufficient condition for inheritance of geometric ergodicity, uniformly good behaviour of the ratios $\mathsf{r}_{1}$ and $\mathsf{r}_{2}$ that we define in (3), is also supplied. Finally, an idealised setting where the cheap approximation is perfectly accurate is explored to obtain tuning guidelines for the proposal scale parameter, $\lambda$ , in the delayed-acceptance random walk Metropolis algorithm. Sherlock et al. (2021) examines this tuning issue further, proving a limiting diffusion for the first component of the delayed-acceptance Markov chain, and providing robust tuning guidelines that account for the error in the cheap approximation; the article then extends these guidelines to the pseudo-marginal version of the algorithm. Finally, Franks and Vihola (2020) compares the asymptotic variance of a general pseudo-marginal delayed-acceptance algorithm with the variance of an algorithm that applies importance-sampling to the output of an MCMC algorithm targeting the cheap approximation directly.

Using our Proposition 1, the lack of inheritance of geometric ergodicity in the example in Banterle et al. (2019) is equivalent to a lack of inheritance of the variance bounding property: even though the asymptotic variance using the parent MH kernel is finite for all $h\in L^{2}(\pi)$ , there exist $h\in L^{2}(\pi)$ for which the asymptotic variance using the DA kernel is infinite. For such $h$ , estimated quantities such as effective sample size (e.g. Hoff, 2009) are invalid and, consequently, standard CLT-based intuitions about the sizes of typical errors in estimates of $\mathbb{E}_{\pi}[h]$ from the chain do not hold.

We investigate the conditions under which a DAMH kernel inherits variance bounding from its MH parent and, as a by-product, discover conditions under which two different proposals produce MH kernels that are equivalent in terms of whether or not they are variance bounding. Section 2 provides the background and two motivating examples, while Section 3 provides some key definitions, a general inheritance result applicable to all propose-accept-reject kernels, and sufficient conditions for variance-bounding equivalence between two Metropolis-Hastings proposals. Section 4 contains our results for standard DA algorithms with further illustrative examples, and includes parent MH algorithms where the proposal depends upon the form of the density, so that the proposal for a computationally cheap DA kernel would naturally depend on the surrogate. Numerical experiments are performed in Section 5 and the article concludes with a discussion. All proofs are deferred to Appendix A.

2 Background, notation and motivation

Throughout this article all Markov chains are assumed to be on a state space $(\mathcal{X},\mathcal{F})$ , with $\mathcal{X}\subseteq\mathbb{R}^{d}$ Lebesgue measurable, and $\mathcal{F}$ the $\sigma$ -algebra of all Lebesgue-measurable sets in $\mathcal{X}$ . The target and surrogate distributions are denoted by $\pi$ and $\hat{\pi}$ , respectively, and they are assumed to have densities $\pi(x)$ and $\hat{\pi}(x)$ with respect to Lebesgue measure.

2.1 Metropolis-Hastings and delayed-acceptance kernels

The Metropolis-Hastings kernel has a proposal density $q(x,y)$ and an acceptance probability $\alpha(x,y)=1\wedge\mathsf{r}(x,y)$ where

$$\mathsf{r}(x,y):=\frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}.$$

With $\overline{\alpha}(x):=\int\alpha(x,y)q(x,y)\mbox{d}y$ , the Metropolis-Hastings (MH) kernel is then

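$$\mathsf{P}(x,\mbox{d}y):=q(x,y)\mbox{d}y\,\alpha(x,y)+[1-\overline{\alpha}(x)]\delta_{x}(\mbox{d}y).\tag{2}$$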

An iteration of the corresponding MH algorithm proceeds from a current value, $x$ , to the next value, $y$ , as follows. A value $x^{\prime}$ is sampled from the distribution with a density of $q(x,x^{\prime})$ . With a probability of $\alpha(x,x^{\prime})$ , $y\leftarrow x^{\prime}$ , else $y\leftarrow x$ .
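The iteration just described can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function name `mh_step`, its arguments and the Gaussian random-walk example (including the scale `lam = 2.4`) are assumptions introduced here, and densities are handled on the log scale for numerical stability.

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings iteration from x (hypothetical helper).

    log_pi(x): log target density, up to an additive constant.
    propose(x, rng): draws x' from the proposal q(x, .).
    log_q(x, y): log proposal density for a move from x to y.
    """
    x_prop = propose(x, rng)
    # log r(x, x') = log[pi(x') q(x', x)] - log[pi(x) q(x, x')]
    log_r = (log_pi(x_prop) + log_q(x_prop, x)) - (log_pi(x) + log_q(x, x_prop))
    # Accept with probability alpha(x, x') = 1 ∧ r(x, x').
    if np.log(rng.uniform()) < min(0.0, log_r):
        return x_prop
    return x

# Illustrative use: Gaussian random walk proposal, standard normal target.
lam = 2.4  # assumed scale parameter
rng = np.random.default_rng(1)
log_pi = lambda x: -0.5 * x**2
propose = lambda x, rng: x + lam * rng.normal()
log_q = lambda x, y: -0.5 * ((y - x) / lam) ** 2  # symmetric, so it cancels in r

x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    x = mh_step(x, log_pi, propose, log_q, rng)
    chain[t] = x
```

Passing `log_q` explicitly keeps the sketch valid for non-symmetric proposals such as the MALA, at the cost of two redundant evaluations per iteration when the proposal is symmetric.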

Now, suppose that we have an approximation, $\hat{\pi}(x)$ , to $\pi(x)$ . The standard delayed-acceptance kernel uses the same proposal, $q(x,y)$ , but has an acceptance probability of $\tilde{\alpha}(x,y)=[1\wedge\mathsf{r}_{1}(x,y)][1\wedge\mathsf{r}_{2}(x,y)]$ , where

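$$\mathsf{r}_{1}(x,y):=\frac{\hat{\pi}(y)q(y,x)}{\hat{\pi}(x)q(x,y)}\quad\mbox{and}\quad\mathsf{r}_{2}(x,y):=\frac{\pi(y)/\hat{\pi}(y)}{\pi(x)/\hat{\pi}(x)}.\tag{3}$$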

With $\overline{\widetilde{\alpha}}(x):=\int\tilde{\alpha}(x,y)q(x,y)\mbox{d}y$ , the delayed-acceptance (DA) kernel is

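$$\tilde{\mathsf{P}}(x,\mbox{d}y):=q(x,y)\mbox{d}y\,\tilde{\alpha}(x,y)+[1-\overline{\widetilde{\alpha}}(x)]\delta_{x}(\mbox{d}y).\tag{4}$$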

An iteration of the corresponding DA algorithm proceeds from a current value, $x$ , to a next value, $y$ , as follows.

Stage One:

A value $x^{\prime}$ is sampled from the distribution with a density of $q(x,x^{\prime})$ . With a probability of $1\wedge\mathsf{r}_{1}(x,x^{\prime})$ the algorithm proceeds to Stage Two, else $y\leftarrow x$ .

Stage Two:

With a probability of $1\wedge\mathsf{r}_{2}(x,x^{\prime})$ , $y\leftarrow x^{\prime}$ , else $y\leftarrow x$ .
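The two stages can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: `da_step`, the cached log-density arguments `lp_x` and `lph_x`, the call counter, and the choice of a $\mathsf{N}(0,1.5^{2})$ surrogate for a standard normal target are all introduced here for illustration. The point of the sketch is that the expensive `log_pi` is only evaluated for proposals that survive Stage One.

```python
import numpy as np

def da_step(x, lp_x, lph_x, log_pi, log_pi_hat, propose, log_q, rng):
    """One delayed-acceptance iteration (hypothetical helper).

    lp_x, lph_x: cached values of log_pi(x) and log_pi_hat(x).
    Returns the next state together with its cached log densities.
    """
    x_prop = propose(x, rng)
    lph_prop = log_pi_hat(x_prop)                    # cheap surrogate only
    # Stage One: screen with r1, which involves pi-hat and q but not pi.
    log_r1 = (lph_prop + log_q(x_prop, x)) - (lph_x + log_q(x, x_prop))
    if np.log(rng.uniform()) >= min(0.0, log_r1):
        return x, lp_x, lph_x                        # log_pi was never called
    # Stage Two: correct with r2, which needs the expensive log_pi(x').
    lp_prop = log_pi(x_prop)
    log_r2 = (lp_prop - lph_prop) - (lp_x - lph_x)
    if np.log(rng.uniform()) < min(0.0, log_r2):
        return x_prop, lp_prop, lph_prop
    return x, lp_x, lph_x

# Illustrative use: count how often the "expensive" density is evaluated.
calls = {"log_pi": 0}
def log_pi(x):
    calls["log_pi"] += 1          # stands in for an expensive computation
    return -0.5 * x**2
log_pi_hat = lambda x: -0.5 * x**2 / 1.5**2          # assumed surrogate
lam = 2.4                                            # assumed scale parameter
propose = lambda x, rng: x + lam * rng.normal()
log_q = lambda x, y: -0.5 * ((y - x) / lam) ** 2

rng = np.random.default_rng(2)
x = 0.0
lp, lph = log_pi(x), log_pi_hat(x)
calls["log_pi"] = 0               # do not count the initialisation
n_iter = 20000
chain = np.empty(n_iter)
for t in range(n_iter):
    x, lp, lph = da_step(x, lp, lph, log_pi, log_pi_hat, propose, log_q, rng)
    chain[t] = x
```

After the loop, `calls["log_pi"]` is the number of Stage Two evaluations, which can be compared with `n_iter`, the number of evaluations a plain MH chain of the same length would need.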

Now, $\tilde{\alpha}(x,y)\leq\alpha(x,y)$ , and so $\mbox{{var}}(h,\tilde{\mathsf{P}})\geq\mbox{{var}}(h,\mathsf{P})$ for each $h\in L^{2}(\pi)$ (Peskun, 1973; Tierney, 1998). At first glance this might suggest that the DA algorithm is never worthwhile; however for any proposal that is rejected at Stage One there is no need to complete the expensive calculation of $\pi(x^{\prime})$ that is required at every iteration of the MH algorithm and in Stage Two of the DA algorithm. As mentioned in the Introduction, for a fixed computational time, the decreased average computational cost per iteration, and alterations of any tuning parameters to take advantage of this, can lead to a DA algorithm where the variance of an estimator can be over an order of magnitude smaller than that of the MH algorithm.

Since $\mbox{{var}}(h,\tilde{\mathsf{P}})\geq\mbox{{var}}(h,\mathsf{P})$ , if $\tilde{\mathsf{P}}$ is variance bounding then so is $\mathsf{P}$ ; however it is feasible that $\mathsf{P}$ may be variance bounding while $\tilde{\mathsf{P}}$ is not.

2.2 Key terminology, equivalences and implications

The MH and DA kernels are both reversible with respect to the target. A kernel $P$ is reversible with respect to a distribution $\pi$ iff for all $\mathcal{A}\in\mathcal{F}$ and $\mathcal{B}\in\mathcal{F}$ , $\int_{\mathcal{A}}\pi(\mbox{d}x){P}(x,\mathcal{B})=\int_{\mathcal{B}}\pi(\mbox{d}x){P}(x,\mathcal{A}).$ This article utilises a number of existing results for reversible Markov chains on the relationship between variance bounding, conductance, spectral gaps and geometric ergodicity. Here we define conductance and spectral gaps and summarise the relationships between the four properties.

Define the Hilbert space $L_{0}^{2}(\pi)=\{f:\mathcal{X}\rightarrow\mathbb{R};\mathbb{E}_{{\pi}}\left[{f(X)}\right]=0,~\mathbb{E}_{{\pi}}\left[{f^{2}(X)}\right]<\infty\}$ with the inner product $\langle f,g\rangle=\int_{\mathcal{X}}f(x)g(x)\pi(\mbox{d}x)$ , and consider $P$ as an operator acting on $L_{0}^{2}(\pi)$ according to $(Pf)(x)=\int_{\mathcal{X}}P(x,\mbox{d}y)f(y)$ . If $P$ is reversible then it is a self-adjoint operator on $L_{0}^{2}(\pi)$ and by the spectral theorem for bounded self-adjoint operators, for each $f\in L_{0}^{2}(\pi)$ , $\langle f,P^{n}f\rangle=\int_{-1}^{1}\lambda^{n}H_{f}(\mbox{d}\lambda)$ for some positive measure $H_{f}$ on $[-1,1]$ . Let

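$$r_{1}:=\inf_{f\in L_{0}^{2}(\pi)}\frac{\langle f,Pf\rangle}{\langle f,f\rangle}\geq-1\quad\mbox{and}\quad r_{2}:=\sup_{f\in L_{0}^{2}(\pi)}\frac{\langle f,Pf\rangle}{\langle f,f\rangle}\leq 1,$$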

or, equivalently (e.g. Yosida, 1980, p320, Theorem 2), that the smallest closed interval containing the support of $H_{f}$ for all $f\in L_{0}^{2}(\pi)$ is $[r_{1},r_{2}]$ . The spectral gap of $P$ is $1-\max(|r_{1}|,|r_{2}|)$ (e.g. Geyer, 1992; Roberts and Rosenthal, 1997), the right spectral gap is $1-r_{2}$ and the left spectral gap is $1+r_{1}$ . $P$ is said to have a spectral gap (or a left or right spectral gap) if its spectral gap (or left or right spectral gap) is non-zero.

For any set $\mathcal{A}\in\mathcal{F}$ with $\pi(\mathcal{A})>0$ consider the probability of leaving $\mathcal{A}$ at the next iteration given that the stationary chain is currently in $\mathcal{A}$ :

$$\kappa(\mathcal{A}):=\frac{1}{\pi(\mathcal{A})}\int_{\mathcal{A}}P(x,\mathcal{A}^{c})\pi(\mbox{d}x).$$

The conductance, $\kappa$ , for a Markov kernel $P$ with invariant measure $\pi$ is then (e.g. Lawler and Sokal, 1988; see also Jerrum and Sinclair, 1988)

$$\kappa:=\inf_{\mathcal{A}:0<\pi(\mathcal{A})\leq 1/2}\kappa(\mathcal{A}).$$
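On a small finite state space, the conductance (the infimum, over sets $\mathcal{A}$ with $0<\pi(\mathcal{A})\leq 1/2$ , of the probability of leaving $\mathcal{A}$ in one step from stationarity restricted to $\mathcal{A}$ ) can be computed by brute force and compared with the right spectral gap. The sketch below is only an illustration: the article works on general state spaces, and the five-state target, the nearest-neighbour Metropolis chain and all function names are assumptions made here.

```python
import itertools
import numpy as np

def metropolis_matrix(pi):
    """Nearest-neighbour Metropolis chain on states 0..n-1 targeting pi."""
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                # Propose each neighbour w.p. 1/2; accept w.p. 1 ∧ pi[j]/pi[i].
                P[i, j] = 0.5 * min(1.0, pi[j] / pi[i])
        P[i, i] = 1.0 - P[i].sum()   # rejected or out-of-range proposals stay put
    return P

def conductance(P, pi):
    """Minimise kappa(A) over all subsets A with 0 < pi(A) <= 1/2."""
    n = len(pi)
    best = np.inf
    for k in range(1, n):
        for A in itertools.combinations(range(n), k):
            mass = pi[list(A)].sum()
            if mass <= 0.5 + 1e-12:              # tolerance for rounding
                Ac = [j for j in range(n) if j not in A]
                flow = sum(pi[i] * P[i, Ac].sum() for i in A)
                best = min(best, flow / mass)
    return best

def right_spectral_gap(P, pi):
    """1 - r_2, via the symmetrised matrix D^{1/2} P D^{-1/2}."""
    d = np.sqrt(pi)
    S = (d[:, None] * P) / d[None, :]            # symmetric when P is reversible
    eig = np.sort(np.linalg.eigvalsh(S))
    return 1.0 - eig[-2]                         # drop the eigenvalue 1 (constants)

pi = np.array([1.0, 2.0, 4.0, 2.0, 1.0]) / 10.0  # assumed toy target
P = metropolis_matrix(pi)
kappa = conductance(P, pi)
gap = right_spectral_gap(P, pi)
```

For a reversible chain the standard Cheeger-type inequalities $\kappa^{2}/2\leq 1-r_{2}\leq 2\kappa$ tie the two quantities together, which is why a positive conductance and a right spectral gap occur together.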

For any reversible Markov chain we have the following relationships:

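$$\begin{aligned}P\mbox{ is variance bounding}&\Leftrightarrow P\mbox{ has a positive conductance (Thrm 2.1 of Lawler and Sokal, 1988)}\\&\Leftarrow P\mbox{ has a spectral gap}\\&\Leftrightarrow P\mbox{ is geometrically ergodic (Thrm 2.1 of Roberts and Rosenthal, 1997).}\end{aligned}$$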

These relationships will be used repeatedly in the sequel without further reference.

2.3 Example algorithms

To exemplify our theoretical results we will consider four specific, frequently-used MH algorithms.

1. The Metropolis-Hastings independence sampler (MHIS): $q(x,y)=q(y)$ .

2. The random walk Metropolis (RWM): $q(x,y)=q(x-y)=q(y-x)$ ; e.g. $q(x,y)=\mathsf{N}(y;x,\lambda^{2}I)$ .

3. The Metropolis-adjusted Langevin algorithm (MALA): $q(x,y)=\mathsf{N}(y;x+\frac{1}{2}\lambda^{2}\nabla\log\pi(x),\lambda^{2}I)$ .

4. The truncated MALA:

$$q(x,y)=\mathsf{N}\left(y;x+\frac{1}{2}\lambda^{2}R(x),\lambda^{2}I\right),~\mbox{where}~R(x)=\frac{D\nabla\log\pi(x)}{D\vee\left|\left|{\nabla\log\pi(x)}\right|\right|},\tag{6}$$

for some $D>0$ .
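The three local proposals differ only in their mean, which the following Python sketch makes explicit. The function names and the example gradient are hypothetical, chosen here for illustration; the MHIS is omitted because its proposal $q(y)$ ignores the current state altogether.

```python
import numpy as np

def truncated_drift(grad, D):
    """R(x) = D * grad / (D ∨ ||grad||).

    Equals grad when ||grad|| <= D; otherwise it is grad rescaled to norm D.
    """
    return D * grad / max(D, np.linalg.norm(grad))

def proposal_mean(x, kind, lam, grad_log_pi=None, D=None):
    """Mean of the N(., lam^2 I) proposal for RWM, MALA or truncated MALA."""
    if kind == "rwm":
        return x
    g = grad_log_pi(x)
    if kind == "mala":
        return x + 0.5 * lam**2 * g
    if kind == "tmala":
        return x + 0.5 * lam**2 * truncated_drift(g, D)
    raise ValueError(f"unknown proposal kind: {kind}")

def propose(x, kind, lam, rng, **kw):
    """Draw y ~ N(proposal_mean(x), lam^2 I)."""
    return proposal_mean(x, kind, lam, **kw) + lam * rng.normal(size=x.shape)

# Assumed example: log pi(x) = -||x||^4 / 4, so grad log pi(x) = -||x||^2 x.
grad_log_pi = lambda x: -np.dot(x, x) * x
x0 = np.array([3.0, 4.0])
mala_mean = proposal_mean(x0, "mala", 0.1, grad_log_pi=grad_log_pi)
tmala_mean = proposal_mean(x0, "tmala", 0.1, grad_log_pi=grad_log_pi, D=2.0)
```

With this light-tailed example the untruncated drift has norm 125 at `x0`, whereas the truncated drift has norm `D = 2`, which is the mechanism by which the truncated MALA avoids the overshooting that affects the MALA when the gradient grows quickly.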

In proposals of types 2, 3 and 4, $\lambda$ is often referred to as the scale parameter of the proposal. The MHIS and RWM have been used since the early days of MCMC (e.g. Tierney, 1994); conditions under which they are geometrically ergodic (and, hence, variance bounding) have been well studied; see, for example, Liu (1996) and Mengersen and Tweedie (1996) for the MHIS and Mengersen and Tweedie (1996), Roberts and Tweedie (1996b) and Jarner and Hansen (2000) for the RWM. Essentially, for the MHIS the proposal, $q$ , must not have lighter tails than the target, and for the RWM the target must have sufficiently smooth and exponentially decreasing tails. The MALA was introduced in Besag (1994) and was analysed in Roberts and Tweedie (1996a), in which the truncated MALA was also introduced. The MALA can be much more efficient than the RWM in moderate to high dimensions. As with the RWM, for geometric ergodicity the MALA requires exponentially decreasing tails, but if the tails decrease too quickly, $\left|\left|{\nabla\log\pi}\right|\right|$ grows too quickly and the MALA can fail to be geometrically ergodic. The truncated MALA circumvents this problem.

In Banterle et al. (2019) it is shown that the geometric ergodicity of an RWM algorithm need not be inherited by the resulting DA algorithm.

Example 1.

(Banterle et al., 2019) Let $\mathcal{X}=\mathbb{R}$ with $\pi(x)\propto e^{-x^{2}/2}$ and $q(x,y)\propto e^{-(y-x)^{2}/(2\lambda^{2})}$ . If $\hat{\pi}(x)\propto e^{-x^{2}/(2\sigma^{2})}$ , with $\sigma^{2}<1$ then $\tilde{\mathsf{P}}$ is not geometrically ergodic.

The following conditional equivalence (proved in Section A.2) is used throughout the sequel. If the parent kernel is geometrically ergodic then the DA kernel must have a left spectral gap, and with this constraint geometric ergodicity and variance bounding are equivalent.

Proposition 1.

Let $\mathsf{P}$ be a MH kernel targeting $\pi$ as specified in (2). Let $\tilde{\mathsf{P}}$ be the DA kernel derived from this through the approximation $\hat{\pi}$ as in (4). If $\mathsf{P}$ is geometrically ergodic then

$$\tilde{\mathsf{P}}\mbox{ is geometrically ergodic}\iff\tilde{\mathsf{P}}\mbox{ is variance bounding}.$$

The original random walk Metropolis algorithm on $\pi(x)$ is geometrically ergodic (Mengersen and Tweedie, 1996), and hence variance bounding, so the DA kernel in Example 1 has not inherited its parent’s desirable properties. As a direct corollary of our Theorem 3 (see Section 4.2) we find that $\sigma^{2}\geq 1$ is exactly the right condition in this case:

Example 2.

Let $\mathcal{X}=\mathbb{R}$ with $\pi(x)\propto e^{-x^{2}/2}$ and $q(x,y)\propto e^{-(y-x)^{2}/(2\lambda^{2})}$ . If $\hat{\pi}(x)\propto e^{-x^{2}/(2\sigma^{2})}$ , with $\sigma^{2}\geq 1$ then $\tilde{\mathsf{P}}$ is variance bounding and geometrically ergodic.
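A short calculation, included here as an informal aid and not as part of the original argument, indicates why the sign of $\sigma^{2}-1$ matters in Examples 1 and 2. Because the Gaussian random-walk proposal is symmetric, it cancels from $\mathsf{r}_{1}$ , and substituting the two Gaussian densities into the definitions of $\mathsf{r}_{1}$ and $\mathsf{r}_{2}$ gives:

```latex
\mathsf{r}_{1}(x,y)=\exp\left\{\frac{x^{2}-y^{2}}{2\sigma^{2}}\right\},
\qquad
\mathsf{r}_{2}(x,y)=\exp\left\{\frac{x^{2}-y^{2}}{2}\left(1-\frac{1}{\sigma^{2}}\right)\right\},
\qquad
\mathsf{r}_{1}(x,y)\,\mathsf{r}_{2}(x,y)=\exp\left\{\frac{x^{2}-y^{2}}{2}\right\}=\frac{\pi(y)}{\pi(x)}.
```

For a proposal towards the origin, $|y|<|x|$ , we have $\mathsf{r}_{1}>1$ . If $\sigma^{2}\geq 1$ then $\mathsf{r}_{2}\geq 1$ too, so every inward proposal is accepted, just as under the parent RWM. If $\sigma^{2}<1$ then $\mathsf{r}_{2}<1$ and, for a fixed jump $y-x$ , the difference $x^{2}-y^{2}$ grows linearly in $|x|$ , so $\mathsf{r}_{2}\rightarrow 0$ as $|x|\rightarrow\infty$ : inward proposals pass Stage One but are increasingly rejected at Stage Two, while outward proposals are increasingly rejected at Stage One. Informally, the chain becomes ever more likely to remain where it is the further into the tails it starts, which is consistent with the failure in Example 1.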

Examples 1 and 2 suggest an intuition that problems may arise when $\hat{\pi}(x)$ has lighter tails than $\pi(x)$ . As we shall see, this is a part of the story; however, in general, heavier tails are not sufficient to guarantee inheritance of the variance bounding property, and for a class of algorithms where heavy tails are sufficient, lighter tails can also be sufficient provided they are not too much lighter, in a sense we make precise.

3 Variance bounding: inheritance and equivalence

Throughout this section we use the following generic formulation for two Markov kernels.

Definition 1.

Let $P_{A}(x,\mbox{d}y)$ and $P_{B}(x,\mbox{d}y)$ be propose-accept-reject Markov kernels both targeting a distribution $\pi$ , and using, respectively, proposal densities of $q_{A}(x,y)$ and $q_{B}(x,y)$ and acceptance probabilities of $\alpha_{A}(x,y)$ and $\alpha_{B}(x,y)$ .

Theorem 1, below, follows from Lemma 1, which is proved in Section A.1.1. It generalises Corollary 12 of Roberts and Rosenthal (2008) to allow for different acceptance probabilities and, more importantly, removes the need for a fixed, uniform minorisation condition. The minorisation need only hold in a region $\mathcal{D}(x)$ such that under $P_{A}$ there is “unlikely” to be an accepted proposal in $\mathcal{D}(x)^{\complement}$ .

Lemma 1.

Let $P_{A}$ , $P_{B}$ , $q_{A}$ , $q_{B}$ , $\alpha_{A}$ and $\alpha_{B}$ be as in Definition 1, and let the conductances of $P_{A}$ and $P_{B}$ be $\kappa_{A}$ and $\kappa_{B}$ respectively. If $\kappa_{A}>0$ and there is an $\epsilon<\kappa_{A}$ and a $\delta>0$ such that for $\pi$ -almost all $x\in\mathcal{X}$ , there is a region $\mathcal{D}(x)\in\mathcal{F}$ such that

$$\int_{\mathcal{D}(x)^{\complement}}q_{A}(x,y)\alpha_{A}(x,y)\mbox{d}y\leq\epsilon,\tag{7}$$

and

$$y\in\mathcal{D}(x)\Rightarrow q_{B}(x,y)\alpha_{B}(x,y)\geq\delta q_{A}(x,y)\alpha_{A}(x,y),\tag{8}$$

then $\kappa_{B}\geq(1-\epsilon/\kappa_{A})\delta\kappa_{A}$ .
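To give a feel for the size of this bound, the following arithmetic uses purely illustrative values, $\kappa_{A}=0.2$ , $\epsilon=0.05$ and $\delta=0.5$ , which are assumptions made here and do not come from the article:

```latex
\kappa_{B}\geq\left(1-\frac{\epsilon}{\kappa_{A}}\right)\delta\kappa_{A}
=\left(1-\frac{0.05}{0.2}\right)\times 0.5\times 0.2
=0.75\times 0.1
=0.075>0.
```

The right-hand side can be rewritten as $\delta(\kappa_{A}-\epsilon)$ , so the bound is positive exactly when $\epsilon<\kappa_{A}$ , and it shrinks linearly as either the neglected acceptance mass $\epsilon$ grows or the minorisation constant $\delta$ falls.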

If $P_{A}$ is variance bounding, $\kappa_{A}>0$ ; choose an $\epsilon\in(0,\kappa_{A})$ and for each $x$ a corresponding $\mathcal{D}(x)$ so as to satisfy (7) and (8) to obtain:

Theorem 1.

Let $P_{A}$ , $P_{B}$ , $q_{A}$ , $q_{B}$ , $\alpha_{A}$ and $\alpha_{B}$ be as in Definition 1. If $P_{A}$ is variance bounding and for any $\epsilon>0$ there is a $\delta>0$ such that for $\pi$ -almost all $x\in\mathcal{X}$ there is a region $\mathcal{D}(x)\in\mathcal{F}$ such that (7) and (8) hold, then $P_{B}$ is also variance bounding.

The relationship between conductance and right spectral gap has recently been used in other contexts to bound the behaviour of one Markov kernel in terms of that of another (Lee and Latuszyński, 2014; Rudolph and Sprungk, 2016). Lemma 1 itself shows that condition (7) need only hold for a single $\epsilon<\kappa_{A}$ ; however, since in practice $\kappa_{A}$ is unlikely to be known, the conditions of Theorem 1 are more practically useful.

From Section 4 we apply Theorem 1 to provide sufficient conditions for a delayed-acceptance kernel to inherit variance bounding from its Metropolis-Hastings parent. However, if a DA kernel is variance bounding then so is its parent MH kernel. Thus, the sufficient conditions in Section 4 imply an equivalence between the two kernels with respect to the variance bounding property. In this section, after two key definitions, we return, briefly, to this equivalence with regard to the variance bounding property and provide sufficient conditions for equivalence (over potential targets) between Metropolis-Hastings kernels arising from two different proposal densities.

The most natural special case of (7) in practice is where the kernel is uniformly local, which we define as follows:

Definition 2.

(Uniformly Local) A proposal is uniformly local if, given any $\epsilon>0$ ,

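$$\exists~r<\infty~\mbox{such that for all}~x\in\mathcal{X},~\int_{B(x,r)^{c}}q(x,y)\mbox{d}y<\epsilon.\tag{9}$$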

A propose-accept-reject kernel is defined to be uniformly local when its proposal is uniformly local.

Here and throughout this article, $B(x,r):=\{y\in\mathcal{X}:\left|\left|{y-x}\right|\right|<r\}$ is the open ball of radius $r$ centred on $x$ . In our examples, $\left|\left|{x}\right|\right|$ indicates the Euclidean norm, although the results are equally valid for other norms such as the Mahalanobis norm.

Control of the ratio $q(y,x)/q(x,y)$ will also be important and so we define the following.

Definition 3.

For any proposal density $q(x,y)$ ,

$$\Delta(x,y;q):=\log q(y,x)-\log q(x,y).$$

Clearly, the RWM is a uniformly local kernel; moreover $\Delta(x,y;q_{\rm RWM})=0$ . In contrast, on any target with unbounded support, the MHIS cannot be uniformly local; as we shall see, the behaviour of $\Delta$ is then irrelevant. For the MALA and the truncated MALA we have:

Proposition 2.

(A) Let $q(x,y)$ be the proposal for the truncated MALA in (6) or for the MALA on a target where $\mbox{ess sup}_{x}\left|\left|{\nabla\log\pi(x)}\right|\right|=D<\infty$ . Then

(i) For all $x$ , $\mathbb{P}_{{q}}\left({\left|\left|{Y-x}\right|\right|>r}\right)\rightarrow 0$ uniformly in $x$ as $r\rightarrow\infty$ , so $q$ is uniformly local, as defined in (9).

(ii) $|\Delta(x,y;q)|\leq h(\left|\left|{y-x}\right|\right|):=D\left|\left|{y-x}\right|\right|+\lambda^{2}D^{2}/8$ .

(B) The proposal, $q(x,y)$ , for the MALA on a target where $\mbox{ess sup}_{x}\left|\left|{\nabla\log\pi(x)}\right|\right|=\infty$ is not uniformly local.
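The bound in (A)(ii) can be checked by expanding the two Gaussian log densities. The derivation below is a sketch added for the reader and is not the proof given in the appendix; it writes $u=y-x$ , takes the proposal mean to be $x+\frac{1}{2}\lambda^{2}R(x)$ (with $R=\nabla\log\pi$ in the MALA case), and uses only $\|R(\cdot)\|\leq D$ :

```latex
\begin{aligned}
\Delta(x,y;q)
&=\frac{1}{2\lambda^{2}}\left\{\left\|u-\tfrac{1}{2}\lambda^{2}R(x)\right\|^{2}
  -\left\|u+\tfrac{1}{2}\lambda^{2}R(y)\right\|^{2}\right\}\\
&=-\tfrac{1}{2}\,u\cdot\{R(x)+R(y)\}
  +\tfrac{1}{8}\lambda^{2}\left\{\|R(x)\|^{2}-\|R(y)\|^{2}\right\},\\
|\Delta(x,y;q)|
&\leq\tfrac{1}{2}\|u\|\,(D+D)+\tfrac{1}{8}\lambda^{2}D^{2}
=D\|y-x\|+\lambda^{2}D^{2}/8 .
\end{aligned}
```

The first term is controlled by the Cauchy-Schwarz inequality and the second by the fact that both squared norms lie in $[0,D^{2}]$ , so their difference is at most $D^{2}$ in absolute value.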

The applicability of Lemma 1 and Theorem 1 ranges beyond delayed-acceptance kernels. Here we supply sufficient conditions for an equivalence between Metropolis–Hastings proposals.

Theorem 2.

Let $P_{A}$ , $P_{B}$ , $q_{A}$ , $q_{B}$ , $\alpha_{A}$ and $\alpha_{B}$ be as in Definition 1 except that $q_{A}(x,y)$ and $q_{B}(x,y)$ are uniformly local proposal kernels, with $\log q_{A}(x,y)-\log q_{B}(x,y)$ a continuous function from $\mathbb{R}^{2d}$ to $\mathbb{R}$ . If, for $\pi$ -almost all $x$ and for some function $h:[0,\infty)\rightarrow[0,\infty)$ with $h(r)<\infty$ for all $r<\infty$ ,

[TABLE]

then $P_{A}$ is variance bounding if and only if $P_{B}$ is variance bounding.

Thus, for example, any two random-walk Metropolis algorithms with Gaussian jumps are equivalent, in that if, on a particular target, one is variance bounding then so is the other. When restricted to targets with a continuous gradient this equivalence extends to truncated MALA algorithms. The continuity requirement on $\log q_{A}-\log q_{B}$ rules out, for example, an equivalence between a Gaussian random walk and a random walk where the proposal has bounded support; indeed, the latter may not even be ergodic if the target has gaps in its support.

4 Application to delayed-acceptance kernels

4.1 Key definitions and properties

For uniformly local kernels we will describe two general sets of sufficient conditions for (8) to hold. The first is based upon the fact that the acceptance probability for $\tilde{\mathsf{P}}$ can be written as

[TABLE]

where $\mathsf{r}_{1}$ and $\mathsf{r}_{2}$ are as defined in (3). So, if $|\log\mathsf{r}_{1}(x,y)|\leq m$ or $|\log\mathsf{r}_{2}(x,y)|\leq m$ then $\tilde{\alpha}(x,y)\geq e^{-m}\alpha(x,y)$ . The quantity $|\log\mathsf{r}_{2}(x,y)|=|[\log\hat{\pi}(y)-\log\pi(y)]-[\log\hat{\pi}(x)-\log\pi(x)]|$ measures the discrepancy between the error in the approximation at the proposed value and the error in the approximation at the current value. We name this intuitive quantity the log-error discrepancy. The quantity $\log\mathsf{r}_{1}$ is less natural since it relates $\hat{\pi}(x),\hat{\pi}(y)$ and $q(x,y)$ .

The second set of conditions is based upon the fact that if either $\mathsf{r}_{1}(x,y)\leq 1$ and $\mathsf{r}_{2}(x,y)\leq 1$ or if $\mathsf{r}_{1}(x,y)\geq 1$ and $\mathsf{r}_{2}(x,y)\geq 1$ then $\tilde{\alpha}(x,y)=\alpha(x,y)$ , whatever the log-error discrepancy.
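Both observations can be verified numerically on the log scale. The sketch below assumes the usual delayed-acceptance forms $\alpha=1\wedge(\mathsf{r}_{1}\mathsf{r}_{2})$ and $\tilde{\alpha}=[1\wedge\mathsf{r}_{1}][1\wedge\mathsf{r}_{2}]$ (the displays (2) to (4) are not reproduced in this section, so this is an assumption), and treats $\log\mathsf{r}_{1}$ and $\log\mathsf{r}_{2}$ as arbitrary real numbers; the function names are hypothetical.

```python
import math
import random

def alpha_mh(log_r1, log_r2):
    # Parent acceptance probability: min(1, r1 * r2).
    return math.exp(min(0.0, log_r1 + log_r2))

def alpha_da(log_r1, log_r2):
    # Delayed-acceptance probability: min(1, r1) * min(1, r2).
    return math.exp(min(0.0, log_r1) + min(0.0, log_r2))

def check(n=100000, seed=0, tol=1e-12):
    rng = random.Random(seed)
    for _ in range(n):
        a = rng.uniform(-5.0, 5.0)   # plays the role of log r1
        b = rng.uniform(-5.0, 5.0)   # plays the role of log r2
        m = min(abs(a), abs(b))      # smallest m with |log r1| <= m or |log r2| <= m
        mh, da = alpha_mh(a, b), alpha_da(a, b)
        # exp(-m) * alpha <= alpha_tilde <= alpha (up to rounding).
        if not (math.exp(-m) * mh <= da * (1 + tol) and da <= mh * (1 + tol)):
            return False
        # r1 and r2 on the same side of 1: the two probabilities coincide.
        if (a <= 0 and b <= 0) or (a >= 0 and b >= 0):
            if abs(da - mh) > tol:
                return False
    return True
```

Taking `m` as the smaller of the two absolute log-ratios is the most demanding case, so a `True` result covers any larger $m$ as well.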

These considerations lead to the natural definitions of a ‘potential problem’ set, $\mathcal{M}_{m}(x)$ , and a ‘no problem’ set $\mathcal{C}(x)$ , as follows:

[TABLE]

Theorem 1 then leads directly to the following.

Corollary 1.

Let $\mathsf{P}$ be the Metropolis-Hastings kernel given in (2) and let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance kernel given in (4). Suppose that for all $\epsilon>0$ there is an $m<\infty$ such that for $\pi$ -almost all $x$ there exists a set $\mathcal{D}(x)\subseteq\mathcal{X}$ such that

[TABLE]

and

[TABLE]

Subject to these conditions, if $\mathsf{P}$ is variance bounding then so is $\tilde{\mathsf{P}}$ .

When $\hat{\pi}$ has heavier tails than $\pi$ then for large $x$ , the set $\mathcal{C}(x)$ can play an important role in the inheritance of the variance bounding property. In a dimension $d>1$ , there are numerous possible definitions of ‘heavier tails’. The following is precisely that required for our purposes:

Definition 4.

(heavy tails) An approximate density $\hat{\pi}$ is said to have heavy tails with respect to a density $\pi$ if

[TABLE]

Intuitively, the left hand side is true when $x$ is ‘further from the centre’ (according to $\hat{\pi}$ ) than $y$ , and the implication is that the further out a point, the larger $\hat{\pi}$ is compared with $\pi$ .

For uniformly local kernels we show (Corollary 3) that it is sufficient that either the log error discrepancy should satisfy a growth condition that is uniform in $||y-x||$ , or (Theorem 3) that the tails of the approximation should be heavier than those of the target and that $|\Delta(x,y;q)|$ should satisfy a growth condition that is uniform in $\left|\left|{y-x}\right|\right|$ .

For all kernels, boundedness of the error $\hat{\pi}(x)/\pi(x)$ away from $0$ and $\infty$ will ensure the required inheritance (Corollary 2). This is a very strong condition, but we exhibit MHIS and MALA algorithms where the weaker conditions, that are sufficient for a uniformly local kernel, are satisfied, but the DA kernel is not variance bounding even though the MH kernel is.

4.2 DA kernels with the same proposal distribution as the parent

Suppose that for all $x\in\mathcal{X}$ , $\gamma_{lo}\leq\hat{\pi}(x)/\pi(x)\leq\gamma_{hi}$ . Then $|\log\mathsf{r}_{2}(x,y)|\leq\log(\gamma_{hi}/\gamma_{lo})$ , so applying Corollary 1 with $\mathcal{D}(x)=\mathcal{X}$ and $m=\log\gamma_{hi}-\log\gamma_{lo}$ leads to:

Corollary 2.

Let $\mathsf{P}$ and $\tilde{\mathsf{P}}$ be as described in Corollary 1. If there exist $\gamma_{lo}>0$ and $\gamma_{hi}<\infty$ such that $\gamma_{lo}\leq\hat{\pi}(x)/\pi(x)\leq\gamma_{hi}$ , and if $\mathsf{P}$ is variance bounding then so is $\tilde{\mathsf{P}}$ .
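The bound on the log-error discrepancy that underlies Corollary 2 can be spelled out in one line. Writing $e(x):=\log\hat{\pi}(x)-\log\pi(x)$ (a shorthand introduced here only for this derivation), the assumption gives $\log\gamma_{lo}\leq e(x)\leq\log\gamma_{hi}$ for all $x$ , and hence:

```latex
\begin{align*}
|\log\mathsf{r}_{2}(x,y)|
  = |e(y)-e(x)|
  \leq \sup_{z}e(z)-\inf_{z}e(z)
  \leq \log\gamma_{hi}-\log\gamma_{lo}
  = \log(\gamma_{hi}/\gamma_{lo}).
\end{align*}
```

The first equality uses the expression for $|\log\mathsf{r}_{2}(x,y)|$ given in Section 4.1.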

A more direct proof of Corollary 2 is possible using Dirichlet forms. However, Corollary 1 comes into its own when the error discrepancy is unbounded.

We first provide a cautionary example which shows that once the errors are unbounded the delayed-acceptance kernel need not inherit the variance bounding property from the Metropolis-Hastings kernel even if the growth of the log error discrepancy is uniformly bounded or if $\hat{\pi}$ has heavier tails than $\pi$ .

Example 3.

Let $\mathcal{X}=\mathbb{R}$ , let $\mathsf{P}$ be an MHIS with $q(x,y)=q(y)=\pi(y)=e^{-y}\mathds{1}(y>0)$ , and let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance kernel (4), with $\hat{\pi}(y)=ke^{-ky}\mathds{1}(y>0)$ with $k>0$ and $k\neq 1$ . $\mathsf{P}$ is geometrically ergodic, but $\tilde{\mathsf{P}}$ is neither geometrically ergodic nor variance bounding.

The problem with the algorithm in Example 3 is that for some $x$ values the proposal, $y$ , is very likely to be a long way from $x$ and yet $y\notin\mathcal{C}(x)$ . Our definition of a uniformly local proposal, (9), provides uniform control on the probability that $\left|\left|{y-x}\right|\right|$ is large. Since this is only strictly necessary for $y\notin\mathcal{C}(x)$ , (9) is stronger than necessary, but it is much easier to check.
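The mechanism in Example 3 can be made concrete by computing the mean delayed-acceptance probability from a point $x$ . The sketch below assumes the usual forms of $\mathsf{r}_{1}$ and $\mathsf{r}_{2}$ , under which $\tilde{\alpha}(x,y)=e^{-|k-1||y-x|}$ for this example, and approximates $\mathbb{E}_{q}[\tilde{\alpha}(x,Y)]$ by a simple midpoint rule; the function names, the truncation point and the grid size are choices made here.

```python
import math

def da_accept(x, y, k):
    # For q = pi = Exp(1) and pi_hat = Exp(k): r1 = exp((1-k)(y-x)) and
    # r2 = exp((k-1)(y-x)), so min(1, r1) * min(1, r2) = exp(-|k-1| |y-x|).
    return math.exp(-abs(k - 1.0) * abs(y - x))

def mean_da_accept(x, k, upper=60.0, n=60000):
    # Midpoint-rule approximation of E[alpha_tilde(x, Y)] with Y ~ Exp(1);
    # the integral is truncated at `upper`, beyond which exp(-y) is negligible.
    step = upper / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * step
        total += math.exp(-y) * da_accept(x, y, k)
    return total * step

if __name__ == "__main__":
    for x in (1.0, 5.0, 10.0, 20.0):
        print(x, mean_da_accept(x, k=2.0), mean_da_accept(x, k=0.5))
```

For either choice of $k$ the mean acceptance probability decays towards zero as $x$ grows, which is the behaviour that prevents the chain from leaving the tail, whereas the parent MHIS accepts every proposal.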

Our first sufficient condition for uniformly local kernels insists on uniformly bounded growth in the log-error discrepancy except when $\tilde{\alpha}(x,y)=\alpha(x,y)$ . For $\pi$ -almost all $x$ and for some function $h:[0,\infty)\rightarrow[0,\infty)$ with $h(r)<\infty$ for all $r<\infty$ ,

[TABLE]

If a proposal is uniformly local, given $\epsilon>0$ find $r(\epsilon)$ according to (9). Then (16) implies that for $y\in B(x,r)$ , $\mathcal{M}_{h(r)}\subseteq\mathcal{C}(x)$ . Applying Corollary 1 with $\mathcal{D}(x)=B(x,r)$ leads to the following.

Corollary 3.

Let $\mathsf{P}$ and $\tilde{\mathsf{P}}$ be as described in Corollary 1. In addition let $q(x,y)$ be a uniformly local proposal as in (9), and let the error discrepancy satisfy (16). If $\mathsf{P}$ is variance bounding then so is $\tilde{\mathsf{P}}$ .

Because most of the mass from the proposal, $y$ , is not too far away from the current value, $x$ , the discrepancy between the error at $x$ and the error at $y$ remains manageable provided the discrepancy grows in a manner that is controlled uniformly across the statespace. Since the random walk Metropolis on an exponential target density is geometrically ergodic (Mengersen and Tweedie, 1996), we may apply Corollary 3 with $h(r)=|k-1|r$ , and then Proposition 1, to obtain the following contrast to Example 3, which shows that the variance bounding property can be inherited even when the approximation has lighter tails than the target.

Example 4.

Let $\mathcal{X}=\mathbb{R}$ and let $\mathsf{P}$ be a RWM algorithm on $\pi(x)=e^{-x}\mathds{1}(x>0)$ using $q(x,y)\propto e^{-(y-x)^{2}/(2\lambda^{2})}$ . For any $k>0$ , let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance RWM algorithm using a surrogate of $\hat{\pi}(x)=ke^{-kx}\mathds{1}(x>0)$ . $\tilde{\mathsf{P}}$ is variance bounding and geometrically ergodic.
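A minimal sketch of the delayed-acceptance RWM in Example 4 is given below. It assumes the standard two-stage form of the algorithm (screen with $\hat{\pi}$ , then correct with $\pi$ ), which is taken to match (4); the function names, the seed and the tuning values are illustrative rather than those of the paper.

```python
import math
import random

def log_pi(x):
    # Target: Exp(1) density on x > 0.
    return -x if x > 0 else -math.inf

def log_pi_hat(x, k):
    # Surrogate: Exp(k) density on x > 0.
    return math.log(k) - k * x if x > 0 else -math.inf

def da_rwm(n_iter, k, lam, x0=1.0, seed=0):
    rng = random.Random(seed)
    x, chain, n_full = x0, [], 0
    for _ in range(n_iter):
        y = x + lam * rng.gauss(0.0, 1.0)
        # Stage 1: screen with the cheap surrogate (symmetric proposal).
        log_r1 = log_pi_hat(y, k) - log_pi_hat(x, k)
        if rng.random() < math.exp(min(0.0, log_r1)):
            n_full += 1  # the expensive target is evaluated only here
            # Stage 2: correct for the error; here log r2 = (k - 1)(y - x).
            log_r2 = (log_pi(y) - log_pi(x)) - (log_pi_hat(y, k) - log_pi_hat(x, k))
            if rng.random() < math.exp(min(0.0, log_r2)):
                x = y
        chain.append(x)
    return chain, n_full

if __name__ == "__main__":
    chain, n_full = da_rwm(50000, k=2.0, lam=2.0)
    print(sum(chain) / len(chain), n_full)
```

In this example $|\log\mathsf{r}_{2}(x,y)|=|k-1||y-x|$ , which is the growth function $h(r)=|k-1|r$ used when applying Corollary 3.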

As yet, the set $\mathcal{C}(x)$ has not played a part in any of our examples. It is precisely this set that allows a delayed-acceptance random walk Metropolis kernel to inherit the variance bounding property from its parent even when the error discrepancy is not controlled uniformly, provided $\hat{\pi}$ has tails that are heavier than those of $\pi$ . For general MH algorithms an additional control on the behaviour of $q$ is enough to guarantee inheritance of the variance bounding property.

Theorem 3.

Let $\mathsf{P}$ be the Metropolis-Hastings kernel given in (2) and let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance kernel given in (4). Further, let $q(x,y)$ be a uniformly local proposal in the sense of (9), let $\pi$ and $\hat{\pi}$ be continuous, and let $\hat{\pi}$ have heavier tails than $\pi$ in the sense of (15). Suppose that, in addition, for any $\mathcal{D}(x)$ required by (13) and (14) there exists a function $h:[0,\infty)\rightarrow[0,\infty)$ with $h(r)<\infty$ for all $r<\infty$ , such that for $\pi$ -almost all $x$

[TABLE]

Subject to these conditions, if $\mathsf{P}$ is variance bounding then so is $\tilde{\mathsf{P}}$ .

We now consider the delayed-acceptance versions of the random walk Metropolis, the truncated MALA, and the MALA. Before doing this we provide the details of a property that was anticipated in Roberts and Tweedie (1996a).

Proposition 3.

Let $\mathsf{P}_{\rm RWM}$ be a random walk Metropolis kernel using $q(x,y)\propto e^{-\frac{1}{2\lambda^{2}}\left|\left|{y-x}\right|\right|^{2}}$ and targeting a density $\pi(x)$ . Let $\mathsf{P}$ be a Metropolis-Hastings kernel on $\pi$ of the form $q(x,y)\propto e^{-\frac{1}{2\lambda^{2}}\left|\left|{y-x-v(x)}\right|\right|^{2}}$ , where $\mbox{$ \pi $-ess sup}_{x}\left|\left|{v(x)}\right|\right|<\infty$ . $\mathsf{P}_{\rm RWM}$ is variance bounding if and only if $\mathsf{P}$ is variance bounding.

Proposition 3 clearly applies to a truncated MALA kernel on $\pi(x)$ using $q$ as in (6). It, together with each of our subsequent results for the truncated MALA, also applies to a MALA kernel on a target where $\mbox{$ \pi $-ess sup}_{x}\left|\left|{\nabla\log\pi(x)}\right|\right|=D<\infty$ ; in practice, however, the useful set of such kernels is limited to targets with exponentially decaying tails, since MALA is not geometrically ergodic on targets with heavier tails (Roberts and Tweedie, 1996a).

Given Proposition 2 and its prelude, a direct application of Theorem 3 then leads to the following.

Example 5.

Let $\mathsf{P}_{\rm RWM}$ and $\mathsf{P}_{\rm TMALA}$ be, respectively, a random walk Metropolis kernel and a truncated MALA kernel on the differentiable density, $\pi(x)$ . Let $\tilde{\mathsf{P}}_{\rm RWM}$ and $\tilde{\mathsf{P}}_{\rm TMALA}$ be the corresponding delayed-acceptance kernels, created as in (4) through the continuous density, $\hat{\pi}(x)$ . Suppose also that $\hat{\pi}$ has heavier tails than $\pi$ in the sense of (15). Subject to these conditions, if $\mathsf{P}_{\rm RWM}$ is variance bounding then so is $\tilde{\mathsf{P}}_{\rm RWM}$ , and if $\mathsf{P}_{\rm TMALA}$ is variance bounding then so is $\tilde{\mathsf{P}}_{\rm TMALA}$ .

The MALA is geometrically ergodic when applied to one-dimensional targets of the form $\pi(x)\propto e^{-|{x}|^{\beta}}$ for $\beta\in[1,2)$ (Roberts and Tweedie, 1996a); when $\beta=2$ geometric ergodicity occurs provided $\lambda$ is sufficiently small, and for $\beta>2$ the MALA is not geometrically ergodic. Even when $\beta>1$ , however, Theorem 3 does not apply because the proposal is not uniformly local.
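The failure of uniform locality for $\beta>1$ can be seen by computing $\mathbb{P}_{q}(|Y-x|>r)$ exactly in one dimension. The sketch below assumes a target proportional to $e^{-x^{\beta}}$ on $x>0$ , so that the MALA drift is $\nu(x)=-\frac{1}{2}\lambda^{2}\beta x^{\beta-1}$ ; the function names are hypothetical.

```python
import math

def norm_cdf(z):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_far(x, r, beta, lam):
    # P(|Y - x| > r) for Y = x + nu(x) + lam * Z, with Z ~ N(0, 1) and
    # nu(x) = (lam^2 / 2) * d/dx log pi(x) = -(lam^2 / 2) * beta * x^(beta - 1).
    nu = -0.5 * lam ** 2 * beta * x ** (beta - 1.0)
    inside = norm_cdf((r - nu) / lam) - norm_cdf((-r - nu) / lam)
    return 1.0 - inside

if __name__ == "__main__":
    for x in (1.0, 1e2, 1e4):
        print(x, prob_far(x, r=5.0, beta=1.0, lam=1.0),
              prob_far(x, r=5.0, beta=1.5, lam=1.0))
```

With $\beta=1$ the probability does not depend on $x$ , whereas with $\beta=1.5$ it approaches $1$ as $x$ grows for any fixed $r$ , so no single $r(\epsilon)$ can serve for all $x$ .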

Example 6.

Let $\mathcal{X}=\mathbb{R}$ and let $\mathsf{P}$ be a MALA algorithm on $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ with $1\leq\beta<2$ . Let $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$ and let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance MALA kernel (4) (i.e. using a proposal of $Y=x+\frac{1}{2}\lambda^{2}\nabla\log\pi(x)+\lambda Z$ , where $Z\sim N(0,1)$ ). $\tilde{\mathsf{P}}$ is neither geometrically ergodic nor variance bounding, except when $\gamma=\beta$ .

The contrast between the truncated MALA and the MALA in Examples 5 and 6 highlights the importance of a uniformly local proposal. In practice, however, if $\pi(x)$ is computationally expensive to evaluate then, typically, $\nabla\log\pi(x)$ will also be expensive to evaluate and it might seem more reasonable to base the proposal for delayed-acceptance MALA and delayed-acceptance truncated MALA on $\nabla\log\hat{\pi}(x)$ .

4.3 Kernels where the proposal is based upon $\hat{\pi}$

On some occasions, the proposal $q(x,y)$ is a function of the posterior, $\pi(x)$ , and on such occasions it may be expedient for the delayed-acceptance algorithm to use a proposal $\hat{q}(x,y)$ , which is based upon $\hat{\pi}(x)$ . The acceptance rate is $\tilde{\alpha}_{b}(x,y)=[1\wedge\mathsf{r}_{1b}(x,y)][1\wedge\mathsf{r}_{2}(x,y)]$ , where

[TABLE]

With $\overline{\widetilde{\alpha}}_{b}(x):=\mathbb{E}_{{\hat{q}}}\left[{\tilde{\alpha}_{b}(x,Y)}\right]$ , the corresponding delayed acceptance kernel is

[TABLE]

Let $\mathsf{r}_{\rm hyp}(x,y):=\pi(y)\hat{q}(y,x)/[\pi(x)\hat{q}(x,y)]$ , $\alpha_{\rm hyp}(x,y)=1\wedge\mathsf{r}_{\rm hyp}(x,y)$ , and, with $\overline{\alpha}_{\rm hyp}(x)=\mathbb{E}_{{\hat{q}}}\left[{\alpha_{\rm hyp}(x,Y)}\right]$ , consider the hypothetical Metropolis-Hastings kernel:

[TABLE]

Now, $\tilde{\alpha}_{b}(x,y)\leq\alpha_{\rm hyp}(x,y)$ , so if $\mathsf{P}_{\rm hyp}$ is not variance bounding then $\tilde{\mathsf{P}}_{b}$ is not variance bounding either. There is an exact correspondence between $\mathsf{P}$ from the previous section, and $\mathsf{P}_{\rm hyp}$ , and it is natural to consider inheritance of geometric ergodicity from $\mathsf{P}_{\rm hyp}$ exactly as in the previous section we considered inheritance from $\mathsf{P}$ . The theoretical results are analogous and will not be restated; moreover, the theoretical properties of kernels of the form $\mathsf{P}_{\rm hyp}$ are less well investigated. Instead we illustrate inheritance of variance bounding (or its lack) through two examples.

Example 7.

Let $\mathsf{P}_{\rm TMALA}$ be a truncated MALA kernel on the differentiable density, $\pi(x)$ . Let $\tilde{\mathsf{P}}_{\rm TMALAb}$ be the corresponding delayed-acceptance kernel, created as in (18) through the differentiable density $\hat{\pi}(x)$ . $\tilde{\mathsf{P}}_{\rm TMALAb}$ inherits the variance bounding property from $\mathsf{P}_{\rm TMALA}$ if either of the following conditions holds: (i) there is uniformly bounded growth in the log-error discrepancy, in the sense of (16), or (ii) $\hat{\pi}$ has heavier tails than $\pi$ in the sense of (15).

Our penultimate example suggests that a delayed-acceptance MALA based upon an approximation that has heavier (though not too much heavier) tails is a reasonable choice.

Example 8.

Let $\mathcal{X}=\mathbb{R}$ and let $\mathsf{P}$ be a MALA algorithm on $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ with $1\leq\beta<2$ . Let $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$ and let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance MALA kernel created as in (18) through the differentiable density $\hat{\pi}(x)$ . $\tilde{\mathsf{P}}$ is variance bounding $\iff$ $\tilde{\mathsf{P}}$ is geometrically ergodic $\iff 1\leq\gamma\leq\beta$ .

We summarise the consequences of Examples 5 to 8 for $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ and $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$ in Table 1, filling in the two blanks with Example 9 below. The table displays the results in terms of variance bounding, which is equivalent to geometric ergodicity in all these cases by Proposition 1.

Example 9.

Let $\mathcal{X}=\mathbb{R}$ and let $\mathsf{P}$ be a RWM or truncated MALA algorithm on $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ with $1\leq\beta<2$ . Let $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$ and let $\tilde{\mathsf{P}}$ be the corresponding delayed-acceptance RWM or truncated MALA kernel created either from (4) or (18) through the differentiable density $\hat{\pi}(x)$ . If $1<\beta<\gamma$ , $\tilde{\mathsf{P}}$ is neither geometrically ergodic nor variance bounding.

5 Numerical demonstrations

The theoretical results from Section 4 were made more concrete through Examples 1 to 9. In this section we investigate the numerical performance of delayed acceptance algorithms in examples similar to those used in earlier sections. The specific targets in the earlier Examples were chosen to demonstrate particular points as simply as possible; here we deliberately investigate a broader class of targets, the exponential family class (e.g. Roberts and Tweedie, 1996a; Livingstone et al., 2019):

[TABLE]

The parameters $\beta$ and $\gamma$ in (19) govern the lightness of the tails in the target and the approximation to it respectively, and allow us to vary these separately.

A lack of variance bounding can be seen in terms of the chain struggling to leave a certain region, which typically has a low probability under $\pi$ . In practice, this lack of variance bounding (or a lack of geometric ergodicity) can manifest in two ways.

1. When a sensible starting value is not known, a starting value with poor properties may be chosen unwittingly and the algorithm may struggle to move from this initial point or region of the space.

2. Even when started from a reasonable value, over the course of a sufficiently long run the algorithm will visit this “danger region” and then struggle to leave.

For the target (20), the “danger region” corresponds to the tails of $\pi$ .

Our experiments deliberately start the algorithm in the tails of $\pi$ and then measure the number of iterations to reach the centre of the distribution. To make “reaching the centre” concrete, we find the number of iterations until $||x||$ is less than its median value under $\pi$ . To decide where in the tails we start, we set $||x||$ to its $1-p_{0}$ quantile under $\pi$ , for $p_{0}\in\{10^{-1},10^{-2},\dots,10^{-6}\}$ in Scenarios (i) and (ii), and $p_{0}\in\{10^{-4},10^{-8},\dots,10^{-24}\}$ in Scenarios (iii) and (iv); we start the algorithm from a uniformly random point on the surface of that hypersphere. In practical MCMC, many runs are of $\mathcal{O}(10^{6})$ iterations, so it is not unreasonable that issues which are detected for $p_{0}\geq 10^{-8}$ might occur in practice even when the algorithm is started from a sensible value. We work in dimension $d=5$ and repeat each experiment $20$ times, except for scenario (iii) where we repeat $10$ times to avoid excessive clutter.
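The experimental protocol just described can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: the starting radius (the $1-p_{0}$ quantile of $||x||$ under $\pi$ ) and the median radius depend on the target and are taken as inputs rather than computed, and `step` stands for any user-supplied function that performs one iteration of the kernel under study.

```python
import math
import random

def random_point_on_sphere(radius, d=5, rng=random):
    # Normalising a standard Gaussian vector gives a uniformly distributed direction.
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        if norm > 0.0:
            return [radius * v / norm for v in z]

def iterations_to_centre(step, x0, median_radius, max_iter):
    # Count iterations of the user-supplied transition `step` until
    # ||x|| drops below the median radius; returns max_iter if it never does.
    x = x0
    for i in range(max_iter):
        if math.sqrt(sum(v * v for v in x)) < median_radius:
            return i
        x = step(x)
    return max_iter
```

Capping the count at `max_iter` mirrors the use of a maximum number of iterations to bound the computing time.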

We consider four specific scenarios, and so as to bound the amount of computing time, in each scenario we set a maximum number of iterations for which the algorithm should be run. In all scenarios the time until convergence increases with the starting quantile, whether or not the algorithm is variance bounding, for the most part simply because the algorithm is starting further from the main mass of the target. However, when the disparity between algorithms grows towards an order of magnitude, this suggests danger.

For the DARWM and DAMALA, the scaling parameter, $\lambda$ , was chosen so that for the RWM or MALA itself, the acceptance rate was a little larger than the theoretical optimum values of approximately $23\%$ and $57\%$ respectively. DATMALA used the same scaling as DAMALA and a truncation value such that when TMALA explored the true posterior, fewer than $4\%$ of the gradients were truncated.

Scenario i ( $\beta=\gamma=2$ , $\kappa=1/2$ and $\kappa=2$ ). The results appear in the top-left of Figure 1 and demonstrate the undesirable behaviour when the target and the approximation are both Gaussian but the approximation has lighter tails than the target (see Example 1), and the reasonable behaviour when the approximation’s tails are less tight than the target’s (Example 2).

Scenario ii ( $\beta=\gamma=1$ , $\kappa=1/2$ and $\kappa=2$ ). The results, in the top-right of Figure 1, demonstrate that, in alignment with Example 3, the worst behaviour by some margin is exhibited by the only non-variance bounding algorithm: the independence sampler where $\hat{\pi}$ uses a smaller scaling than $\pi$ has. In particular, aligning with Example 4, the DARWM that uses the same $\hat{\pi}$ as the poor independence sampler performs only marginally worse than the DARWM which uses the notionally ‘safer’ $\hat{\pi}$ .

Scenario iii ( $\beta=1.5$ , $\kappa=1$ , $\gamma=1.2$ and $\gamma=1.8$ ). This corresponds to Examples 5, 6 and 9 and is consistent with the DARWM and DATMALA, but not DAMALA, being variance bounding when $1<\gamma<\beta$ , and none being variance bounding when $\gamma>\beta$ .

Scenario iv ( $\beta=1.5$ , $\kappa=1$ , $\gamma=1.2$ and $\gamma=1.8$ , proposal uses $\nabla\log\hat{\pi}$ ). The results suggest that, as with Examples 7 and 8, DAMALA and DATMALA are both variance bounding when $\gamma<\beta$ and, following Example 9, neither is variance bounding when $\gamma>\beta$ .

In scenarios (i), (iii) and (iv) the target itself has lighter-than-exponential tails, so even though the x-axis is linear in $\log p_{0}$ it is sublinear in the magnitude of the initial value, $||x_{0}||$ . Hence, issues with the algorithms might be expected to appear more slowly as $-\log p_{0}$ increases than they do with scenario (ii). Whilst exceptionally poor behaviour is unlikely to be seen, therefore, during a typical run that has been started from the main posterior mass, it could easily occur as a result of a poor starting value.

6 Discussion

Delayed acceptance Metropolis-Hastings algorithms are popular when the posterior is computationally intensive to evaluate yet a cheap approximation is available. Approximations can arise through many mechanisms, including the coarsening of a numerical-integration grid, subsampling from big data, Gaussian process approximation and nearest neighbour averaging. To date, with the exception of Franks and Vihola (2020) and a note in Banterle et al. (2019), little consideration has been given to the properties of the resulting algorithm and, in particular, to whether the delayed-acceptance algorithm might inherit good properties, such as variance bounding, from its parent Metropolis-Hastings algorithm. From the MCMC output, one might reasonably hope to be able to estimate any quantity with a finite variance under $\pi$ and be confident that the Monte Carlo error would reduce in inverse proportion to the square-root of the run length; however, if the algorithm is not variance bounding then this may not be the case.

We have investigated the inheritance of the variance bounding property and provided sufficient conditions for it to occur. A general rule of thumb for algorithms with uniformly local (see Definition 2) proposals, such as the random walk Metropolis and the truncated MALA, is that the approximation should have heavier tails (see Definition 4) than the target; however, this is not always necessary (see Example 4). The MALA algorithm does not enjoy the same good properties as the truncated MALA and, in particular, does not necessarily inherit variance bounding even when the approximation does have heavier tails than the target (see Example 6).

A note of caution is also in order: variance bounding (and/or geometric ergodicity) are helpful properties as, in particular, they guarantee the existence of a usual central limit theorem for ergodic averages. However, whilst non-zero, the conductance of a kernel could be exceedingly small (or the geometric rate of convergence exceptionally close to one) so that the algorithm might not be useful in practice. Thus, whilst we recommend following the advice in this article when choosing the approximation so as to reduce the chance of false confidence in the resulting Monte Carlo estimates, one should also continue to check other diagnostics, such as trace plots, and to vary any tuning parameters to optimise performance.

Acknowledgements: This paper was motivated by initial conversations with Alexandre Thiery about the preservation, and lack of preservation, of geometric ergodicity for delayed-acceptance kernels.

Data availability: Data sharing is not applicable to this article as no new data were created or analysed in this study.

Appendix A Proofs of results

A.1 Proofs of results in Section 3

A.1.1 Proof of Lemma 1

Since $\epsilon<\kappa_{A}$ we may define $\beta\in(0,1)$ such that $\epsilon=(1-\beta)\kappa_{A}$ . For any $\mathcal{A}\in\mathcal{F}$ ,

[TABLE]

Integrating both sides over $x\in\mathcal{A}$ with respect to $\pi$ gives

[TABLE]

The result follows since only sets with $\pi(\mathcal{A})>0$ are relevant. $\square$

A.1.2 Proof of Proposition 2

Let $\nu(x):=\frac{1}{2}\lambda^{2}\nabla\log\pi(x)$ .

(A) (i) $\left|\left|{Y-x}\right|\right|=\left|\left|{\nu(x)+\lambda Z}\right|\right|\leq\frac{1}{2}\lambda^{2}D+\lambda\left|\left|{Z}\right|\right|$ , where $Z$ is a vector with iid $\mathsf{N}(0,1)$ components. So $\mathbb{P}\left({\left|\left|{Y-x}\right|\right|>r}\right)\leq\mathbb{P}\left({\lambda\left|\left|{Z}\right|\right|+\frac{1}{2}\lambda^{2}D>r}\right)\rightarrow 0$ as $r\rightarrow\infty$ .

(ii) Algebra shows that

[TABLE]

Since $\left|\left|{\nu(x)}\right|\right|\leq\lambda^{2}D/2$ , $|\Delta(x,y;q)|\leq h(\left|\left|{y-x}\right|\right|):=D\left|\left|{y-x}\right|\right|+\lambda^{2}D^{2}/8$ , as required.

(B) Let $Z_{*}=Z\cdot\nabla\log\pi(x)/\left|\left|{\nabla\log\pi(x)}\right|\right|\sim N(0,1)$ . For any $D>0$ we may find $\mathcal{A}_{D}\in\mathcal{F}$ with $\pi(\mathcal{A}_{D})>0$ and $\left|\left|{\nabla\log\pi(x)}\right|\right|\geq D$ for all $x\in\mathcal{A}_{D}$ . Hence, for $x\in\mathcal{A}_{D}$ ,

[TABLE]

So for any $r>0$ , $\mathbb{P}\left({\left|\left|{Y-x}\right|\right|\geq r}\right)\geq\mathbb{P}\left({|Z_{*}|\leq\frac{1}{2}\lambda D-r/\lambda}\right)$ , which can be made as close to $1$ as desired by taking $D$ to be sufficiently large. $\square$

A.1.3 Proof of Theorem 2

Since $\kappa_{A}>0$ and $q_{A}$ is uniformly local, we may take $\mathcal{D}(x)=\overline{B}(x,r)$ , the closure of $B(x,r)$ , where $r$ is chosen so as to satisfy (9) for $\pi$ -almost all $x$ , and with $\epsilon=\kappa_{A}/2$ .

Next, let $t=0\wedge(c+b)-0\wedge(c+a)$ . Since $0\wedge(c+a)$ is upper bounded by both $0$ and $c+a$ , if $0<c+b$ then $t\geq 0$ , and if $c+b<0$ then $t\geq b-a$ . Thus $t\geq 0\wedge(b-a)$ . Hence for $y\in\mathcal{D}(x)$ ,

[TABLE]

The first term is bounded on $\mathcal{D}(x)$ since $\log q_{B}(x,y)-\log q_{A}(x,y)$ is continuous, and so (8) holds and we may apply Lemma 1. Repeat with $A\leftrightarrow B$ . $\square$
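The elementary inequality $t\geq 0\wedge(b-a)$ used in the proof above can be checked by brute force. The sketch below is a sanity check on random integer triples rather than a proof; the function names are hypothetical.

```python
import random

def t_value(a, b, c):
    # t = min(0, c + b) - min(0, c + a).
    return min(0, c + b) - min(0, c + a)

def inequality_holds(n=100000, seed=0):
    # Integer inputs keep the arithmetic exact, so no tolerance is needed.
    rng = random.Random(seed)
    for _ in range(n):
        a, b, c = (rng.randint(-1000, 1000) for _ in range(3))
        if t_value(a, b, c) < min(0, b - a):
            return False
    return True
```

The case analysis in the proof covers the two branches $c+b>0$ and $c+b\leq 0$ ; the random search exercises both.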

A.2 Proofs of results in Section 2

A.2.1 Proof of Proposition 1

Since $\mathsf{P}$ is geometrically ergodic, it must have a left spectral gap. From (5),

[TABLE]

where the Dirichlet form for the functional $f$ of the Markov chain is

[TABLE]

for a propose-accept-reject chain.

Since $\tilde{\alpha}(x,y)\leq\alpha(x,y)$ , $\mathcal{E}_{\tilde{\mathsf{P}}}(f)\leq\mathcal{E}_{\mathsf{P}}(f)$ . So if $\mathsf{P}$ is geometrically ergodic, then $\mbox{Gap}_{L}(\tilde{\mathsf{P}})\geq\mbox{Gap}_{L}(\mathsf{P})>0$ . The result follows as all geometrically ergodic kernels are variance bounding and the only kernels which are variance bounding but not geometrically ergodic have no left-spectral gap, yet we have just shown that $\tilde{\mathsf{P}}$ must have a left spectral gap because $\mathsf{P}$ is geometrically ergodic. $\square$

A.2.2 Proof of Example 3

Since $\alpha(x,y)=1$ , the MH algorithm produces iid samples from $\pi$ and so it is geometrically ergodic (with a spectral gap of $1$ ) and, hence, variance bounding. For the DA algorithm,

[TABLE]

For any $r>0$ let

[TABLE]

For $(x,y)\in\mathcal{A}_{r}\times\mathcal{B}_{r}$ , $\tilde{\alpha}(x,y)\leq e^{-|k-1|r}$ , whilst for $(x,y)\in\mathcal{A}_{r}\times\mathcal{C}_{r}$ , $\tilde{\alpha}(x,y)\leq 1$ . Also, $\int_{\mathcal{B}_{r}}q(x,y)\mbox{d}y\leq 1$ , whilst $\int_{\mathcal{C}_{r}}q(x,y)\mbox{d}y=e^{-kr}-e^{-2kr}\leq e^{-kr}$ . Therefore, for $r>\log 2$ (so $\pi(\mathcal{A}_{r})<1/2$ ) the flow out of $\mathcal{A}_{r}$ satisfies

[TABLE]

So, for any $\epsilon>0$ there exists $r$ such that $\kappa_{DA}(\mathcal{A}_{r})<\epsilon$ , and the conductance of the chain is therefore $0$ ; the chain is not variance bounding. The lack of geometric ergodicity follows from Proposition 1. $\square$

A.3 Proofs of results in Section 4

A.3.1 Shorthand for delayed-acceptance kernels

The following short-hand is used through the remainder of this section.

[TABLE]

A.3.2 Proof of Theorem 3

For any $\epsilon>0$ , by (9), choose $r_{\epsilon}$ such that $\int_{B(x,r_{\epsilon})^{c}}q(x,y)\mbox{d}y\leq\epsilon$ ; set $\mathcal{D}(x):=B(x,r_{\epsilon})$ so (14) holds.

The ‘heavier-tail’ condition (15) is equivalent to $b_{1}(x,y)+\Delta(x,y;q)\geq 0\Rightarrow b_{2}(x,y)\geq 0$ . Applying the identity $[b_{1}(y,x),b_{2}(y,x),\Delta(y,x;q)]=-[b_{1}(x,y),b_{2}(x,y),\Delta(x,y;q)]$ and relabelling $x$ and $y$ then gives $b_{1}(x,y)+\Delta(x,y;q)\leq 0\Rightarrow b_{2}(x,y)\leq 0$ .

Next, suppose that $\left|\left|{x}\right|\right|>r_{*}$ , $\left|\left|{y}\right|\right|>r_{*}$ and $y\in\mathcal{D}(x)$ but $y\notin\mathcal{C}(x)$ , so that $b_{1}(x,y)$ and $b_{2}(x,y)$ have opposite signs. By (17) $|\Delta(x,y;q)|\leq h(r_{\epsilon})$ , and the implications derived in the previous paragraph then imply that both $|b_{1}(x,y)|\leq h(r_{\epsilon})$ and $|b_{2}(x,y)|\leq h(r_{\epsilon})$ . Thus $\mathcal{D}(x)\cap\mathcal{M}_{h(r_{\epsilon})}(x)\subseteq\mathcal{C}(x)$ .

Finally, let $\overline{B}(x,r)$ be the closure of $B(x,r)$ , let $\mathcal{D}:=\overline{B}(0,r_{\epsilon}+r_{*})\times\overline{B}(0,2r_{\epsilon}+r_{*})$ and let $m_{*}:=\sup_{(x,y)\in\mathcal{D}}|b_{2}(x,y)|$ ; since $b_{2}(x,y)$ is continuous and $\mathcal{D}$ is compact, $m_{*}<\infty$ . For $x\in\overline{B}(0,r_{\epsilon}+r_{*})$ and $y\in\mathcal{D}(x)$ , $(x,y)\in\mathcal{D}$ and so $|b_{2}(x,y)|\leq m_{*}$ . For $x\in\overline{B}(0,r_{\epsilon}+r_{*})^{c}$ and $y\in\mathcal{D}(x)$ , $\left|\left|{x}\right|\right|>r_{*}$ and $\left|\left|{y}\right|\right|>r_{*}$ and, from the previous paragraph, $\mathcal{D}(x)\cap\mathcal{M}_{h(r_{\epsilon})}(x)\subseteq\mathcal{C}(x)$ . Hence (13) holds with $m=\max(h(r_{\epsilon}),m_{*})$ , and the result follows from the proof of Corollary 1. $\square$

A.3.3 Proof of Proposition 3

Let $q(x,y)$ and $q_{\rm RWM}(x,y)$ be the proposal densities for the Metropolis-Hastings and RWM algorithms, respectively, let $\alpha(x,y)$ and $\alpha_{\rm RWM}(x,y)$ be the corresponding acceptance probabilities, and let $v_{*}:=\mbox{ess sup}~{}\left|\left|{v(x)}\right|\right|$ .

Firstly, if $\int_{B(x,r)^{c}}q_{\rm RWM}(x,y)\mbox{d}y<\epsilon$ then $\int_{B(x,r+v_{*})^{c}}q(x,y)\mbox{d}y<\epsilon$ ; since $q_{\rm RWM}$ is uniformly local, so, therefore, is $q$ . Next, algebra shows that

[TABLE]

Now consider $y\in B(x,r)$ and apply the triangle inequality to obtain,

[TABLE]

Thus $q(x,y)\geq e^{-m(r)}q_{\rm RWM}(x,y)$ and $q_{\rm RWM}(x,y)\geq e^{-m(r)}q(x,y)$ . Also $q(x,y)\alpha(x,y)=q(x,y)\wedge[q(y,x)\pi(y)/\pi(x)]\geq e^{-m(r)}q_{\rm RWM}(x,y)\alpha_{\rm RWM}(x,y)$ and, similarly, $q_{\rm RWM}(x,y)\alpha_{\rm RWM}(x,y)\geq e^{-m(r)}q(x,y)\alpha(x,y)$ . Both implications then follow from Theorem 1 with $\mathcal{D}(x)=B(x,r)$ and with $r$ chosen so that both $\int_{B(x,r)^{c}}q_{\rm RWM}(x,y)\mbox{d}y\leq\epsilon$ and $\int_{B(x,r)^{c}}q(x,y)\mbox{d}y\leq\epsilon$ . $\square$

A.3.4 Proof of Example 7

Let $\mathsf{P}_{\rm RWM}$ be the RWM kernel using a Gaussian proposal and targeting $\pi$ , as in Proposition 3. Applying Proposition 3 twice shows that $\mathsf{P}_{\rm TMALA}$ is variance bounding if and only if $\mathsf{P}_{\rm RWM}$ is variance bounding, which occurs if and only if $\mathsf{P}_{\rm hyp}$ is variance bounding. The sufficiency of (i) then arises directly from Corollary 3. For (ii), by Proposition 2 A(ii), applied to the proposal $\hat{q}$ , we may use Theorem 3. Geometric ergodicity then follows from Proposition 1. $\square$

A.3.5 Proofs of Examples 6 and 8

Let the proposal be $Y=x-\frac{1}{2}\lambda^{2}\xi x^{\xi-1}+\lambda Z$ , where $Z\sim\mathsf{N}(0,1)$ . Example 6 uses $\xi=\beta$ and Example 8 uses $\xi=\gamma$ . Now $\nu(x)=-\xi\lambda^{2}x^{\xi-1}/2$ , so, from (21),

[TABLE]

Also $Y^{k}=x^{k}\left(1-\frac{1}{2}\lambda^{2}\xi x^{\xi-2}+\lambda Z/x\right)^{k}$ . Hence, for $k>0$ ,

[TABLE]

as $x\rightarrow\infty$, where here and throughout this proof $\stackrel{{\scriptstyle p}}{{\rightarrow}}$ indicates convergence in probability. Now

[TABLE]

However, if $\xi<2$ , $x^{3\xi-4}/x^{2\xi-2}\rightarrow 0$ as $x\rightarrow\infty$ , so, for $1\leq\xi<2$ ,

[TABLE]

as $x\rightarrow\infty$ . Finally,

[TABLE]

Example 6 ($\xi=\beta$). Consider the behaviour of $b_{1}(x,Y)$ and $b_{2}(x,Y)$ as $x\rightarrow\infty$. If $1\leq\gamma<\beta$, $b_{1}$ is dominated by $x^{2\beta-2}T_{x,\beta}$ and so $b_{1}\stackrel{{\scriptstyle p}}{{\rightarrow}}-\infty$. If $1\leq\beta<\gamma$, $b_{1}$ is dominated by $x^{\beta+\gamma-2}V_{x,\gamma}$ and $b_{2}$ is dominated by $-x^{\beta+\gamma-2}V_{x,\gamma}$; thus, when $\beta>1$, $b_{2}\stackrel{{\scriptstyle p}}{{\rightarrow}}-\infty$, and when $\beta=1$, either $b_{1}\stackrel{{\scriptstyle p}}{{\rightarrow}}-\infty$ or $b_{2}\stackrel{{\scriptstyle p}}{{\rightarrow}}-\infty$, depending on the value of $Z$. In either case, $\tilde{\alpha}(x,Y)\stackrel{{\scriptstyle p}}{{\rightarrow}}0$ as $x\rightarrow\infty$. Given $\epsilon>0$, we choose $x_{*}$ such that, for all $x>x_{*}$, $\mathbb{P}\left({\tilde{\alpha}(x,Y)>\epsilon}\right)<\epsilon$, and set $\mathcal{A}_{x}:=[x,\infty)$, so that $\kappa(\mathcal{A}_{x_{*}})<2\epsilon$. But $\epsilon$ can be made as small as desired, so $\kappa_{DA}=0$.
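The vanishing acceptance probability in Example 6 can be illustrated numerically. The following is a minimal Monte Carlo sketch, not the authors' code. It assumes $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ and $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$ as in Table 1, and it assumes the usual two-stage delayed-acceptance form $\tilde{\alpha}=[1\wedge r_{1}][1\wedge r_{2}]$ with $r_{1}=\hat{\pi}(y)q(y,x)/[\hat{\pi}(x)q(x,y)]$ and $r_{2}=\pi(y)\hat{\pi}(x)/[\pi(x)\hat{\pi}(y)]$. The function names, the default $\lambda=1$ and the sample size are hypothetical choices made for illustration.

```python
import numpy as np


def log_q(x, y, xi, lam):
    """Log density (up to a constant) of Y = x - 0.5*lam^2*xi*x^(xi-1) + lam*Z at y."""
    mean = x - 0.5 * lam**2 * xi * x ** (xi - 1.0)
    return -((y - mean) ** 2) / (2.0 * lam**2)


def mean_da_acceptance(x, beta, gamma, xi, lam=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of E[alpha_tilde(x, Y)] for a fixed current state x > 0.

    Assumed (not taken from this section): alpha_tilde = min(1, r1) * min(1, r2),
    with r1 the first-stage ratio based on pi_hat and r2 the second-stage correction.
    """
    rng = np.random.default_rng(seed)
    y = x - 0.5 * lam**2 * xi * x ** (xi - 1.0) + lam * rng.standard_normal(n)
    acc = np.zeros(n)
    pos = y > 0.0  # proposals outside the support are rejected
    yp = y[pos]
    # log r1 = log pi_hat(y) - log pi_hat(x) + log q(y, x) - log q(x, y)
    log_r1 = (x**gamma - yp**gamma) + log_q(yp, x, xi, lam) - log_q(x, yp, xi, lam)
    # log r2 = [log pi(y) - log pi(x)] - [log pi_hat(y) - log pi_hat(x)]
    log_r2 = (x**beta - yp**beta) - (x**gamma - yp**gamma)
    acc[pos] = np.exp(np.minimum(0.0, log_r1) + np.minimum(0.0, log_r2))
    return float(acc.mean())


if __name__ == "__main__":
    # Example 6 setting: proposal built from pi (xi = beta), with 1 <= gamma < beta.
    for x in (2.0, 10.0, 50.0, 100.0):
        print(x, mean_da_acceptance(x, beta=1.5, gamma=1.0, xi=1.5))
```

Under these assumptions the estimated mean acceptance probability falls rapidly as the current state $x$ moves into the tail, which is the behaviour the argument above turns into $\kappa_{DA}=0$.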

Example 8 ( $\xi=\gamma$ ). When $1\leq\beta<\gamma$ , as $x\rightarrow\infty$ , $b_{2}$ is dominated by $-x^{2\gamma-2}V_{x,\gamma}\stackrel{{\scriptstyle p}}{{\rightarrow}}-\infty$ so $\kappa_{DAb}=0$ , by an analogous argument to that for Example 6.

If $1\leq\gamma<\beta$ then we note that the proof of geometric ergodicity of the MALA algorithm in Theorem 4.1 of Roberts and Tweedie (1996a) applies to any algorithm with a proposal of the form $q(x,y)\propto\exp[-|y-[x+\nu(x)]|^{2}/2\lambda^{2}]$. For an irreducible and aperiodic kernel with a continuous proposal density, such as the one under consideration, geometric ergodicity is therefore guaranteed provided the following two conditions are satisfied:

[TABLE]

where $\mathsf{I}(x):=\{y:|y|\leq|x|\}$ is the interior and $\mathsf{R}(x):=\{y:\alpha(x,y)<1\}$ is the region where a rejection is possible. Now, $x+\nu(x)=x-\gamma\lambda^{2}x^{\gamma-1}/2$, so $\gamma\geq 1\Rightarrow\eta>0$. We will show that if $\gamma\in[1,2)$, then for $x>1$ and $y\in\mathsf{I}(x)$, both $b_{1}(x,y)\geq 0$ and $b_{2}(x,y)\geq 0$, so that $\tilde{\alpha}_{b}(x,y)=1$ and hence $\mathsf{R}(x)\cap\mathsf{I}(x)$ is empty.

Now $b_{2}(x,y)=x^{\gamma}(x^{\beta-\gamma}-1)-y^{\gamma}(y^{\beta-\gamma}-1)$, so if $x\geq y$ and $x\geq 1$ then $b_{2}(x,y)\geq 0$. Further, from (24),

[TABLE]

The final term is non-negative when $x\geq y$ . Directly from the concavity of $f(t)=\gamma t^{\gamma-1}$ , we obtain

[TABLE]

so the sum of the first two terms is also non-negative when $x\geq y$ . Hence, for $x\geq 1$ , $\mathsf{I}(x)\cap\mathsf{R}(x)$ is empty, as claimed. $\square$
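The non-negativity claims in the last part of this proof can be checked numerically on a grid. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes that $b_{1}$ and $b_{2}$ are the logarithms of the first- and second-stage delayed-acceptance ratios, that $\pi(x)\propto e^{-x^{\beta}}\mathds{1}(x>0)$ and $\hat{\pi}(x)\propto e^{-x^{\gamma}}\mathds{1}(x>0)$, and that the proposal is the one stated at the start of this proof with $\xi=\gamma$. The function names, grid sizes and the default $\lambda=1$ are hypothetical.

```python
import numpy as np


def log_q(x, y, gamma, lam):
    """Log density (up to a constant) of Y = x - 0.5*lam^2*gamma*x^(gamma-1) + lam*Z at y."""
    mean = x - 0.5 * lam**2 * gamma * x ** (gamma - 1.0)
    return -((y - mean) ** 2) / (2.0 * lam**2)


def stage_log_ratios(x, y, beta, gamma, lam):
    """Assumed forms of b1 and b2: log first-stage and log second-stage ratios."""
    b1 = (x**gamma - y**gamma) + log_q(y, x, gamma, lam) - log_q(x, y, gamma, lam)
    b2 = (x**beta - x**gamma) - (y**beta - y**gamma)
    return b1, b2


def min_log_ratios(beta, gamma, lam=1.0, x_max=20.0, n=200):
    """Smallest b1 and b2 found over a grid of x in [1, x_max] and y in (0, x]."""
    worst_b1, worst_b2 = np.inf, np.inf
    for x in np.linspace(1.0, x_max, n):
        ys = np.linspace(1e-6, x, n)  # the part of I(x) inside the support
        b1, b2 = stage_log_ratios(x, ys, beta, gamma, lam)
        worst_b1 = min(worst_b1, float(b1.min()))
        worst_b2 = min(worst_b2, float(b2.min()))
    return worst_b1, worst_b2


if __name__ == "__main__":
    for beta, gamma in [(1.5, 1.0), (2.0, 1.3), (2.5, 1.95)]:
        print(beta, gamma, min_log_ratios(beta, gamma))
```

A grid search of this kind cannot replace the concavity argument, but a clearly negative minimum would signal a sign or algebra slip in the assumed forms of $b_{1}$ and $b_{2}$.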

A.3.6 Proof of Example 9

Consider any proposal of the form $Y=x+\lambda^{2}\nu(x)+\lambda Z$ , where $Z\sim\mathsf{N}(0,1)$ and $|\nu(x)|\leq\nu_{*}$ for all $x$ . Firstly,

[TABLE]

Secondly, as $x\rightarrow\infty$ ,

[TABLE]

Also, for any $\delta>0$ ,

[TABLE]

By the mean value theorem, $\log\hat{\pi}(Y)-\log\hat{\pi}(x)=\eta^{\gamma-1}(Y-x)$ for some $\eta\geq\min(x,Y)$. Given any $\epsilon>0$, set $\delta=\sqrt{2\pi}\epsilon/4$ and choose $x_{*}$ such that $\mathbb{P}\left({Y>x/2}\right)>1-\epsilon/2$ for all $x>x_{*}$. Then with probability at most $\epsilon$, $|\log\hat{\pi}(Y)-\log\hat{\pi}(x)|<(x/2)^{\gamma-1}\{|Y-x|\vee\delta\}$. Next,

[TABLE]

Hence, $|\log\hat{\pi}(Y)-\log\hat{\pi}(x)|\rightarrow\infty$ and dominates $\Delta(x,Y)$ in probability as $x\rightarrow\infty$ ; i.e., $\log r_{1}(x,Y)\sim\log\hat{\pi}(Y)-\log\hat{\pi}(x)$ and $|\log r_{1}(x,Y)|\rightarrow\infty$ in probability as $x\rightarrow\infty$ .

After some algebra,

[TABLE]

Since $\gamma>\beta$ , as $x$ (and hence, $Y$ ), becomes large, we have

[TABLE]

in probability as $x\rightarrow\infty$ .

Combining these two ideas, $0\wedge\log r_{1}(x,Y)+0\wedge\log r_{2}(x,Y)\rightarrow-\infty$ in probability. Thus $\mathsf{ess}~{}\sup_{x}P(x,\{x\})=1$ and the algorithm cannot be geometrically ergodic by Theorem 5.1 of Roberts and Tweedie (1996b); by Proposition 1 it also cannot be variance bounding.

Bibliography


1. Banterle, M., Grazian, C., Lee, A., and Robert, C. P. (2019). Accelerating Metropolis-Hastings algorithms by delayed acceptance. Foundations of Data Science, 1(2639-8001-2019-2-103):103.
2. Besag, J. (1994). In discussion of ‘Representations of knowledge in complex systems’ by U. Grenander and M. Miller. J. Roy. Stat. Soc. Ser. B, 56:591–592.
3. Christen, J. A. and Fox, C. (2005). Markov chain Monte Carlo using an approximation. J. Comp. Graph. Stat., 14(4):795–810.
4. Cui, T., Fox, C., and O’Sullivan, M. (2011). Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis-Hastings algorithm. Water Resources Research, 47(10).
5. Franks, J. and Vihola, M. (2020). Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Stochastic Processes and their Applications. Early availability online.
6. Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 7(4):473–483.
7. Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L., editors, Handbook of Markov chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, pages 3–48. CRC Press, Boca Raton, FL.
8. Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in practice. Chapman and Hall, London, UK.
