Variance bounding of delayed-acceptance kernels
Chris Sherlock, Anthony Lee

TL;DR
This paper investigates conditions under which delayed-acceptance Metropolis-Hastings algorithms inherit variance bounding properties from their parent kernels, improving computational efficiency in Bayesian inference.
Contribution
It provides sufficient conditions for delayed-acceptance kernels to inherit variance bounding, enhancing understanding of their efficiency in computationally expensive Bayesian inference.
Findings
Delayed-acceptance kernels can be variance bounding under certain conditions.
Bounded discrepancy between approximate and true log densities ensures inheritance.
Sufficient conditions for proposal pairs to preserve variance bounding property.
Chris Sherlock¹ and Anthony Lee²
¹ Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK; ORCID: 0000-0002-2429-3157.
² School of Mathematics, Fry Building, University of Bristol, Bristol, BS8 1UG, UK; ORCID: 0000-0001-7765-0616.
Abstract
A delayed-acceptance version of a Metropolis–Hastings algorithm can be useful for Bayesian inference when it is computationally expensive to calculate the true posterior, but a computationally cheap approximation is available; the delayed-acceptance kernel targets the same posterior as its associated “parent” Metropolis-Hastings kernel. Although the asymptotic variance of the ergodic average of any functional of the delayed-acceptance chain cannot be less than that obtained using its parent, the average computational time per iteration can be much smaller and so for a given computational budget the delayed-acceptance kernel can be more efficient.
When the asymptotic variance of the ergodic averages of all functionals of the chain is finite, the kernel is said to be variance bounding. It has recently been noted that a delayed-acceptance kernel need not be variance bounding even when its parent is. We provide sufficient conditions for inheritance: for non-local algorithms, such as the independence sampler, the discrepancy between the log density of the approximation and that of the truth should be bounded; for local algorithms, two alternative sets of conditions are provided.
As a by-product of our initial, general result we also supply sufficient conditions on any pair of proposals such that, for any shared target distribution, if a Metropolis-Hastings kernel using one of the proposals is variance bounding then so is the Metropolis-Hastings kernel using the other proposal.
Keywords: Metropolis-Hastings; delayed-acceptance; variance bounding; conductance; geometric ergodicity.
AMS: Primary: 60J10; Secondary: 65C40; 47A10
Declarations: funding - none; conflicts of interest - none; availability of data and material - n/a; code availability - R code to produce the plots in Section 5 is available from https://chrisgsherlock.github.io/Research/publications.html.
1 Introduction
The Metropolis-Hastings (MH) algorithm is widely used to approximately compute expectations with respect to complicated high-dimensional posterior distributions (e.g. Gilks et al., 1996; Geyer, 2011). The algorithm requires that it be possible to evaluate point-wise the density of the distribution of interest (throughout this article, all densities are with respect to Lebesgue measure) up to an arbitrary constant of proportionality.
In many problems the target posterior density is computationally expensive to evaluate. When a computationally-cheap approximation, or surrogate, is available, the delayed-acceptance Metropolis-Hastings (DAMH) algorithm (Liu, 2001; Christen and Fox, 2005; Higdon et al., 2011; also known as the two-stage algorithm, and a special case of the surrogate-transition method) leverages the surrogate to produce a new Markov chain that still targets the original distribution of interest. A first ‘screening’ stage substitutes the surrogate density for the true density in the standard formula for the MH acceptance probability; proposals which fail at this stage are discarded. Only proposals that pass the first stage are considered in the second ‘correction’ stage, where it is necessary to evaluate the true posterior density at the proposed value.
Delayed acceptance (DA) algorithms have been applied in a variety of settings with the approximate density obtained in a variety of different ways, for example: a coarsening of a numerical grid in Bayesian inverse problems (Christen and Fox, 2005; Moulton et al., 2008; Cui et al., 2011), subsampling from big data (Payne and Mallick, 2014; Banterle et al., 2019; Quiroz et al., 2018), a tractable approximation to a stochastic process (Smith, 2011; Golightly et al., 2015), or a direct, nearest-neighbour approximation to the truth using previous values (Sherlock et al., 2017).
For a Markov kernel, $P$, with a stationary distribution of $\pi$, and an associated chain $(X_i)_{i=1}^{\infty}$, the asymptotic variance of any functional, $f$, is defined to be
$$\mathrm{var}(f,P) := \lim_{n\to\infty} n\,\mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} f(X_i)\right],$$
where $X_1\sim\pi$. A lower asymptotic variance is thus associated, in practice, with a greater accuracy in estimating $\mathbb{E}_{\pi}[f(X)]$ using a realisation of length $n$ from the distribution of the chain. In terms of the asymptotic variance of any functional of the chain, the DAMH kernel cannot be more efficient than the parent MH kernel; however the computational cost per iteration is, typically, reduced considerably. The almost-negligible computational cost of the screening stage also, typically, facilitates proposals that have a larger chance of being rejected than the MH proposal, but where the pay-off on acceptance is so much larger that the expected overall movement per unit of time increases. When efficiency is measured in terms of effective samples per second, gains of over an order of magnitude have been reported (e.g. Golightly et al., 2015).
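The asymptotic variance of a functional is rarely available in closed form, so in practice it is estimated from the chain itself. The following is a minimal sketch using the method of batch means, which is one common estimator and not one prescribed by this article; the function names and the default of roughly $\sqrt{n}$ batches are illustrative assumptions.

```python
import numpy as np

def batch_means_asymptotic_variance(fx, n_batches=None):
    """Estimate lim n * Var(mean of f(X_1), ..., f(X_n)) by batch means.

    fx: 1-D array of f(X_i) evaluated along a (roughly stationary) chain.
    n_batches: number of equal-length batches; defaults to about sqrt(n).
    """
    fx = np.asarray(fx, dtype=float)
    n = fx.size
    if n_batches is None:
        n_batches = int(np.sqrt(n))
    batch_len = n // n_batches
    # Drop the remainder so that every batch has the same length.
    trimmed = fx[: n_batches * batch_len]
    means = trimmed.reshape(n_batches, batch_len).mean(axis=1)
    # Var(batch mean) is roughly (asymptotic variance) / batch_len.
    return batch_len * means.var(ddof=1)

def effective_sample_size(fx, n_batches=None):
    """n * Var_pi(f) / (estimated asymptotic variance)."""
    fx = np.asarray(fx, dtype=float)
    sigma2 = batch_means_asymptotic_variance(fx, n_batches)
    return fx.size * fx.var(ddof=1) / sigma2
```

Such estimates are only meaningful when the asymptotic variance is finite; for a kernel that is not variance bounding there are functionals for which the estimator does not settle down as the run length grows.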
A Markov kernel $P$ with a stationary distribution of $\pi$ is termed variance bounding if the asymptotic variance $\mathrm{var}(f,P)<\infty$ for all $f\in L^2(\pi)$, the Hilbert space of functions that are square-integrable with respect to $\pi$. Equivalently there exists $K<\infty$ such that $\mathrm{var}(f,P)\le K\,\mathrm{Var}_{\pi}[f]$ for all such $f$. This property was named and studied in Roberts and Rosenthal (2008), where it was shown to be equivalent to the existence of a ‘usual’ central limit theorem (CLT); that is, a CLT where the limiting variance is the asymptotic variance.
Intuitively, the variance-bounding property embodies desirable behaviour for a chain started at equilibrium. In practice, the chain is not started at equilibrium, but asymptotically the bias that results from this is negligible compared with the variance. An alternative natural requirement is that the chain converge to equilibrium geometrically quickly (rather than, say, polynomially quickly). A Markov chain kernel, $P$, with stationary distribution $\pi$ is geometrically ergodic (e.g. Roberts and Rosenthal, 1997, 2004; Meyn and Tweedie, 1993, Chapter 15) if there exist $\rho\in(0,1)$ and $M:\mathsf{X}\to[0,\infty]$ that is finite $\pi$-almost everywhere, such that
$$\left\|P^{n}(x,\cdot)-\pi(\cdot)\right\|_{TV}\le M(x)\,\rho^{n}$$
for all $x\in\mathsf{X}$ and $n\in\mathbb{N}$, where $P^{n}$ denotes the $n$-step transition kernel.
Although the motivations behind the definitions of variance bounding and geometric ergodicity, mixing at equilibrium and convergence to equilibrium, are quite different, for a large class of algorithms, including those studied in this article, these two properties are very closely linked as we will describe in Section 2.2. Indeed, for delayed-acceptance algorithms, under weak conditions the two properties are equivalent (see Proposition 1).
Theoretical properties of the efficiency of delayed-acceptance algorithms have been studied in Banterle et al. (2019), Sherlock et al. (2021) and Franks and Vihola (2020). The first contribution from Banterle et al. (2019) is an example delayed-acceptance algorithm which fails to inherit geometric ergodicity from its parent Metropolis-Hastings algorithm (see Example 1 in Section 2.3 of this article); a simple sufficient condition for inheritance of geometric ergodicity, uniformly good behaviour of the ratios $r_1$ and $r_2$ that we define in (3), is also supplied. Finally, an idealised setting where the cheap approximation is perfectly accurate is explored to obtain tuning guidelines for in the delayed-acceptance random walk Metropolis algorithm. Sherlock et al. (2021) examines this tuning issue further, proving a limiting diffusion for the first component of the delayed-acceptance Markov chain, and providing robust tuning guidelines that account for the error in the cheap approximation; the article then extends these guidelines to the pseudo-marginal version of the algorithm. Finally, Franks and Vihola (2020) compares the asymptotic variance of a general pseudo-marginal delayed-acceptance algorithm with the variance of an algorithm that applies importance-sampling to the output of an MCMC algorithm targeting the cheap approximation directly.
Using our Proposition 1, the lack of inheritance of geometric ergodicity in the example in Banterle et al. (2019) is equivalent to a lack of inheritance of the variance bounding property: even though the asymptotic variance using the parent MH kernel is finite for all , there exist for which the asymptotic variance using the DA kernel is infinite. For such , estimated quantities such as effective sample size (e.g. Hoff, 2009) are invalid and, consequently, standard CLT-based intuitions about the sizes of typical errors in estimates of from the chain do not hold.
We investigate the conditions under which a DAMH kernel inherits variance bounding from its MH parent and, as a by-product, discover conditions under which two different proposals produce MH kernels that are equivalent in terms of whether or not they are variance bounding. Section 2 provides the background and two motivating examples, while Section 3 provides some key definitions, a general inheritance result applicable to all propose-accept-reject kernels, and sufficient conditions for variance-bounding equivalence between two Metropolis-Hastings proposals. Section 4 contains our results for standard DA algorithms with further illustrative examples, and includes parent MH algorithms where the proposal depends upon the form of the density, so that the proposal for a computationally cheap DA kernel would naturally depend on the surrogate. Numerical experiments are performed in Section 5 and the article concludes with a discussion. All proofs are deferred to Appendix A.
2 Background, notation and motivation
Throughout this article all Markov chains are assumed to be on a statespace $(\mathsf{X},\mathcal{F})$, with $\mathsf{X}\subseteq\mathbb{R}^{d}$ Lebesgue measurable, and $\mathcal{F}$ the $\sigma$-algebra of all Lebesgue-measurable sets in $\mathsf{X}$. The target and surrogate distributions are denoted by $\pi$ and $\hat\pi$, respectively, and they are assumed to have densities of $\pi(x)$ and $\hat\pi(x)$ with respect to Lebesgue measure.
2.1 Metropolis-Hastings and delayed-acceptance kernels
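The Metropolis-Hastings kernel has a proposal density $q(x,y)$ and an acceptance probability $\alpha(x,y)=1\wedge r(x,y)$ where
$$r(x,y) := \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}.$$
With $\overline{\alpha}(x) := \int q(x,y)\,\alpha(x,y)\,\mathrm{d}y$, the Metropolis-Hastings (MH) kernel is then
$$P(x,\mathrm{d}y) := q(x,y)\,\alpha(x,y)\,\mathrm{d}y + \left[1-\overline{\alpha}(x)\right]\delta_{x}(\mathrm{d}y). \tag{2}$$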
An iteration of the corresponding MH algorithm proceeds from a current value, $x$, to the next value, $x'$, as follows. A value $y$ is sampled from the distribution with a density of $q(x,\cdot)$. With a probability of $\alpha(x,y)$, $x'\leftarrow y$, else $x'\leftarrow x$.
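The iteration just described can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the names `mh_step`, `log_pi`, `propose` and `log_q` are hypothetical, densities are handled on the log scale, and the target need only be known up to a constant.

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings iteration from the current value x.

    log_pi(x): log target density, up to an additive constant.
    propose(x, rng): draws a proposal y from the density q(x, .).
    log_q(x, y): log proposal density of y given x (up to a constant
                 that does not depend on x or y).
    """
    y = propose(x, rng)
    # log of pi(y) q(y, x) / (pi(x) q(x, y))
    log_r = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    # Accept with probability min(1, r); otherwise stay at x.
    if np.log(rng.uniform()) < log_r:
        return y
    return x
```

For a symmetric proposal such as a Gaussian random walk, `log_q` may simply return a constant, since the two proposal terms cancel in the ratio.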
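Now, suppose that we have an approximation, $\hat\pi$, to $\pi$. The standard delayed-acceptance kernel uses the same proposal, $q(x,y)$, but has an acceptance probability of $\tilde\alpha(x,y)=[1\wedge r_1(x,y)]\,[1\wedge r_2(x,y)]$, where
$$r_1(x,y) := \frac{\hat\pi(y)\,q(y,x)}{\hat\pi(x)\,q(x,y)} \quad\text{and}\quad r_2(x,y) := \frac{\pi(y)/\hat\pi(y)}{\pi(x)/\hat\pi(x)}. \tag{3}$$
With $\overline{\tilde\alpha}(x) := \int q(x,y)\,\tilde\alpha(x,y)\,\mathrm{d}y$, the delayed-acceptance (DA) kernel is
$$\tilde P(x,\mathrm{d}y) := q(x,y)\,\tilde\alpha(x,y)\,\mathrm{d}y + \left[1-\overline{\tilde\alpha}(x)\right]\delta_{x}(\mathrm{d}y). \tag{4}$$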
An iteration of the corresponding DA algorithm proceeds from a current value, $x$, to a next value, $x'$, as follows.
Stage One:
A value $y$ is sampled from the distribution with a density of $q(x,\cdot)$. With a probability of $1\wedge r_1(x,y)$ the algorithm proceeds to Stage Two, else $x'\leftarrow x$.
Stage Two:
With a probability of $1\wedge r_2(x,y)$, $x'\leftarrow y$, else $x'\leftarrow x$.
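The two stages above can be sketched as follows. This is a minimal illustration under assumptions rather than a reference implementation: `da_step`, `log_pi`, `log_pi_hat`, `propose` and `log_q` are hypothetical names, and the log target at the current value is carried along so that the expensive density is evaluated only for proposals that pass Stage One.

```python
import numpy as np

def da_step(x, log_pi_x, log_pi, log_pi_hat, propose, log_q, rng):
    """One delayed-acceptance iteration from the current value x.

    log_pi_x: cached value of log_pi(x) (the expensive density).
    log_pi_hat: cheap surrogate log density, up to a constant.
    Returns (new_x, log_pi(new_x), expensive_density_was_evaluated).
    """
    y = propose(x, rng)
    # Stage One: screen the proposal using only the cheap surrogate.
    log_r1 = log_pi_hat(y) + log_q(y, x) - log_pi_hat(x) - log_q(x, y)
    if np.log(rng.uniform()) >= log_r1:
        return x, log_pi_x, False
    # Stage Two: correct for the error in the surrogate.
    log_pi_y = log_pi(y)  # the only expensive evaluation
    log_r2 = (log_pi_y - log_pi_hat(y)) - (log_pi_x - log_pi_hat(x))
    if np.log(rng.uniform()) < log_r2:
        return y, log_pi_y, True
    return x, log_pi_x, True
```

Counting how often the third return value is `True` gives the number of expensive evaluations, which is the quantity that the screening stage is designed to reduce.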
Now, $\tilde\alpha(x,y)\le\alpha(x,y)$, and so $\mathrm{var}(f,P)\le\mathrm{var}(f,\tilde P)$ for each $f\in L^2(\pi)$ (Peskun, 1973; Tierney, 1998). At first glance this might suggest that the DA algorithm is never worthwhile; however for any proposal that is rejected at Stage One there is no need to complete the expensive calculation of $\pi(y)$ that is required at every iteration of the MH algorithm and in Stage Two of the DA algorithm. As mentioned in the Introduction, for a fixed computational time, the decreased average computational cost per iteration, and alterations of any tuning parameters to take advantage of this, can lead to a DA algorithm where the variance of an estimator can be over an order of magnitude smaller than that of the MH algorithm.
Since $\mathrm{var}(f,P)\le\mathrm{var}(f,\tilde P)$, if $\tilde P$ is variance bounding then so is $P$; however it is feasible that $P$ may be variance bounding while $\tilde P$ is not.
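The ordering of the two acceptance probabilities that underlies this comparison can be checked in two lines. The notation here is an assumption made for the sketch: $\hat\pi$ denotes the surrogate density, $r$ the usual MH ratio, $r_1$ the Stage One (screening) ratio and $r_2$ the Stage Two (correction) ratio.

```latex
\begin{align*}
r_1(x,y)\,r_2(x,y)
  &= \frac{\hat\pi(y)\,q(y,x)}{\hat\pi(x)\,q(x,y)}
     \cdot\frac{\pi(y)\,\hat\pi(x)}{\pi(x)\,\hat\pi(y)}
   = \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}
   = r(x,y), \\
[1\wedge r_1(x,y)]\,[1\wedge r_2(x,y)]
  &\le 1\wedge\{r_1(x,y)\,r_2(x,y)\}
   = 1\wedge r(x,y).
\end{align*}
```

The inequality holds because, for any $a,b\ge 0$, the product $(1\wedge a)(1\wedge b)$ is bounded above both by $1$ and by $ab$. The delayed-acceptance probability is therefore never larger than the MH acceptance probability for the same pair $(x,y)$.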
2.2 Key terminology, equivalences and implications
The MH and DA kernels are both reversible with respect to the target. A kernel is reversible with respect to a distribution iff for all and , This article utilises a number of existing results for reversible Markov chains on the relationship between variance bounding, conductance, spectral gaps and geometric ergodicity. Here we define conductance and spectral gaps and summarise the relationships between the four properties.
Define the Hilbert space with the inner product , and consider as an operator acting on according to . If is reversible then it is a self adjoint operator on and by the spectral theorem for bounded self-adjoint operators, for each , for some positive measure on . Let
[TABLE]
or, equivalently (e.g. Yosida,, 1980, p320, Theorem 2), that the smallest closed interval containing the support of for all is . The spectral gap of is (e.g. Geyer,, 1992; Roberts and Rosenthal,, 1997), the right spectral gap is and the left spectral gap is . is said to have a spectral gap (or a left or right spectral gap) if its spectral gap (or left or right spectral gap) is non-zero.
For any set with consider the probability of leaving at the next iteration given that the stationary chain is currently in :
[TABLE]
The conductance, for a Markov kernel with invariant measure is then (e.g. Lawler and Sokal, 1988; see also Jerrum and Sinclair, 1988)
[TABLE]
For any reversible Markov chain we have the following relationships:
[TABLE]
These relationships will be used repeatedly in the sequel without further reference.
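Conductance and the right spectral gap are easiest to inspect on a finite statespace, where both can be computed exactly. The sketch below is only an illustration under assumptions: the article works on continuous statespaces, the function names are hypothetical, and the infimum is taken by brute force over all sets with stationary mass at most one half, which is the standard definition of Lawler and Sokal (1988).

```python
import itertools
import numpy as np

def conductance(P, pi):
    """Brute-force conductance of a finite-state kernel P with invariant pi."""
    n = len(pi)
    best = np.inf
    for size in range(1, n):
        for subset in itertools.combinations(range(n), size):
            A = list(subset)
            mass = pi[A].sum()
            if mass <= 0.0 or mass > 0.5 + 1e-12:
                continue
            Ac = [j for j in range(n) if j not in subset]
            # Probability of leaving A in one step when the stationary
            # chain is currently in A.
            flow = (pi[A][:, None] * P[np.ix_(A, Ac)]).sum()
            best = min(best, flow / mass)
    return best

def right_spectral_gap(P, pi):
    """One minus the second-largest eigenvalue of a pi-reversible kernel."""
    d = np.sqrt(pi)
    S = (d[:, None] * P) / d[None, :]  # symmetric when P is pi-reversible
    eig = np.sort(np.linalg.eigvalsh((S + S.T) / 2.0))
    return 1.0 - eig[-2]
```

For a reversible kernel the two quantities satisfy the Cheeger-type bounds $\kappa^2/2 \le \mathrm{Gap}_R \le 2\kappa$, so one is strictly positive exactly when the other is.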
2.3 Example algorithms
To exemplify our theoretical results we will consider four specific, frequently-used MH algorithms.
1. The Metropolis-Hastings independence sampler (MHIS): .
2. The random walk Metropolis (RWM): ; e.g. .
3. The Metropolis-adjusted Langevin algorithm (MALA): .
4. The truncated MALA:
[TABLE]
for some .
In Proposals of type 2, 3 and 4, is often referred to as the scale parameter of the proposal. The MHIS and RWM have been used since the early days of MCMC (e.g. Tierney, 1994); conditions under which they are geometrically ergodic (and, hence, variance bounding) have been well studied; see, for example, Liu (1996) and Mengersen and Tweedie (1996) for the MHIS and Mengersen and Tweedie (1996), Roberts and Tweedie (1996b) and Jarner and Hansen (2000) for the RWM. Essentially, for the MHIS the proposal, , must not have lighter tails than the target, and for the RWM the target must have sufficiently smooth and exponentially decreasing tails. The MALA was introduced in Besag (1994) and was analysed in Roberts and Tweedie (1996a), in which the truncated MALA was also introduced. The MALA can be much more efficient than the RWM in moderate to high dimensions. As with the RWM, for geometric ergodicity the MALA requires exponentially decreasing tails, but if the tails decrease too quickly, grows too quickly and the MALA can fail to be geometrically ergodic. The truncated MALA circumvents this problem.
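The local proposals in this list differ only in their drift, which a short sketch makes concrete. The code below rests on assumptions: Gaussian jumps with a scale parameter here called `lam`, and a truncated drift of the form $D\,\nabla\log\pi(x)/\max(D,\|\nabla\log\pi(x)\|)$ as in Roberts and Tweedie (1996a); the displayed definition (6) should be taken as authoritative, and all function names are hypothetical.

```python
import numpy as np

def truncated_gradient(grad, D):
    """Rescale grad so that its Euclidean norm never exceeds D."""
    norm = np.linalg.norm(grad)
    return grad * D / max(D, norm)

def rwm_proposal(x, lam, rng):
    """Random walk Metropolis: a Gaussian jump centred on x."""
    return x + lam * rng.standard_normal(x.shape)

def mala_proposal(x, grad_log_pi, lam, rng, D=None):
    """MALA proposal; passing a finite D gives the truncated MALA."""
    g = grad_log_pi(x)
    if D is not None:
        g = truncated_gradient(g, D)
    # Drift of (lam^2 / 2) * gradient, plus Gaussian noise of scale lam.
    return x + 0.5 * lam**2 * g + lam * rng.standard_normal(x.shape)
```

With truncation the drift has norm at most `0.5 * lam**2 * D` wherever the chain is, which is why the truncated proposal cannot be thrown far from the current value by a rapidly growing gradient.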
In Banterle et al. (2019) it is shown that the geometric ergodicity of an RWM algorithm need not be inherited by the resulting DA algorithm.
Example 1.
(Banterle et al., 2019) Let with and . If , with then is not geometrically ergodic.
The following conditional equivalence (proved in Section A.2) is used throughout the sequel. If the parent kernel is geometrically ergodic then the DA kernel must have a left spectral gap, and with this constraint geometric ergodicity and variance bounding are equivalent.
Proposition 1.
Let be a MH kernel targeting as specified in (2). Let be the DA kernel derived from this through the approximation as in (4). If is geometrically ergodic then
[TABLE]
The original random walk Metropolis algorithm on is geometrically ergodic (Mengersen and Tweedie, 1996), and hence variance bounding, so the DA kernel in Example 1 has not inherited its parent’s desirable properties. As a direct corollary of our Theorem 3 (see Section 4.2) we find that is exactly the right condition in this case:
Example 2.
Let with and . If , with then is variance bounding and geometrically ergodic.
Examples 1 and 2 suggest an intuition that problems may arise when has lighter tails than . As we shall see, this is a part of the story; however, in general, heavier tails are not sufficient to guarantee inheritance of the variance bounding property, and for a class of algorithms where heavy tails are sufficient, lighter tails can also be sufficient provided they are not too much lighter, in a sense we make precise.
3 Variance bounding: inheritance and equivalence
Throughout this section we use the following generic formulation for two Markov kernels.
Definition 1.
Let and be propose-accept-reject Markov kernels both targeting a distribution , and using, respectively, proposal densities of and and acceptance probabilities of and .
Theorem 1, below, follows from Lemma 1, which is proved in Section A.1.1. It generalises Corollary 12 of Roberts and Rosenthal (2008) to allow for different acceptance probabilities and, more importantly, removes the need for a fixed, uniform minorisation condition. The minorisation needs only hold in a region such that under there is “unlikely” to be an accepted proposal in .
Lemma 1.
Let , , , , and be as in Definition 1, and let the conductances of and be and respectively. If and there is an and a such that for -almost all , there is a region such that
[TABLE]
and
[TABLE]
then .
If is variance bounding, ; choose an and for each a corresponding so as to satisfy (7) and (8) to obtain:
Theorem 1.
Let , , , , and be as in Definition 1. If is variance bounding and for any there is a such that for -almost all there is a region such that (7) and (8) hold, then is also variance bounding.
The relationship between conductance and right spectral gap has recently been used in other contexts (Lee and Łatuszyński, 2014; Rudolph and Sprungk, 2016) to bound the behaviour of one Markov kernel in terms of that of another. Lemma 1 itself shows that condition (7) need only hold for a single ; however, since in practice is unlikely to be known, the conditions of Theorem 1 are more practically useful.
From Section 4 we apply Theorem 1 to provide sufficient conditions for a delayed-acceptance kernel to inherit variance bounding from its Metropolis-Hastings parent. However, if a DA kernel is variance bounding then so is its parent MH kernel. Thus, the sufficient conditions in Section 4 imply an equivalence between the two kernels with respect to the variance bounding property. In this section, after two key definitions, we return, briefly, to this equivalence with regard to the variance bounding property and provide sufficient conditions for equivalence (over potential targets) between Metropolis-Hastings kernels arising from two different proposal densities.
The most natural special case of (7) in practice is where the kernel is uniformly local, which we define as follows:
Definition 2.
(Uniformly Local) A proposal is uniformly local if, given any ,
[TABLE]
A propose-accept-reject kernel is defined to be uniformly local when its proposal is uniformly local.
Here and throughout this article, is the open ball of radius centred on . In our examples, indicates the Euclidean norm, although the results are equally valid for other norms such as the Mahalanobis norm.
Control of the ratio will also be important and so we define the following.
Definition 3.
For any proposal density ,
[TABLE]
Clearly, the RWM is a uniformly local kernel; moreover . In contrast, on any target with unbounded support, the MHIS cannot be uniformly local; as we shall see, the behaviour of is then irrelevant. For the MALA and the truncated MALA we have:
Proposition 2.
(A) Let be the proposal for the truncated MALA in (6) or for the MALA on a target where . Then
(i) For all , uniformly in as , so is uniformly local, as defined in (9).
(ii) .
(B) The proposal, , for the MALA on a target where is not uniformly local.
The applicability of Lemma 1 and Theorem 1 ranges beyond delayed-acceptance kernels. Here we supply sufficient conditions for an equivalence between Metropolis–Hastings proposals.
Theorem 2.
Let , , , , and be as in Definition 1 except that and are uniformly local proposal kernels, with a continuous function from to . If, for -almost all and for some function with for all ,
[TABLE]
then is variance bounding if and only if is variance bounding.
Thus, for example, any two random-walk Metropolis algorithms with Gaussian jumps are equivalent, in that if, on a particular target, one is variance bounding then so is the other. When restricted to targets with a continuous gradient this equivalence extends to truncated MALA algorithms. The continuity requirement on rules out, for example, an equivalence between a Gaussian random walk and a random walk where the proposal has bounded support; indeed, the latter may not even be ergodic if the target has gaps in its support.
4 Application to delayed-acceptance kernels
4.1 Key definitions and properties
For uniformly local kernels we will describe two general sets of sufficient conditions for (8) to hold. The first is based upon the fact that the acceptance probability for can be written as
[TABLE]
where and are as defined in (3). So, if or then . The quantity measures the discrepancy between the error in the approximation at the proposed value and the error in the approximation at the current value. We name this intuitive quantity, the log-error discrepancy. The quantity is less natural since it relates and .
The second set of conditions is based upon the fact that if either and or if and then , whatever the log-error discrepancy.
These considerations lead to the natural definitions of a ‘potential problem’ set, , and a ‘no problem’ set , as follows:
[TABLE]
Theorem 1 then leads directly to the following.
Corollary 1.
Let be the Metropolis-Hastings kernel given in (2) and let be the corresponding delayed-acceptance kernel given in (4). Suppose that for all there is an such that for -almost all there exists a set such that
[TABLE]
and
[TABLE]
Subject to these conditions, if is variance bounding then so is .
When has heavier tails than then for large , the set can play an important role in the inheritance of the variance bounding property. In a dimension , there are numerous possible definitions of ‘heavier tails’. The following is precisely that required for our purposes:
Definition 4.
(heavy tails) An approximate density is said to have heavy tails with respect to a density if
[TABLE]
Intuitively, the left hand side is true when is ‘further from the centre’ (according to ) than , and the implication is that the further out a point, the larger is compared with .
For uniformly local kernels we show (Corollary 3) that it is sufficient that either the log error discrepancy should satisfy a growth condition that is uniform in , or (Theorem 3) that the tails of the approximation should be heavier than those of the target and that should satisfy a growth condition that is uniform in .
For all kernels, boundedness of the error away from $0$ and $\infty$ will ensure the required inheritance (Corollary 2). This is a very strong condition, but we exhibit MHIS and MALA algorithms where the weaker conditions, that are sufficient for a uniformly local kernel, are satisfied, but the DA kernel is not variance bounding even though the MH kernel is.
4.2 DA kernels with the same proposal distribution as the parent
Suppose that for all , , then , so applying Corollary 1 with and leads to:
Corollary 2.
Let and be as described in Corollary 1. If there exist and such that , and if is variance bounding then so is .
A more direct proof of Corollary 2 is possible using Dirichlet forms. However, Corollary 1 comes into its own when the error discrepancy is unbounded.
We first provide a cautionary example which shows that once the errors are unbounded the delayed-acceptance kernel need not inherit the variance bounding property from the Metropolis-Hastings kernel even if the growth of the log error discrepancy is uniformly bounded or if has heavier tails than .
Example 3.
Let , let be an MHIS with , and let be the corresponding delayed-acceptance kernel (4), with with and . is geometrically ergodic, but is neither geometrically ergodic nor variance bounding.
The problem with the algorithm in Example 3 is that for some values the proposal, , is very likely to be a long way from and yet . Our definition of a uniformly local proposal, (9), provides uniform control on the probability that is large. Since this is only strictly necessary for , (9) is stronger than necessary, but it is much easier to check.
Our first sufficient condition for uniformly local kernels insists on uniformly bounded growth in the log-error discrepancy except when . For -almost all and for some function with for all ,
[TABLE]
If a proposal is uniformly local, given find according to (9). Then (16) implies that for , . Applying Corollary 1 with leads to the following.
Corollary 3.
Let and be as described in Corollary 1. In addition let be a uniformly local proposal as in (9), and let the error discrepancy satisfy (16). If is variance bounding then so is .
Because most of the mass from the proposal, , is not too far away from the current value, , the discrepancy between the error at and the error at remains manageable provided the discrepancy grows in a manner that is controlled uniformly across the statespace. Since the random walk Metropolis on an exponential target density is geometrically ergodic (Mengersen and Tweedie, 1996) we may apply Corollary 3 with , and then Proposition 1, to obtain the following contrast to Example 3, which shows that the variance bounding property can be inherited even when the approximation has lighter tails than the target.
Example 4.
Let and let be a RWM algorithm on using . For any , let be the corresponding delayed-acceptance RWM algorithm using a surrogate of . is variance bounding and geometrically ergodic.
As yet, the set has not played a part in any of our examples. It is precisely this set that allows a delayed-acceptance random walk Metropolis kernel to inherit the variance bounding property from its parent even when the error discrepancy is not controlled uniformly, provided has tails that are heavier than those of . For general MH algorithms an additional control on the behaviour of is enough to guarantee inheritance of the variance bounding property.
Theorem 3.
Let be the Metropolis-Hastings kernel given in (2) and let be the corresponding delayed-acceptance kernel given in (4). Further, let be a uniformly local proposal in the sense of (9), let and be continuous, and let have heavier tails than in the sense of (15). Suppose that, in addition, for any required by (13) and (14) there exists a function with for all , such that for -almost all
[TABLE]
Subject to these conditions, if is variance bounding then so is .
We now consider the delayed-acceptance versions of the random walk Metropolis, the truncated MALA, and the MALA. Before doing this we provide the details of a property that was anticipated in Roberts and Tweedie (1996a).
Proposition 3.
Let be a random walk Metropolis kernel using and targeting a density . Let be a Metropolis-Hastings kernel on of the form , where $\pi\text{-ess sup}_{x}\,\|v(x)\|<\infty$. is variance bounding if and only if is variance bounding.
Proposition 3 clearly applies to a truncated MALA kernel on using as in (6). It, together with each of our subsequent results for the truncated MALA, also applies to a MALA kernel on a target where $\pi\text{-ess sup}_{x}\,\|\nabla\log\pi(x)\|=D<\infty$; in practice, however, the useful set of such kernels is limited to targets with exponentially decaying tails, since MALA is not geometrically ergodic on targets with heavier tails (Roberts and Tweedie, 1996a).
Given Proposition 2 and its prelude, a direct application of Theorem 3 then leads to the following.
Example 5.
Let and be, respectively, a random walk Metropolis kernel and a truncated MALA kernel on the differentiable density, . Let and be the corresponding delayed-acceptance kernels, created as in (4) through the continuous density, . Suppose also that has heavier tails than in the sense of (15). Subject to these conditions, if is variance bounding then so is , and if is variance bounding then so is .
The MALA is geometrically ergodic when applied to one-dimensional targets of the form for (Roberts and Tweedie, 1996a); when geometric ergodicity occurs provided is sufficiently small, and for the MALA is not geometrically ergodic. Even when , however, Theorem 3 does not apply because the proposal is not uniformly local.
Example 6.
Let and let be a MALA algorithm on with . Let and let be the corresponding delayed-acceptance MALA kernel (4) (i.e. using a proposal of , where ). is neither geometrically ergodic nor variance bounding, except when .
The contrast between the truncated MALA and the MALA in Examples 5 and 6 highlights the importance of a uniformly local proposal. In practice, however, if is computationally expensive to evaluate then, typically, will also be expensive to evaluate and it might seem more reasonable to base the proposal for delayed-acceptance MALA and delayed-acceptance truncated MALA on .
4.3 Kernels where the proposal is based upon
On some occasions, the proposal is a function of the posterior, , and on such occasions it may be expedient for the delayed-acceptance algorithm to use a proposal , which is based upon . The acceptance rate is , where
[TABLE]
With , the corresponding delayed acceptance kernel is
[TABLE]
Let , , and, with , consider the hypothetical Metropolis-Hastings kernel:
[TABLE]
Now, , so if is not variance bounding then is not variance bounding either. There is an exact correspondence between from the previous section, and , and it is natural to consider inheritance of geometric ergodicity from exactly as in the previous section we considered inheritance from . The theoretical results are analogous and will not be restated; moreover, the theoretical properties of kernels of the form are less well investigated. Instead we illustrate inheritance of variance bounding (or its lack) through two examples.
Example 7.
Let be a truncated MALA kernel on the differentiable density, . Let be the corresponding delayed-acceptance kernel, created as in (18) through the differentiable density . inherits the variance bounding property from if either of the following conditions holds: (i) there is uniformly bounded growth in the log error discrepancy, in the sense of (16); or (ii) has heavier tails than in the sense of (15).
Our penultimate example suggests that a delayed-acceptance MALA based upon an approximation that has heavier (though not too much heavier) tails is a reasonable choice.
Example 8.
Let and let be a MALA algorithm on with . Let and let be the corresponding delayed-acceptance MALA kernel created as in (18) through the differentiable density . is variance bounding is geometrically ergodic .
We summarise the consequences of Examples 5 to 8 for and in Table 1, filling in the two blanks with Example 9 below. The table displays the results in terms of variance bounding, which is equivalent to geometric ergodicity in all these cases by Proposition 1.
Example 9.
Let and let be a RWM or truncated MALA algorithm on with . Let and let be the corresponding delayed-acceptance RWM or truncated MALA kernel created either from (4) or (18) through the differentiable density . If , is neither geometrically ergodic nor variance bounding.
5 Numerical demonstrations
The theoretical results from Section 4 were made more concrete through Examples 1 to 9. In this section we investigate the numerical performance of delayed acceptance algorithms in examples similar to those used in earlier sections. The specific targets in the earlier Examples were chosen to demonstrate particular points as simply as possible; here we deliberately investigate a broader class of targets, the exponential family class (e.g. Roberts and Tweedie, 1996a; Livingstone et al., 2019):
[TABLE]
The parameters and in (19) govern the lightness of the tails in the target and the approximation to it respectively, and allow us to vary these separately.
A lack of variance bounding can be seen in terms of the chain struggling to leave a certain region, which typically has a low probability under . In practice, this lack of variance bounding (or a lack of geometric ergodicity) can manifest in two ways.
1. When a sensible starting value is not known, a starting value with poor properties may be chosen unwittingly and the algorithm may struggle to move from this initial point or region of the space.
2. Even when started from a reasonable value, over the course of a sufficiently long run the algorithm will visit this “danger region” and then struggle to leave.
For the target (20), the “danger region” corresponds to the tails of .
Our experiments deliberately start the algorithm in the tails of and then measure the number of iterations to reach the centre of the distribution. To make “reaching the centre” concrete, we find the number of iterations until is less than its median value under . To decide where in the tails we start, we set to its quantile under , for in Scenarios (i) and (ii), and in Scenarios (iii) and (iv); we start the algorithm from a uniformly random point on the surface of that hypersphere. In practical MCMC, many runs are of iterations, so it is not unreasonable that issues which are detected for might occur in practice even when the algorithm is started from a sensible value. We work in dimension and repeat each experiment times, except for scenario (iii) where we repeat times to avoid excessive clutter.
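The experiment described above can be sketched in one dimension as follows. This is an illustration under stated assumptions rather than the code used for the paper: the target and approximation are taken to be proportional to exp(-|x|^a) and exp(-|x|^b), the sampler is a delayed-acceptance RWM with Gaussian increments, and the names `da_rwm_hitting_time`, `lam`, `centre_radius` and `max_iter` are hypothetical. The function counts iterations until the chain first enters the central region, censoring the count at `max_iter`.

```python
import math
import random


def da_rwm_hitting_time(x0, a, b, lam, centre_radius, max_iter, rng):
    """Iterations until |x| < centre_radius for a 1-d delayed-acceptance RWM.

    Assumed (illustrative) densities: pi(x) ~ exp(-|x|^a), pi_hat(x) ~ exp(-|x|^b).
    Returns max_iter if the centre has not been reached by then.
    """

    def log_pi(z):
        return -abs(z) ** a

    def log_pi_hat(z):
        return -abs(z) ** b

    x = x0
    for n in range(max_iter):
        if abs(x) < centre_radius:
            return n
        y = x + lam * rng.gauss(0.0, 1.0)
        # Stage 1: symmetric proposal, so only the approximation ratio appears.
        log_r1 = log_pi_hat(y) - log_pi_hat(x)
        if math.log(1.0 - rng.random()) <= log_r1:
            # Stage 2: correct using the true density.
            log_r2 = (log_pi(y) - log_pi(x)) - log_r1
            if math.log(1.0 - rng.random()) <= log_r2:
                x = y
    return max_iter
```

Calling the function from a fixed start in the tails with different values of `b` gives a quick way to compare how readily the chain returns to the centre under different approximations.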
We consider four specific scenarios, and so as to bound the amount of computing time, in each scenario we set a maximum number of iterations for which the algorithm should be run. In all scenarios the time until convergence increases with the starting quantile, whether or not the algorithm is variance bounding, for the most part simply because the algorithm is starting further from the main mass of the target. However, when the disparity between algorithms grows towards an order of magnitude, this suggests danger.
For the DARWM and DAMALA, the scaling parameter, , was chosen so that for the RWM or MALA itself, the acceptance rate was a little larger than the theoretical optimum values of approximately 0.234 and 0.574 respectively. DATMALA used the same scaling as DAMALA and a truncation value such that when TMALA explored the true posterior, fewer than of the gradients were truncated.
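One simple way to carry out this kind of tuning is to estimate the acceptance rate of the parent RWM from a pilot run at each candidate scaling and to keep the scaling whose rate is closest to a chosen target rate. The sketch below does this in one dimension; it is an assumption about how such tuning might be done, not a description of the procedure used for the experiments, and the names `rwm_acceptance_rate`, `pick_scaling` and `target_rate` are hypothetical.

```python
import math
import random


def rwm_acceptance_rate(log_pi, lam, n_iter, x0=0.0, seed=0):
    """Empirical acceptance rate of a 1-d Gaussian random walk Metropolis chain."""
    rng = random.Random(seed)
    x, accepted = x0, 0
    for _ in range(n_iter):
        y = x + lam * rng.gauss(0.0, 1.0)
        # Symmetric proposal: accept with probability min(1, pi(y) / pi(x)).
        if math.log(1.0 - rng.random()) <= log_pi(y) - log_pi(x):
            x, accepted = y, accepted + 1
    return accepted / n_iter


def pick_scaling(log_pi, candidates, target_rate, n_iter=5000):
    """Return the candidate scaling whose pilot acceptance rate is closest to target_rate."""
    return min(
        candidates,
        key=lambda lam: abs(rwm_acceptance_rate(log_pi, lam, n_iter) - target_rate),
    )
```

A grid search is used here only for brevity; any one-dimensional root-finding scheme on the estimated rate would serve the same purpose.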
Scenario i (, and ). The results appear in the top-left of Figure 1 and demonstrate the undesirable behaviour when the target and the approximation are both Gaussian but the approximation has lighter tails than the target (see Example 1), and the reasonable behaviour when the approximation’s tails are less tight than the target’s (Example 2).
Scenario ii (, and ). The results, in the top-right of Figure 1, demonstrate that, in alignment with Example 3, the worst behaviour by some margin is exhibited by the only non-variance bounding algorithm: the independence sampler where uses a smaller scaling than has. In particular, aligning with Example 4, the DARWM that uses the same as the poor independence sampler performs only marginally worse than the DARWM which uses the notionally ‘safer’ .
Scenario iii (, , and ). This corresponds to Examples 5, 6 and 9 and is consistent with the DARWM and DATMALA, but not DAMALA, being variance bounding when , and none being variance bounding when .
Scenario iv (, , and , proposal uses ). The results suggest that, as with Examples 7 and 8, DAMALA and DATMALA are both variance bounding when , and that, following Example 9, neither is variance bounding when .
In scenarios (i), (iii) and (iv) the target itself has lighter-than-exponential tails, so even though the x-axis is linear in it is sublinear in the magnitude of the initial value, . Hence, issues with the algorithms might be expected to appear more slowly as increases than they do with scenario (ii). Whilst exceptionally poor behaviour is therefore unlikely to be seen during a typical run that has been started from the main posterior mass, it could easily occur as a result of a poor starting value.
6 Discussion
Delayed acceptance Metropolis-Hastings algorithms are popular when the posterior is computationally intensive to evaluate yet a cheap approximation is available. Approximations can arise through many mechanisms, including the coarsening of a numerical-integration grid, subsampling from big data, Gaussian process approximation and nearest neighbour averaging. To date, with the exception of Franks and Vihola (2020) and a note in Banterle et al. (2019), little consideration has been given to the properties of the resulting algorithm and, in particular, to whether the delayed-acceptance algorithm might inherit good properties, such as variance bounding, from its parent Metropolis-Hastings algorithm. From the MCMC output, one might reasonably hope to be able to estimate any quantity with a finite variance under and be confident that the Monte Carlo error would reduce in inverse proportion to the square-root of the run length; however, if the algorithm is not variance bounding then this may not be the case.
We have investigated the inheritance of the variance bounding property and provided sufficient conditions for it to occur. A general rule of thumb for algorithms with uniformly local (see Definition 2) proposals, such as the random walk Metropolis and the truncated MALA, is that the approximation should have heavier tails (see Definition 4) than the target; however, this is not always necessary (see Example 4). The MALA algorithm does not enjoy the same good properties as the truncated MALA and, in particular, does not necessarily inherit variance bounding even when the approximation does have heavier tails than the target (see Example 6).
A note of caution is also in order: variance bounding and geometric ergodicity are helpful properties as, in particular, they guarantee the existence of a usual central limit theorem for ergodic averages. However, whilst non-zero, the conductance of a kernel could be exceedingly small (or the geometric rate of convergence exceptionally close to one) so that the algorithm might not be useful in practice. Thus, whilst we recommend following the advice in this article when choosing the approximation so as to reduce the chance of false confidence in the resulting Monte Carlo estimates, one should also continue to check other diagnostics, such as trace plots, and to vary any tuning parameters to optimise performance.
Acknowledgements: This paper was motivated by initial conversations with Alexandre Thiery about the preservation, and lack of preservation, of geometric ergodicity for delayed-acceptance kernels.
Data availability: Data sharing is not applicable to this article as no new data were created or analysed in this study.
Appendix A Proofs of results
A.1 Proofs of results in Section 3
A.1.1 Proof of Lemma 1
Since we may define such that . For any ,
[TABLE]
Integrating both sides over with respect to gives
[TABLE]
The result follows since only sets with are relevant.
A.1.2 Proof of Proposition 2
Let .
(A) (i) , where is a vector with iid components. So as .
(ii) Algebra shows that
[TABLE]
Since , , as required.
(B) Let . For any we may find with and for all . Hence, for ,
[TABLE]
So for any , , which can be made as close to as desired by taking to be sufficiently large.
A.1.3 Proof of Theorem 2
Since and is uniformly local, we may take , the closure of , where is chosen so as to satisfy (9) for -almost all , and with .
Next, let . Since is upper bounded by both [math] and , if then , and if then . Thus . Hence for ,
[TABLE]
The first term is bounded on since is continuous, and so (8) holds and we may apply Lemma 1. Repeat with .
A.2 Proofs of results in Section 2
A.2.1 Proof of Proposition 1
Since is geometrically ergodic, it must have a left spectral gap. From (5),
[TABLE]
where the Dirichlet form for the functional of the Markov chain is
[TABLE]
for a propose-accept-reject chain.
Since , . So if is geometrically ergodic, then . The result follows as all geometrically ergodic kernels are variance bounding and the only kernels which are variance bounding but not geometrically ergodic have no left spectral gap, yet we have just shown that must have a left spectral gap because is geometrically ergodic.
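For orientation, the comparison of Dirichlet forms used in this proof can be written out as below. This is a sketch in assumed notation: $q$ denotes the proposal, $\alpha$ and $\tilde\alpha$ the acceptance probabilities of the parent and delayed-acceptance kernels $P$ and $\tilde P$, and $\mathrm{Gap}_L$ the left spectral gap. It uses the standard Dirichlet form of a reversible propose-accept-reject chain and the fact that a product of two truncated ratios is no larger than the truncated product.

```latex
\begin{align*}
\mathcal{E}_P(f)
  &= \frac{1}{2}\int\!\!\int \pi(\mathrm{d}x)\, q(x,\mathrm{d}y)\,
     \alpha(x,y)\,\{f(y)-f(x)\}^2, \\
\tilde\alpha(x,y) \le \alpha(x,y)
  &\;\Longrightarrow\; \mathcal{E}_{\tilde P}(f) \le \mathcal{E}_P(f)
  \quad \text{for all } f \in L^2(\pi), \\
\mathrm{Gap}_L(P)
  &= 2 - \sup_{f:\,\mathrm{Var}_\pi(f) > 0}
     \frac{\mathcal{E}_P(f)}{\mathrm{Var}_\pi(f)}
  \;\Longrightarrow\; \mathrm{Gap}_L(\tilde P) \ge \mathrm{Gap}_L(P).
\end{align*}
```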
A.2.2 Proof of Example 3
Since , the MH algorithm produces iid samples from and so it is geometrically ergodic (with a spectral gap of ) and, hence, variance bounding. For the DA algorithm,
[TABLE]
For any let
[TABLE]
For , , whilst for , . Also, , whilst . Therefore, for (so ) the flow out of satisfies
[TABLE]
So, for any such that , and the conductance of the chain is therefore zero; the chain is not variance bounding. The lack of geometric ergodicity follows from Proposition 1.
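The mechanism behind this argument can be checked numerically in a simple Gaussian setting, which is an illustrative assumption and not the setting of Example 3. Take the target and the independence proposal both equal to N(0, 1) and the approximation equal to N(0, s_hat^2). With w = pi_hat / pi, the two-stage delayed-acceptance probability for a move from x to y is then min{w(y)/w(x), w(x)/w(y)}, and the sketch below integrates it against the proposal by the trapezium rule. The name `da_independence_accept_prob` and its arguments are hypothetical.

```python
import math


def da_independence_accept_prob(x, s_hat, half_width=12.0, n=24000):
    """P(accept | current state x) for pi = q = N(0,1) and pi_hat = N(0, s_hat^2).

    Uses alpha(x, y) = min(w(y)/w(x), w(x)/w(y)) with log w(z) = -c z^2 + const.
    """
    c = 0.5 * (1.0 / s_hat ** 2 - 1.0)
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        y = -half_width + i * h
        phi = math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)  # proposal density
        alpha = math.exp(-abs(c) * abs(y * y - x * x))           # two-stage acceptance
        weight = 0.5 if i in (0, n) else 1.0                     # trapezium end points
        total += weight * phi * alpha
    return total * h
```

With a lighter-tailed approximation (for instance `s_hat = 0.5`) the acceptance probability collapses as the current state moves into the tails, so the chain remains stuck there for long periods.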
A.3 Proofs of results in Section 4
A.3.1 Shorthand for delayed-acceptance kernels
The following shorthand is used throughout the remainder of this section.
[TABLE]
A.3.2 Proof of Theorem 3
For any , by (9), choose such that ; set so (14) holds.
The ‘heavier-tail’ condition (15) is equivalent to . Applying the identity and relabelling and then gives .
Next, suppose that , and but , so that and have opposite signs. By (17) , and the implications derived in the previous paragraph then imply that both and . Thus .
Finally, let be the closure of , let and let ; since is continuous and is compact, . For and , and so . For and , and and, from the previous paragraph, . Hence (13) holds with , and the result follows from the proof of Corollary 1.
A.3.3 Proof of Proposition 3
Let and be the proposal densities for the Metropolis-Hastings and RWM algorithms, respectively, let and be the corresponding acceptance probabilities, and let .
Firstly, if then ; since is uniformly local, so, therefore, is . Next, algebra shows that
[TABLE]
Now consider and apply the triangle inequality to obtain,
[TABLE]
Thus and . Also and, similarly, . Both implications then follow from Theorem 1 with and with chosen so that both and .
A.3.4 Proof of Example 7
Let be the RWM kernel using a Gaussian proposal and targeting , as in Proposition 3. Applying Proposition 3 twice shows that is variance bounding if and only if is variance bounding, which occurs if and only if is variance bounding. The sufficiency of (i) then arises directly from Corollary 3. For (ii), by Proposition 2 A(ii), applied to the proposal , we may use Theorem 3. Geometric ergodicity then follows from Proposition 1.
A.3.5 Proofs of Examples 6 and 8
Let the proposal be , where . Example 6 uses and Example 8 uses . Now , so, from (21),
[TABLE]
Also . Hence, for ,
[TABLE]
as , where, here and throughout this proof, indicates convergence in probability. Now
[TABLE]
However, if , as , so, for ,
[TABLE]
as . Finally,
[TABLE]
Example 6 (). Consider the behaviour of and as . If , is dominated by and so . If , is dominated by and is dominated by ; thus, when , , and when , either or , depending on the value of . In either case, as . Given , we choose such that for all and set , so that . But can be made as small as desired, so .
Example 8 (). When , as , is dominated by so , by an analogous argument to that for Example 6.
If then we note that the proof of geometric ergodicity of the MALA algorithm in Theorem 4.1 of Roberts and Tweedie (1996a) applies to any algorithm with a proposal of the form . For an irreducible and aperiodic kernel with a continuous proposal density, such as the one under consideration, geometric ergodicity is therefore guaranteed provided the following two conditions are satisfied:
[TABLE]
where is the interior and is the region where a rejection is possible. Now, so . We will show that if , for and , both and , so that and hence is empty.
Now , so if and then . Further, from (24),
[TABLE]
The final term is non-negative when . Directly from the concavity of , we obtain
[TABLE]
so the sum of the first two terms is also non-negative when . Hence, for , is empty, as claimed.
A.3.6 Proof of Example 9
Consider any proposal of the form , where and for all . Firstly,
[TABLE]
Secondly, as ,
[TABLE]
Also, for any ,
[TABLE]
The intermediate value theorem supplies: for some . Given any , set and choose such that for all . Then with probability at most , . Next,
[TABLE]
Hence, and dominates in probability as ; i.e., and in probability as .
After some algebra,
[TABLE]
Since , as (and hence, ), becomes large, we have
[TABLE]
in probability as .
Combining these two ideas, in probability. Thus and the algorithm cannot be geometrically ergodic by Theorem 5.1 of Roberts and Tweedie (1996b); by Proposition 1 it also cannot be variance bounding.
References
- Banterle, M., Grazian, C., Lee, A., and Robert, C. P. (2019). Accelerating Metropolis-Hastings algorithms by delayed acceptance. Foundations of Data Science, 1(2):103.
- Besag, J. (1994). In discussion of ‘Representations of knowledge in complex systems’ by U. Grenander and M. Miller. J. Roy. Stat. Soc. Ser. B, 56:591–592.
- Christen, J. A. and Fox, C. (2005). Markov chain Monte Carlo using an approximation. J. Comp. Graph. Stat., 14(4):795–810.
- Cui, T., Fox, C., and O’Sullivan, M. (2011). Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm. Water Resources Research, 47(10).
- Franks, J. and Vihola, M. (2020). Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Stochastic Processes and their Applications. Early availability online.
- Geyer, C. J. (1992). Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483.
- Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L., editors, Handbook of Markov chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, pages 3–48. CRC Press, Boca Raton, FL.
- Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in practice. Chapman and Hall, London, UK.
