Variance reduction for additive functional of Markov chains via martingale representations
D. Belomestny, E. Moulines, S. Samsonov

TL;DR
This paper introduces a new variance reduction technique for additive functionals of Markov chains using a discrete-time martingale representation, improving efficiency without requiring ergodicity or stationary distribution knowledge.
Contribution
The paper presents a novel non-asymptotic variance reduction method for Markov chains based on martingale representations, applicable to MCMC methods.
Findings
Cost-to-variance ratio is improved over naive algorithms.
Method does not require stationary distribution or ergodicity.
Numerical tests show enhanced performance in Langevin MCMC.
Abstract
In this paper we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the stationary distribution (and even any type of ergodicity) or specific structure of the underlying density. By rigorously analyzing the convergence properties of the proposed algorithm, we show that its cost-to-variance product is indeed smaller than one of the naive algorithm. The numerical performance of the new method is illustrated for the Langevin-type Markov Chain Monte Carlo (MCMC) methods.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMarkov Chains and Monte Carlo Methods · Protein Structure and Dynamics · Theoretical and Computational Physics
\newaliascnt
lemmatheorem \aliascntresetthelemma
\newaliascntcorollarytheorem \aliascntresetthecorollary
\newaliascntpropositiontheorem \aliascntresettheproposition
\newaliascntdefinitiontheorem \aliascntresetthedefinition
\newaliascntdefinitionPropositiontheorem \aliascntresetthedefinitionProposition
\newaliascntremarktheorem \aliascntresettheremark
Variance reduction for additive functional of Markov chains via martingale representations
D. Belomestny 111Duisburg-Essen University, Germany, and HSE University, Russia, [email protected]. , E. Moulines 222Ecole Polytechnique, France, and HSE University, Russia, [email protected]., and S. Samsonov 333HSE University, Russia, [email protected].
Abstract
In this paper we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the stationary distribution (and even any type of ergodicity) or specific structure of the underlying density. By rigorously analyzing the convergence properties of the proposed algorithm, we show that its cost-to-variance product is indeed smaller than one of the naive algorithm. The numerical performance of the new method is illustrated for the Langevin-type Markov Chain Monte Carlo (MCMC) methods.
1 Introduction
Markov chains and Markov Chain Monte Carlo (MCMC) algorithms play a crucial role in modern numerical analysis, finding various applications in such research areas as Bayesian inference, reinforcement learning and online learning. As an illustration, suppose that we aim at computing , where is a function in and has a smooth and everywhere positive density w.r.t the Lebesgue measure (By abuse of notation, we use the same notation for the probability measure and its density with respect to the Lebesgue measure). Typically it is not possible to compute analytically, and a common solution is to use approximations based on Monte Carlo methods. Given independent identically distributed observations from , we might estimate by . The variance of such estimate equals with being the variance of the integrand with respect to . The first way to obtain a tighter estimate is simply to increase the sample size . Unfortunately, this solution might be prohibitively costly, especially when the dimension is large enough and sampling from is complicated. An alternative approach is to decrease by constructing a new Monte Carlo experiment with the same expectation as the original one, but with a lower variance. Such methods are known as variance reduction techniques. Introduction to many of them can be found in Rubinstein and Kroese [2016], Gobet [2016], Glasserman [2013].
One of the popular approaches to variance reduction is the control variates method (see [South et al., 2021] and the references therein). It aims at constructing a cheaply computable random variable (control variate) with and , such that the variance of the random variable is small, where . One of the main difficulties here is to construct a class of control variates satisfying . The complexity of this problem essentially depends on the degree of our knowledge on . For example, if is analytically known and satisfies some regularity conditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties, see, for example [Dimov, 2008, Section 3.2]. Alternatively, if an orthonormal system in is analytically available, one can build control variates as a linear combination of the corresponding basis functions, see [Ben Zineb and Gobet, 2013]. Furthermore, if is known only up to a normalizing constant (which is often the case in Bayesian statistics), one can apply the recent approach of constructing control variates depending only on the gradient using either a Schrdinger-type Hamiltonian operator in [Assaraf and Caffarel, 1999, Mira et al., 2013], or the Stein operator in [Brosse et al., 2018]. In some situations is not known analytically, but can be represented as a function of simple random variables with known distribution. Such situation arises, for example, in the case of functionals of discretized diffusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties, see [Belomestny et al., 2018]. Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance.
Unfortunately, it is not always possible to generate independent observations distributed according to . To overcome this problem one might consider MCMC algorithms, where the exact samples from are replaced by forming a Markov chain with a marginal distribution of converging to in a suitable metric as goes to infinity. It is still possible to apply the control variates method in a similar manner to the plain Monte Carlo case, yet the choice of the optimal control variate becomes much more involved. Due to significant correlations between the elements of the Markov chain, it might be not enough to minimize the marginal variances of as it was in independent case. Instead one may choose the control variate by minimizing the corresponding asymptotic variance of the chain as it is suggested in Belomestny et al. [2020]. At the same time it is possible to express the optimal control variate in terms of the solution of the Poisson equation for the corresponding Markov chain . As it was observed in Henderson [1997], Henderson and Simon [2004], for a time-homogeneous Markov chain with a stationary distribution , the function has zero mean with respect to for an arbitrary real-valued function , such that . Hence, is a valid control functional for a suitable choice of , with the best given by a solution of the Poisson equation
[TABLE]
For such we obtain leading to an ideal estimator with zero variance. Despite the fact that the Poisson equation involves the quantity of interest and can not be solved explicitly in most cases, this idea still can be used to construct some approximations for the optimal zero-variance control variates. For example, Henderson [1997] proposed to compute approximations to the solution of the Poisson equation for specific Markov chains with particular emphasis on models arising in stochastic network theory. In Dellaportas and Kontoyiannis [2012] and Brosse et al. [2018] series-type control variates are introduced and studied for reversible Markov chains. It is assumed in Dellaportas and Kontoyiannis [2012] that the one-step conditional expectations can be computed explicitly for a set of basis functions. Brosse et al. [2018] proposed another approach tailored to diffusion setting which does not require the computation of integrals of basis functions and only involves applications of the underlying generator. For more information on diffusion based algorithms we refer reader to the recent works [Dalalyan, 2017, Durmus and Moulines, 2017, Lemaire, 2007, Pagès and Panloup, 2018]. Another family of variance reduction techniques aims at constructing a parametric class of control variates with zero mean with respect to the ergodic measure . A popular choice is Stein control variates (see Belomestny et al. [2020], Mira et al. [2013], Oates et al. [2016], South et al. [2018, 2021] and references therein).
In this paper we propose a generic variance reduction method for additive functionals of Markov chains. Compared to Stein control variates techniques, the knowledge of the stationary distribution is not required. The variance reduction method we propose thus applies not only to MCMC methods (for which the distribution is known), but also to the more general setting in which the stationary distribution is not analytically known; such examples arise, in particular, when one wishes to integrate according to the stationary distribution of an ergodic diffusion or to estimate the value function (or the gradient of the value function) in reinforcement learning algorithms.
Compared to Dellaportas and Kontoyiannis [2012], our approach is not restricted to -reversible Markov kernels. We provide a non-asymptotic analysis for the so-called normal noise model, which covers as a special example the Langevin dynamics. We also consider variance reduction in the problem of estimating the expectation of functions under the unknown stationary distribution of ergodic diffusion process.
The paper is organized as follows. In Section 2 we set up the problem and introduce some notations. In Section 3, we outline the construction of a novel martingale representation. In Section 4 we show how this martingale representation can be used to construct control variates. In Section 5 we analyze performance of the proposed variance reduction algorithm in case of the Markov chain, driven by the normal noise (see Section 5 for the precise definition). Finally, in Section 6 we illustrate our findings on different numerical examples.
2 Setup
Our aim is to numerically compute expectations of the form
[TABLE]
where and is a probability measure supported on equipped with its Borel -field. If is large and can not be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from is impossible and this precludes the use of plain Monte Carlo methods in this case. One popular alternative to Monte Carlo is Markov Chain Monte Carlo (MCMC) where one is looking for a discrete time (possibly non-homogeneous) Markov chain such that is its unique invariant measure. In this paper we study a class of MCMC algorithms with satisfying the following recurrence relation:
[TABLE]
for some i.i.d. random vectors with distribution and some Borel-measurable functions In fact, this is quite general class of Markov chains (see Douc et al. [2018, Theorem 1.3.6]) and many well-known MCMC algorithms can be represented in the form (2). Let us consider two popular examples.
Example 2.1** (Metropolis-Adjusted Langevin Algorithm).**
The Metropolis-Hastings algorithm associated with a target density requires to choose a proposal transition density . The Markov chain is constructed as follows:
Given the previous state , we generate a proposal 2. 2.
Accept the proposal with probability where
[TABLE]
Otherwise, set .
This transition is reversible with respect to and therefore preserves the stationary density ; see [Douc et al., 2018, Chapter 2]. If has a wide enough support to eventually reach any region of the state space with positive mass under , then this transition is irreducible and is a maximal irreducibility measure Mengersen and Tweedie [1996]. The Metropolis-Adjusted Langevin algorithm (MALA) takes (6) as proposal, that is,
[TABLE]
with is a density of the standard normal random variable. It is not difficult to see that the MALA chain can be compactly represented in the form
[TABLE]
where is an i.i.d. sequence of uniformly distributed on random variables independent of Thus, we recover (2) with and
[TABLE]
Example 2.2**.**
Let be the unique strong solution to SDE of the form:
[TABLE]
where and are locally Lipschitz continuous functions with at most linear growth. The process is a Markov process and let denote its infinitesimal generator defined by
[TABLE]
for any If there exists a twice continuously differentiable Lyapunov function such that
[TABLE]
then there is an invariant probability measure Invariant measures are crucial in the study of the long term behaviour of stochastic differential systems (3). Under some additional assumptions, the invariant measure is ergodic and this property can be exploited to compute the integrals for by means of ergodic averages. The idea is to replace the diffusion by a (simulable) discretization scheme of the form (see e.g. [Pagès and Panloup, 2018], [Lamberton and Pagès, 2002])
[TABLE]
where and is a non-increasing sequence of time steps. Then for a function we can approximate via
[TABLE]
Due to typically high correlation between , variance reduction is of crucial importance here. As a matter of fact, in many cases there is no explicit formula for the invariant measure and this makes the use of the Stein control functions (see e.g. [Mira et al., 2013, Oates et al., 2017]) impossible in this case.
If for some continuously differentiable function , and , the Markov chain (4) can be used to approximately sample from the density
[TABLE]
provided that . This method is usually referred to as Unadjusted Langevin Algorithm (ULA). In practice, a constant step-size discretization
[TABLE]
is often considered, where is an i.i.d. sequence of -dimensional standard Gaussian random vectors. Note that the invariant distribution of the chain (6) is in general different from and is not available analytically, although converges to when , see Mattingly et al. [2002], Durmus and Moulines [2017]. Hence the methods based on the Stein control variates will introduce additional bias when applied to (6).
3 Martingale representation
In this section we provide a general discrete-time martingale representation for Markov chains of type (2) which is used later to construct an efficient variance reduction algorithm. Let be a complete orthonormal system in with . In particular, we have
[TABLE]
with Notice that this implies that the random variables , , are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm and a tensor product of shifted Legendre polynomials for ”uniform part” and Hermite polynomials for ”Gaussian part” of the random variable in MALA, as the shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on
Let be i.i.d. dimensional random vectors with distribution . We denote via the filtration generated by with the convention . Let be a measurable function. Set for and ,
[TABLE]
with the functions defined as
[TABLE]
with the convention . Note that for any bounded measurable function , any and , it holds
[TABLE]
We write and as a shorthand notation for and , respectively. We formulate the results below for bounded measurable functions, but these results can be easily extended to unbounded functions at the expense of using classical drift conditions to control the moments. For simplicity and readability, we leave this elementary extension to the reader.
Theorem 1**.**
For any , any , any Borel bounded functions and the following representation holds in
[TABLE]
where is given by (7) and for any ,
[TABLE]
Proof.
The proof is postponed to Section 7.2. ∎
Corollary \thecorollary.
Assume that , for all . Then for any , , a bounded measurable function, and , it holds in \mathrm{L}^{2}\bigl{(}\mathbb{R}^{mq},P^{\otimes q}_{\xi}\bigr{)}
[TABLE]
where for all ,
[TABLE]
Discussion
The representation (9) is remarkable for two reasons. First, it suggests a general way of constructing zero-mean random variables adapted to the filtration Indeed any random variable of the form
[TABLE]
for some measurable functions has zero mean (conditional on ) and is adapted to Second, it shows that for coefficients defined in (10) the representation (9) computes exactly that is, the control variate
[TABLE]
is perfect and leads to zero variance when computing by Monte Carlo. Another equivalent representation of the coefficients turns out to be more useful in practice.
Proposition \theproposition.
Let . Then the coefficients in (10) can be alternatively represented as
[TABLE]
with In the homogeneous case , the coefficients in (11) are given respectively for all by
[TABLE]
We now show how the representation (9) can be used to construct variable for of additive functionals of Markov chains. For the sake of clarity, in the sequel, we consider only the time homogeneous case ( for all ). For a bounded measurable function, denote , where is the number of samples. To avoid overloading the notations, the dependence in the initial condition is removed when it can be inferred from the context For any , , and , set
[TABLE]
Section 3 applied with implies that for any ,
[TABLE]
Since is independent of , is measurable, and , we get for any measurable function with that . This implies that for any , \bigl{(}M^{x}_{p,k}\bigr{)}_{p=1}^{\infty} is a square-integrable martingale sequence with respect to filtration and hence that, for any and , . In addition, since if , we obtain for any ,
[TABLE]
Fix some and denote
[TABLE]
The expansion (15) suggests to consider the following estimator
[TABLE]
By construction, for any and , as Moreover, we obtain
[TABLE]
Hence we expect to be small, provided that decay fast enough as .
If the empirical mean estimator is convergent in quadratic mean, the same is true for . This is formalized in the following result. Denote by the Markov kernel of the Markov chain (2), defined for any bounded measurable function by .
Proposition \theproposition.
Assume that the Markov kernel has a unique invariant probability measure and that for any bounded measurable function , and ,
[TABLE]
Then, for any ,
[TABLE]
The proof of Section 3 is an elementary consequence of (19) and is left to the reader. A direct consequence is that if the sequence of estimator is consistent in quadratic mean then is also consistent in quadratic mean. For any and , and the variance of is always smaller than that of .
Below is a simple illustrative example showing that can be much smaller than even for .
Example 3.1**.**
Suppose that we aim at sampling from the Gaussian distribution with zero mean and variance with the density using the ULA algorithm (see 2.2 and equation (6)). We consider the Markov chain given by
[TABLE]
where is an i.i.d. sequence of normally distributed random variables with zero mean and unit variance. The invariant distribution of this Markov chain is Gaussian with zero mean and variance . As a complete orthogonal system in , we consider the normalized Hermite polynomials on , that is,
[TABLE]
Consider now the problem of estimating for . Note that
[TABLE]
and by recalling the definition of the Hermite polynomials, we arrive at the martingale representation
[TABLE]
where for and ,
[TABLE]
We stress that the coefficients do not depend on in this special example. Due to (14) and (23), we can represent as
[TABLE]
where, for any and ,
[TABLE]
The decomposition (25) implies that
[TABLE]
*Hence, given that , we estimate *
[TABLE]
which does not depend upon . On the other hand, \mathsf{Var}\left(\pi^{x}_{n}(f)\right)=n^{-2}\sum_{i,j}\mathsf{Cov}{\bigl{(}f(X_{i}^{x}),f(X_{j}^{x})\bigr{)}}, and since and are Gaussian random variables, application of the Isserlis formula yields
[TABLE]
Using the identity \mathsf{Cov}{\bigl{(}X_{i}^{x},X_{j}^{x}\bigr{)}}=\gamma\sum_{k=1}^{i\wedge j}(1-\gamma)^{i+j-2k}\geq(1/2)\bigl{(}(1-\gamma)^{|i-j|}-(1-\gamma)^{i+j}\bigr{)}, we get
[TABLE]
Thus, for , \mathsf{Var}\left(\pi^{(x,1)}_{n}(f)\right)\Bigl{/}\mathsf{Var}\left(\pi^{x}_{n}(f)\right)\leq 4\gamma, and the variance reduction effect is large when . To make the variance reduction effect clear, we plot the ratio \mathsf{Var}\left(\pi^{x}_{n}(f)\right)/\mathsf{Var}\bigl{(}\pi_{n}^{(x,1)}(f)\bigr{)}, computed according to (26) with , and different values of the step size . The corresponding plots are provided in Figure 1. It illustrates first that the gain in variance indeed scales linearly in for small enough, and, second, that even for moderate values of , estimate is preferable in terms of variance. ∎
4 Martingale Decomposition Control Variate (MAD-CV) algorithm
We now describe an algorithm to estimate the martingale introduced in (17). To keep the computational complexity at a reasonable level, we estimate a fixed number of coefficients for , where the truncation index does not depend on . The corresponding martingale then is written as where
[TABLE]
and we consider the truncated version of the estimator (18):
[TABLE]
The truncation leads to an increase of the variance. More precisely, proceeding as in (19), we obtain
[TABLE]
where is defined in (14). To illustrate the effect of truncation on the variance reduction, we compare and using the simple 3.1.
**Example 3.1 **(continued).
We now consider the estimate where is the truncated martingale
[TABLE]
where , is defined in (24). Then, proceeding as in 3.1, we obtain
[TABLE]
where is defined in (25). Plugging (24) in the previous identity, we get
[TABLE]
Setting , we obtain for , yielding the same (up to a constant factor) variance reduction factor, that is,
[TABLE]
Here we write and for inequality up to a constant not depending on and .
The last step to define an estimator of the coefficients . In the previous example, the calculation of these coefficients is explicit, but this is obviously not the case in general. We propose to use the representation outlined in (13) of the functions . This representation suggests to first approximate the -th step predictor , , for . For that purpose, we consider a parametric family of functions from to , denoted , where . There are many ways to define such family of functions. The simplest idea is to select a family of functions , and to set
[TABLE]
However, this is not necessarily the best choice when the prediction functions have a specific structure. For example, for Metropolis-Hastings algorithms, the sampling step uses an accept/reject step. In such case, it is more appropriate to consider predictors of the form
[TABLE]
In this decomposition, estimates the -th step rejection probability, i.e. the probability of observing successive rejection. This probability can be estimated by logistic regression.
The parameter vector are estimated via the least-squares approach. More precisely, for , i.e. we solve
[TABLE]
Finally, we compute the estimates of the functions (see (13)). Namely, for all we define
[TABLE]
where is defined in (2). Note that in some relevant cases (e.g. for being linear in and the regression function being a linear combination of basis functions as in (30)), the expectation in (33) can be computed in closed form. When direct integration is not an option, we use Monte Carlo or Quasi Monte Carlo to compute the integrals . The complexity of this parametric integration problem is well studied. In order to increase efficiency, one can also employ the Multilevel Monte Carlo approach (see Heinrich and Sindambiwe [1999]). The estimator obtained by plugging (33) into (27) and (28) is referred to as the MAD-CV (MArtingale Decomposition Control Variate) estimator.
The resulting estimate
[TABLE]
with
[TABLE]
remains unbiased for (if computed on a new trajectory independent of regression data) and has a variance
[TABLE]
While the first and the second terms will be studied in Section 5 in some special cases, the last one has to be analyzed separately for different approximation schemes, and we leave this analysis for future research. Nevertheless already this decomposition shows that under the conditions of Proposition 3, it holds for a fixed
[TABLE]
provided that the expectations are uniformly bounded for in and . The latter property of the estimates can be achieved by using an additional truncation step in regression, see e.g. Györfi et al. [2006] for various truncation schemes.
5 Gaussian noise model
We analyze the MAD-CV algorithm for the Markov chains driven by a normal noise, that is,
[TABLE]
For a multi-index , we denote by the normalized Hermite polynomial on , that is, with defined in (22). The following notations are used in the sequel: , and . In this case , and the martingale takes the form
[TABLE]
Recall that , , and . For a twice differentiable function , we denote by its Hessian at point . For a smooth function , a multi-index , we use the notation for the partial derivative
[TABLE]
For , a smooth function with arguments being denoted , , , a multi-index , and , we use the notation for the multiple derivative of with respect to the components of :
[TABLE]
where and stands for partial derivative of order for the function with respect to the th - coordinate of the vector For and , we denote
[TABLE]
By setting , where is defined in (8), we obtain .
We preface the derivations with an auxiliary lemma which provides us with a representation for the coefficients defined in (13). Let and consider the following assumptions
H 1**.**
The function is times continuously differentiable.
H 2**.**
The function is times continuously differentiable.
Lemma \thelemma.
Assume H 1 and H 2 and that (36) holds. Then for any such that componentwise and , any , the following representation holds
[TABLE]
Proof.
The proof is postponed to Section 7.3. ∎
Under some additional smoothness assumptions one can derive a useful bound for the sum of the functions which can be directly used to bound the variance of additive functionals (19).
Proposition \theproposition.
Assume H 1, H 2, and that (36) holds. Then for any , it holds
[TABLE]
where for a non-empty subset , we denote .
Proof.
The proof is postponed to Section 7.4. ∎
We aim at applying our main result to the estimation of expectations under the stationary distribution of ergodic diffusion processes. Let be a drift function, be a dimensional Wiener process and assume that the stochastic differential equation
[TABLE]
admits a unique strong solution for any . We consider the Euler-Maruyama discretization of the SDE (40), i.e. the homogeneous Markov chain , starting from and defined by the following recursion: for any ,
[TABLE]
where is a stepsize and is a sequence of i.i.d. dimensional Gaussian random variables with zero mean and identity covariance matrix. Note that the recurrence (41) is a particular case of the general scheme (36) with . We impose some standard technical conditions on the drift function , following Bortoli and Durmus [2020], namely,
H 3**.**
There exist a constant , such that for any .
H 4**.**
There exist a constant , such that for any .
Under the assumptions H 3 and H 4, Section 5 can be used to bound the variance of additive functionals of the Markov chains of the form (41).
Theorem 2**.**
Let be a Markov chain given by the recurrence (41), and assume that H 2, H 3, and H 4 hold. Let . Assume in addition that there exist constants and , such that for any , any multi-index with , and any ,
[TABLE]
Then, for and any ,
[TABLE]
Moreover, with the truncation point , variance of the truncated estimate can be bounded as
[TABLE]
where stands for inequality up to a constant not depending on and .
Proof.
The proof is postponed to Section 7.5. ∎
Theorem 2 shows that under some conditions the variance of the estimate in the diffusion case (41) satisfies
[TABLE]
At the same time, the variance of the standard Monte Carlo estimate is of order and this order can not be reduced in general, see Example 3.1. Thus, for and small enough we have a clear variance reduction effect.
Remark \theremark.
In the particular case of the Unadjusted Langevin algorithm (Example 2.2), assumptions of the Proposition 2 can be verified for the smooth and strongly convex potential , that is, for and
[TABLE]
for some , and any .
6 Numerical experiments
In this chapter we evaluate our MAD-CV control variates on different model examples. Code to reproduce the experiments is available at https://github.com/svsamsonov/MAD-CV.
6.1 Example 3.1 (continue)
In this subsection we complete the Example 3.1 by evaluating the estimator . Recall that we consider samples from the Gaussian distribution with density using the ULA algorithm (see (21)) and take . We use different step sizes and sample training trajectory of length for each step size. We solve the least squares problems (32) with basis and the truncation points . Then we construct the control variate defined in (35). Our goal is to compare the variance of the truncated estimator to the variance of the ”perfect” (with exact coefficients ) estimator . To this end, we we show two quantities. The first one is the ratio \mathsf{Var}\left[\pi^{x}_{n}(f)\right]\Bigl{/}\mathsf{Var}\left[\pi^{(x,1)}_{n,n_{0}}(f)\right], which can be computed analytically for different . The second one is the sample counterpart of the ratio \mathsf{Var}\left[\pi^{x}_{n}(f)\right]\Bigl{/}\mathsf{Var}\left[\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\right], computed over independent replications of test trajectories, each of length . Left panel of Figure 2 contains error bars for \mathsf{Var}\left[\pi^{x}_{n}(f)\right]\Bigl{/}\mathsf{Var}\left[\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\right] and indicates that the use of regression does not lead to a significant drop in the algorithm’s performance.
Next we aim to illustrate the results of Theorem 2 by comparing \mathsf{Var}\bigl{[}\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\bigr{]} to \mathsf{Var}\bigl{[}\hat{\pi}^{(x,2)}_{n,n_{0}}(f)\bigr{]}. We fix and use different step sizes . The least squares problems (32) are solved with the regressors . Following Theorem 2, we set the truncation points for and for , respectively. We compute the sample variance reduction factors \mathsf{Var}\bigl{[}\pi^{x}_{n}(f)\bigr{]}\Bigl{/}\mathsf{Var}\bigl{[}\hat{\pi}^{(x,K)}_{n,n_{0}}(f)\bigr{]},K=1,2 over independent trajectories of length , repeat this procedure times and report the averaged variance reduction factors in the upper right panel of Figure 2. To highlight the gain of the estimate , we report on the lower right panel of Figure 2 the averaged ratios \mathsf{Var}\bigl{[}\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\bigr{]}/\mathsf{Var}\bigl{[}\hat{\pi}^{(x,2)}_{n,n_{0}}(f)\bigr{]}. Note that they scale approximately as , as predicted by Theorem 2.
6.2 Comparison with vanilla ULA
We compare the variance reduction versus cost achieved by MAD-CV against plain Monte Carlo for the ULA algorithm. We consider samples generated by ULA, where is either the standard normal distribution in dimension or the mixture of two dimensional standard Gaussian distributions of the form
[TABLE]
We fix and . In both examples, our goal is to estimate with and . We use a constant step size and sample training trajectory of length with the starting point . Then we solve the least squares problems (32) with the class of regressors for the different choices of truncation point . We construct the control variate , defined in (37). We finally estimate the cost-to-variance ratio (degree of variance reduction relative to costs) as follows
[TABLE]
by its empirical counterpart, computed over independent trajectories, each of length . Note that for dimensional standard Gaussian vector , for multi-indices it holds that
[TABLE]
for any . This implies that the coefficients , for any and . Since for a fixed the cost of computing is proportional to the cost of computing function , we set for
[TABLE]
since for the fixed , each coefficient is a polynomial, which can be computed at the same cost as . Similarly, for we set
[TABLE]
since we need to evaluate coefficients in addition to each evaluation of . Variance reduction costs for Gaussian distribution and different truncation points are summarized in Figure 3, and for the Gaussian mixture - in Figure 4. Note that for both examples MAD-CV allows us to obtain a significant gain in terms of cost-to-variance ratios.
6.3 Random Walk Metropolis (RWM) example
We illustrate the application of MAD-CV to the RWM algorithm. RWM is an MCMC algorithm using random walk proposal. Let and be independent i.i.d. sequences, with and . Then the -th RWM iterate writes as
[TABLE]
where \alpha(x,y)=\min\bigl{\{}1,\pi(y)/\pi(x)\bigr{\}} is the acceptance ratio. In this experiment is set to be the standard normal distribution in dimension . We aim to estimate with , using RWM. The variance of the incremental distribution is determined by , which leads to an acceptance rate in stationarity of approximately . We sample a training trajectory of length , and solve the regression problem (32) with the polynomial regressors , . We illustrate that MAD-CV can benefit when taking into account both randomness in and . Namely, for a multi-index , and , we consider basis functions
[TABLE]
with being shifted Legendre polynomials on . We use QMC to evaluate the corresponding functions in (33). We write for the version of estimator (34) based on the coefficients , .
To test our variance reduction algorithm, we generate independent trajectories of length . We use Hermite polynomials in each coordinate (that is, ). The compared cases are when the MAD-CV are only applied to the proposal (44) (that is, ), and when the MAD-CV are applied jointly on the proposal and the acceptance step (). Figure 5 displays the boxplots of the estimates together with the estimated standard deviations of the corresponding estimates for different truncations . Note that combining Legendre and Hermite polynomials allows to achieve better variance reduction compared to the case when only Hermite polynomials are used.
6.4 Euler scheme for discretized diffusion
We consider the dimensional stochastic differential equation
[TABLE]
with the drift function
[TABLE]
We aim at estimating for the functions and , where is an ergodic distribution of (46). We fix , , and consider the Euler-Maruyama discretization of (46) with constant stepsize , and approximate by and its MAD-CV counterparts. Note that the assumptions of Theorem 2 are satisfied. We consider estimators with and or . We refer to them as to MAD-CV-1 and MAD-CV-2, respectively. First we sample a training trajectory of length , and solve the regression problem (32) with the class of regressors for . To test our variance reduction algorithm, we generate independent test trajectories of length and compute and . The corresponding boxplots are presented in Figure 6.
6.5 Lotka-Volterra model with feedback control
We consider the stochastic Lotka-Volterra predator-prey model with feedback control, following Liu and Zhao [2019]:
[TABLE]
where denote independent Wiener processes, parameters correspond to intraspecific competition rates, stand for capturing rates of the prey and predator, represent the intrinsic growth rate of the population and . We consider Euler-Maruyama discretisation of the equation (47) with step size , and fix the hyperparameters
[TABLE]
Note that the assumptions and from Liu and Zhao [2019], namely and are satisfied, and the system (47) oscillates around its equillibrium point. We fix or , and aim at approximating by and its variance-reduced counterparts. We sample a training trajectory of length , and solve the regression problem (32) with the class of regressors for . We set the truncation level and generate independent test trajectories of length . For each trajectory we compute and its variance-reduced counterpart . We show the simulated trajectories of the system (47) in Figure 8a, and the boxplots corresponding to in Figure 8b.
6.6 Multidimensional stochastic Lotka-Volterra model
Following Mao et al. [2003], we consider Lotka–Volterra model for a system with interacting components, corresponding to the case of facultative mutualism, namely
[TABLE]
where
[TABLE]
Stochastically perturbing parameters , we come up with the system
[TABLE]
where denote independent Wiener processes and is a matrix with and . We consider the Euler-Maruyama discretisation of the equation (48) with constant step size , and fix the hyperparameters
[TABLE]
We also set . Note that the conditions of [Mao et al., 2003, Theorem 1] are satisfied, and the system (47) has a unique positive solution. We fix , and aim at approximating by and its variance-reduced counterparts. We sample a training trajectory of length , and solve the regression problem (32) with the class of regressors for . We set the truncation level and generate independent test trajectories of length . For each trajectory we compute and its variance-reduced counterpart . We show the simulated trajectories of the system (47) in Figure 8a, and the boxplots corresponding to in Figure 8b.
7 Proofs
7.1 Notations.
For multi-indices and , such that , , for and smooth function , such that
[TABLE]
we write
[TABLE]
We write for the Jacobian of at point , and for the matrix of partial derivatives . For the multi-indices , we write that , if one of the following holds:
- •
;
- •
and , or
- •
, , …, , and .
For the multi-indices we define the set of multi-indices , such that for some , and for , for and are such that
[TABLE]
7.2 Proof of Theorem 1
The expansion obviously holds for any and . Indeed, since is a complete orthonormal system in , it holds in that
[TABLE]
for any bounded with . Assume now that (9) holds for any , and bounded measurable functions . Let us prove that the induction assumption holds for and any . Denote for and ,
[TABLE]
The orthonormality and completeness of the system implies that
[TABLE]
The Parseval inequality implies that
[TABLE]
By construction, and -a.s. Hence, using (53) and (54), we get that
[TABLE]
or equivalently
[TABLE]
in which is the required statement in the case and . Consider now the case and . Set . Note that -a.s. it holds and is bounded by construction. Hence, we may apply the induction hypothesis to function the bounded measurable function, which implies
[TABLE]
with . Using that , and , w
[TABLE]
Eqs.(55) and (56) conclude the induction step for and all and hence the proof.
7.3 Proof of Section 5
Applying the integration by parts in vector form (below whenever ),
[TABLE]
The last expression yields the result.
7.4 Proof of Section 5
For multi-indices with componentwise and , , we obtain from Lemma 5, that for ,
[TABLE]
where is defined in (14). Given , by taking , we get
[TABLE]
where for any two multi-indices from we have defined
[TABLE]
In (57) the first sum runs over all nonempty subsets of the set For any subset stands for a set of multi-indices with elements and Moreover, and stands for a set of multi-indices with elements and . Applying the estimate
[TABLE]
we get
[TABLE]
The Parseval identity implies that for any function satisfying ,
[TABLE]
Using this identity in (58) implies
[TABLE]
The sum is a function of : . By the Gaussian Poincaré inequality Boucheron et al. [2013], we have
[TABLE]
where and is defined in (38). Hence,
[TABLE]
Note that for . Together with (59) this implies the statement (39).
7.5 Proof of Theorem 2
Recall that, due to (19), for ,
[TABLE]
By Section 5, for fixed , and any ,
[TABLE]
Now we fix and in , such that , and a non-empty subset . By the multivariate Faà di Bruno’s formula Constantine and Savits [1996],
[TABLE]
where the set is defined in Section 7.1. Hence, we obtain
[TABLE]
Using the bounds of Section 7.6 and Section 7.6, we obtain
[TABLE]
with a suitable constant . Substituting into (61), we obtain
[TABLE]
with a constant not depending on and . Hence, due to (60), we obtain
[TABLE]
For the truncated estimate , we obtain using (29) that
[TABLE]
Let us bound the quantity \sum_{1\leq\|\mathbf{k}\|\leq K}\bigl{[}\sum_{r=n_{0}+1}^{n-l}\bar{a}_{r,\mathbf{k}}(y)\bigr{]}^{2} for . First, let us show that for any , any ,
[TABLE]
Indeed, it holds
[TABLE]
provided that . Then
[TABLE]
and (64) follows. Note that, by the definition of ,
[TABLE]
The Parseval identity implies that
[TABLE]
Hence, selecting such that for , we obtain using (63) that
[TABLE]
7.6 Auxiliary lemmas
Lemma \thelemma.
Under the assumptions of Theorem 2, for any multi-index with , any , and , it holds
[TABLE]
with constant not depending on and . For , it holds .
Proof.
In the proof we use the notation
[TABLE]
For , we also write for th coordinate basis vector, that is, and for .
We preface the lemma by some elementary but useful identities. For any multi-index with , any , it holds
[TABLE]
[TABLE]
Since , obviously for . For , the statement of the lemma follows from (68). Now we consider the case . Fix and prove (65) for all by induction in . We start from . Note that for a given index and , the relation (67) implies
[TABLE]
Hence, we can write that
[TABLE]
where the matrix , with the entries
[TABLE]
The recurrence (69) implies that
[TABLE]
To bound , we observe
[TABLE]
Hence, using (68) and (70), we obtain
[TABLE]
Now we can apply Section 7.6 to bound with , and, using Section 7.6, we obtain for all , that
[TABLE]
which imply (65) for any multi-index with with the constant .
The induction hypothesis is therefore that the inequality
[TABLE]
holds for all multi-indices with and . We need to show (71) for all multi-indices with . The multivariate Faà di Bruno’s formula Constantine and Savits [1996] implies for , and , that
[TABLE]
Here \bigl{[}\partial_{z_{1}}^{\bm{\ell}_{i}}X_{p-1}^{x}\bigr{]}^{\mathbf{k}_{i}} is defined in (49), and the summation is taken over the set of multi-indices , such that for some , and for , for and are such that
[TABLE]
From the equation (72), taking the terms with out and using the fact that for any multi-index with , we have
[TABLE]
For and fixed , we then have
[TABLE]
with
[TABLE]
Furthermore,
[TABLE]
Note that the condition (73) implies that
[TABLE]
With the induction hypothesis (71), we bound as follows
[TABLE]
Due to [Constantine and Savits, 1996, Corollary 2.9],
[TABLE]
where is a Stirling number of a second kind (see Constantine [1987]). Hence, we can bound
[TABLE]
with some constant depending on . Thus, (74) and (75) imply
[TABLE]
We can again apply Section 7.6 and Section 7.6 to bound , and obtain (71) for all multi-indices with . This concludes the proof. ∎
Lemma \thelemma.
Let and be sequences of nonnegative real numbers with , satisfying
[TABLE]
*for any , and is some nonnegative constant. Then *
[TABLE]
Proof.
Applying (76) recursively, we get where we use the convention . The proof is completed by using an upper bound on . ∎
Lemma \thelemma.
Assume that there exist , such that for any , . Then for any , it holds
[TABLE]
Proof.
Note that for ,
[TABLE]
∎
Corollary \thecorollary.
Under the assumptions of Section 5, for all , it holds
[TABLE]
Acknowledgement
The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Assaraf and Caffarel [1999] R. Assaraf and M. Caffarel. Zero-variance principle for Monte Carlo algorithms. Physical review letters , 83(23):4682, 1999.
- 2Belomestny et al. [2018] D. Belomestny, S. Häfner, and M. Urusov. Variance reduction for discretised diffusions via regression. Journal of Mathematical Analysis and Applications , 458:393–418, 2018.
- 3Belomestny et al. [2020] D. Belomestny, L. Iosipoi, E. Moulines, A. Naumov, and S. Samsonov. Variance reduction for markov chains with application to MCMC. Statistics and Computing , 30(4):973–997, 2020. doi: 10.1007/s 11222-020-09931-z . URL https://doi.org/10.1007/s 11222-020-09931-z . · doi ↗
- 4Ben Zineb and Gobet [2013] T. Ben Zineb and E. Gobet. Preliminary control variates to improve empirical regression methods. Monte Carlo Methods Appl. , 19(4):331–354, 2013. ISSN 0929-9629. doi: 10.1515/mcma-2013-0015 . URL https://doi.org/10.1515/mcma-2013-0015 . · doi ↗
- 5Bortoli and Durmus [2020] V. D. Bortoli and A. Durmus. Convergence of diffusions and their discretizations: from continuous to discrete processes and back. 2020.
- 6Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence . Oxford University Press, 2013.
- 7Brosse et al. [2018] N. Brosse, A. Durmus, S. Meyn, and E. Moulines. Diffusion approximations and control variates for MCMC. ar Xiv preprint ar Xiv:1808.01665 , 2018.
- 8Constantine [1987] G. M. Constantine. Combinatorial Theory and Statistical Design . Wiley, New York, 1987.
