Variance reduction for additive functional of Markov chains via martingale representations

D. Belomestny; E. Moulines; S. Samsonov

arXiv:1903.07373·stat.CO·December 22, 2021

Open Access

TL;DR

This paper introduces a new variance reduction technique for additive functionals of Markov chains using a discrete-time martingale representation, improving efficiency without requiring ergodicity or stationary distribution knowledge.

Contribution

The paper presents a novel non-asymptotic variance reduction method for Markov chains based on martingale representations, applicable to MCMC methods.

Findings

01

Cost-to-variance product is smaller than that of the naive algorithm.

02

Method does not require stationary distribution or ergodicity.

03

Numerical tests show enhanced performance in Langevin MCMC.

Abstract

In this paper we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the stationary distribution (or even any type of ergodicity) or a specific structure of the underlying density. By rigorously analyzing the convergence properties of the proposed algorithm, we show that its cost-to-variance product is indeed smaller than that of the naive algorithm. The numerical performance of the new method is illustrated for Langevin-type Markov Chain Monte Carlo (MCMC) methods.

Equations

$\mathsf{E}[G(X_{1})\mid X_{0}=x]-G(x)=-f(x)+\pi(f).$

$\pi(f)=\int_{\mathbb{R}^{d}}f(x)\,\pi(\mathrm{d}x),$

$X^{x}_{p}=\Phi_{p}(X^{x}_{p-1},\xi_{p}),\quad p=1,2,\ldots,\quad X^{x}_{0}=x$

$\alpha(y,y^{\prime})=\min\Bigl\{1,\frac{\pi(y^{\prime})}{\pi(y)}\frac{q(y\mid y^{\prime})}{q(y^{\prime}\mid y)}\Bigr\}.$

$q(y|x)=\gamma^{-d/2}\bm{\varphi}\bigl([y-x+\gamma\mu(x)]/\sqrt{\gamma}\bigr).$

$X^{x}_{p+1}=X^{x}_{p}+\mathbbm{1}\bigl(U_{p+1}\leq\alpha(X^{x}_{p},Y_{p+1})\bigr)(Y_{p+1}-X^{x}_{p}),\quad Y_{p+1}=X^{x}_{p}-\gamma\mu(X^{x}_{p})+\sqrt{\gamma}Z_{p+1},$

$\Phi_{p}(x,(u,z)^{\top})=x+\mathbbm{1}\bigl(u\leq\alpha(x,x-\gamma\mu(x)+\sqrt{\gamma}z)\bigr)(-\gamma\mu(x)+\sqrt{\gamma}z).$

$\mathrm{d}\mathsf{X}^{x}_{t}=b(\mathsf{X}^{x}_{t})\,\mathrm{d}t+\sigma(\mathsf{X}^{x}_{t})\,\mathrm{d}W_{t},\quad\mathsf{X}^{x}_{0}=x,\quad t\geq 0,$

$Lg=b^{\top}\nabla g+\frac{1}{2}\sigma^{\top}D^{2}g\,\sigma$

$\sup_{x\in\mathbb{R}^{d}}LV(x)<\infty,\quad\limsup_{|x|\to\infty}LV(x)<0,$

$X^{x}_{n+1}=X^{x}_{n}+\gamma_{n+1}b(X^{x}_{n})+\sigma(X^{x}_{n})(W_{\Gamma_{n+1}}-W_{\Gamma_{n}}),\quad n\geq 0,\quad X^{x}_{0}=x,$

$\pi^{\gamma}_{n}(f)=\frac{1}{\Gamma_{n}}\sum_{i=1}^{n}\gamma_{i}f(X^{x}_{i}).$

$\pi(x)=Z^{-1}\mathrm{e}^{-U(x)/2},\quad Z=\int_{\mathbb{R}^{d}}\mathrm{e}^{-U(x)/2}\,\mathrm{d}x,$

$X^{x}_{p+1}=X^{x}_{p}-\gamma\nabla U(X^{x}_{p})/2+\sqrt{\gamma}Z_{p+1},\quad X^{x}_{0}=x,$

$\mathsf{E}[\phi_{i}(\xi)\phi_{j}(\xi)]=\delta_{ij},\quad i,j\in\mathbb{Z}_{+}$

$X^{x}_{l,p}:=G_{l,p}(x,\xi_{l+1},\ldots,\xi_{p})$

$G_{l,p}(x,y_{l+1},\ldots,y_{p}):=\Phi_{p}(\cdot,y_{p})\circ\Phi_{p-1}(\cdot,y_{p-1})\circ\cdots\circ\Phi_{l+1}(x,y_{l+1})$

$\mathsf{E}[f(X^{x}_{p})\mid\mathcal{G}_{l}]=\int[f\circ G_{l,p}](X^{x}_{l},e_{l+1},\ldots,e_{p})\,P_{\xi}(\mathrm{d}e_{l+1})\cdots P_{\xi}(\mathrm{d}e_{p}).$

$f(X^{x}_{q})=\mathsf{E}[f(X^{x}_{q})\mid\mathcal{G}_{j}]+\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}a_{q,l,k}(X^{x}_{l-1})\phi_{k}(\xi_{l})$

$a_{q,l,k}(y)=\mathsf{E}[f(X^{y}_{l-1,q})\phi_{k}(\xi_{l})],\quad q\geq l,\quad k\in\mathbb{N}.$

$f(X^{x}_{q})=\mathsf{E}[f(X^{x}_{q})\mid\mathcal{G}_{j}]+\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}\bar{a}_{q-l+1,k}(X^{x}_{l-1})\phi_{k}(\xi_{l})$

$\bar{a}_{r,k}(y)=\mathsf{E}[f(X^{y}_{r})\phi_{k}(\xi_{1})],\quad r,k\in\mathbb{N}.$

$\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}\beta_{q,l,k}(X^{x}_{l-1})\phi_{k}(\xi_{l})$

$\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}a_{q,l,k}(X^{x}_{l-1})\phi_{k}(\xi_{l})$

$a_{q,l,k}(x)=\mathsf{E}[\phi_{k}(\xi)Q_{l,q}(\Phi_{l}(x,\xi))]$

$\bar{a}_{r,k}(x)=\mathsf{E}[\phi_{k}(\xi)Q_{r-1}(\Phi(x,\xi))]\quad\text{with}\quad Q_{r}(y)=\mathsf{E}[f(X^{y}_{r})],\quad r\in\mathbb{N}.$

$A_{q,k}(y)=\sum_{r=1}^{q}\bar{a}_{r,k}(y).$

$\pi^{x}_{n}(f)=\frac{1}{n}\sum_{q=1}^{n}\mathsf{E}[f(X^{x}_{q})]+\frac{1}{n}\sum_{k=1}^{\infty}M^{x}_{n,k},\quad\text{with}\quad M^{x}_{n,k}=\sum_{l=1}^{n}A_{n-l+1,k}(X^{x}_{l-1})\phi_{k}(\xi_{l}).$

$\mathsf{Var}(M^{x}_{n,k})=\sum_{l=1}^{n}\mathsf{E}[A^{2}_{n-l+1,k}(X^{x}_{l-1})]\quad\text{and}\quad\mathsf{Cov}(M^{x}_{n,k},M^{x}_{n,k^{\prime}})=0.$

$M^{(x,K)}_{n}=\sum_{k=1}^{K}M^{x}_{n,k}.$

Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Protein Structure and Dynamics · Theoretical and Computational Physics

Full text

D. Belomestny (Duisburg-Essen University, Germany, and HSE University, Russia), E. Moulines (Ecole Polytechnique, France, and HSE University, Russia), and S. Samsonov (HSE University, Russia)

1 Introduction

Markov chains and Markov Chain Monte Carlo (MCMC) algorithms play a crucial role in modern numerical analysis, with applications in research areas such as Bayesian inference, reinforcement learning and online learning. As an illustration, suppose that we aim at computing $\pi(f):=\int f(x)\pi(\mathrm{d}x)$, where $f:\mathbb{R}^{d}\to\mathbb{R}$ is a function in $\mathrm{L}^{2}(\pi)$ and $\pi$ has a smooth and everywhere positive density w.r.t. the Lebesgue measure (by abuse of notation, we use the same notation for the probability measure and its density with respect to the Lebesgue measure). Typically it is not possible to compute $\pi(f)$ analytically, and a common solution is to use approximations based on Monte Carlo methods. Given independent identically distributed observations $X_{1},\ldots,X_{n}$ from $\pi$, we might estimate $\pi(f)$ by $\pi_{n}(f):=n^{-1}\sum_{k=1}^{n}f(X_{k})$. The variance of such an estimate equals $\sigma^{2}(f)/n$, with $\sigma^{2}(f)$ being the variance of the integrand with respect to $\pi$. The first way to obtain a tighter estimate $\pi_{n}(f)$ is simply to increase the sample size $n$. Unfortunately, this solution might be prohibitively costly, especially when the dimension $d$ is large and sampling from $\pi$ is complicated. An alternative approach is to decrease $\sigma^{2}(f)$ by constructing a new Monte Carlo experiment with the same expectation as the original one, but with a lower variance. Such methods are known as variance reduction techniques. An introduction to many of them can be found in Rubinstein and Kroese [2016], Gobet [2016], Glasserman [2013].

One of the popular approaches to variance reduction is the control variates method (see [South et al., 2021] and the references therein). It aims at constructing a cheaply computable random variable $\zeta$ (control variate) with $\mathsf{E}[\zeta]=0$ and $\mathsf{E}[\zeta^{2}]<\infty$, such that the variance of the random variable $f(X)+\zeta$ is small, where $X\sim\pi$. One of the main difficulties here is to construct a class of control variates $\zeta$ satisfying $\mathsf{E}[\zeta]=0$. The complexity of this problem essentially depends on how much we know about $\pi$. For example, if $\pi$ is analytically known and satisfies some regularity conditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties, see, for example, [Dimov, 2008, Section 3.2]. Alternatively, if an orthonormal system in $\mathrm{L}^{2}(\pi)$ is analytically available, one can build control variates $\zeta$ as a linear combination of the corresponding basis functions, see [Ben Zineb and Gobet, 2013]. Furthermore, if $\pi$ is known only up to a normalizing constant (which is often the case in Bayesian statistics), one can apply the recent approach of constructing control variates depending only on the gradient $\nabla\log\pi$, using either a Schrödinger-type Hamiltonian operator as in [Assaraf and Caffarel, 1999, Mira et al., 2013] or the Stein operator as in [Brosse et al., 2018]. In some situations $\pi$ is not known analytically, but $X$ can be represented as a function of simple random variables with known distribution. Such a situation arises, for example, in the case of functionals of discretized diffusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties, see [Belomestny et al., 2018]. Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance.
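To make the control variates idea concrete, the following minimal sketch estimates $\mathsf{E}[f(X)]$ for $X\sim N(0,1)$ and $f(x)=\mathrm{e}^{x}$, using $\zeta=-bX$ as a zero-mean control variate. The target, the integrand, the function name `control_variate_demo` and the regression-based choice of $b$ are assumptions made purely for illustration; they are not taken from the paper.

```python
import numpy as np

def control_variate_demo(n=100_000, seed=0):
    """Estimate E[exp(X)], X ~ N(0, 1), with and without a control variate."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    fx = np.exp(x)                      # integrand f(X); its exact mean is exp(1/2)

    # Control variate zeta = -b * X has E[zeta] = 0 because E[X] = 0.
    # b is the least-squares coefficient Cov(f(X), X) / Var(X), estimated from
    # the same sample (this introduces a small bias of order 1/n).
    b = np.cov(fx, x)[0, 1] / np.var(x, ddof=1)
    corrected = fx - b * x              # f(X) + zeta

    return fx.mean(), corrected.mean(), fx.var(ddof=1), corrected.var(ddof=1)

plain_mean, cv_mean, plain_var, cv_var = control_variate_demo()
print(f"plain estimator : mean={plain_mean:.4f}, per-sample variance={plain_var:.3f}")
print(f"control variate : mean={cv_mean:.4f}, per-sample variance={cv_var:.3f}")
```

Both estimators target the same expectation; only the per-sample variance changes, which is exactly the quantity a control variate is designed to shrink.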

Unfortunately, it is not always possible to generate independent observations distributed according to $\pi$. To overcome this problem one might consider MCMC algorithms, where the exact samples from $\pi$ are replaced by $(X_{p})_{p\geq 0}$, forming a Markov chain with a marginal distribution of $X_{n}$ converging to $\pi$ in a suitable metric as $n$ goes to infinity. It is still possible to apply the control variates method in a similar manner to the plain Monte Carlo case, yet the choice of the optimal control variate becomes much more involved. Due to significant correlations between the elements of the Markov chain, it might not be enough to minimize the marginal variances of $(X_{p})_{p\geq 0}$, as in the independent case. Instead one may choose the control variate by minimizing the corresponding asymptotic variance of the chain, as suggested in Belomestny et al. [2020]. At the same time it is possible to express the optimal control variate in terms of the solution of the Poisson equation for the corresponding Markov chain $(X_{p})_{p\geq 0}$. As observed in Henderson [1997], Henderson and Simon [2004], for a time-homogeneous Markov chain $(X_{p})_{p\geq 0}$ with a stationary distribution $\pi$, the function $U_{G}(x):=G(x)-\mathsf{E}[G(X_{1})|X_{0}=x]$ has zero mean with respect to $\pi$ for an arbitrary real-valued function $G:\mathbb{R}^{d}\to\mathbb{R}$ such that $G\in L^{1}(\pi)$. Hence, $U_{G}(x)$ is a valid control functional for a suitable choice of $G$, with the best $G$ given by a solution of the Poisson equation

$\mathsf{E}[G(X_{1})\mid X_{0}=x]-G(x)=-f(x)+\pi(f).$

For such $G$ we obtain $f(x)-U_{G}(x)=f(x)-f(x)+\pi(f)=\pi(f)$, leading to an ideal estimator with zero variance. Although the Poisson equation involves the quantity of interest $\pi(f)$ and cannot be solved explicitly in most cases, this idea can still be used to construct approximations of the optimal zero-variance control variates. For example, Henderson [1997] proposed to compute approximations to the solution of the Poisson equation for specific Markov chains, with particular emphasis on models arising in stochastic network theory. In Dellaportas and Kontoyiannis [2012] and Brosse et al. [2018], series-type control variates are introduced and studied for reversible Markov chains. It is assumed in Dellaportas and Kontoyiannis [2012] that the one-step conditional expectations can be computed explicitly for a set of basis functions. Brosse et al. [2018] proposed another approach, tailored to the diffusion setting, which does not require the computation of integrals of basis functions and only involves applications of the underlying generator. For more information on diffusion-based algorithms we refer the reader to the recent works [Dalalyan, 2017, Durmus and Moulines, 2017, Lemaire, 2007, Pagès and Panloup, 2018]. Another family of variance reduction techniques aims at constructing a parametric class of control variates with zero mean with respect to the ergodic measure $\pi$. A popular choice is Stein control variates (see Belomestny et al. [2020], Mira et al. [2013], Oates et al. [2016], South et al. [2018, 2021] and references therein).

In this paper we propose a generic variance reduction method for additive functionals of Markov chains. Compared to Stein control variates techniques, the knowledge of the stationary distribution is not required. The variance reduction method we propose thus applies not only to MCMC methods (for which the distribution $\pi$ is known), but also to the more general setting in which the stationary distribution is not analytically known; such examples arise, in particular, when one wishes to integrate according to the stationary distribution of an ergodic diffusion or to estimate the value function (or the gradient of the value function) in reinforcement learning algorithms.

Compared to Dellaportas and Kontoyiannis [2012], our approach is not restricted to $\pi$-reversible Markov kernels. We provide a non-asymptotic analysis for the so-called normal noise model, which covers the Langevin dynamics as a special case. We also consider variance reduction in the problem of estimating the expectation of functions under the unknown stationary distribution of an ergodic diffusion process.

The paper is organized as follows. In Section 2 we set up the problem and introduce some notations. In Section 3, we outline the construction of a novel martingale representation. In Section 4 we show how this martingale representation can be used to construct control variates. In Section 5 we analyze performance of the proposed variance reduction algorithm in case of the Markov chain, driven by the normal noise (see Section 5 for the precise definition). Finally, in Section 6 we illustrate our findings on different numerical examples.

2 Setup

Our aim is to numerically compute expectations of the form

$\pi(f)=\int_{\mathbb{R}^{d}}f(x)\,\pi(\mathrm{d}x),$

where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and $\pi$ is a probability measure supported on $\mathbb{R}^{d}$ equipped with its Borel $\sigma$-field. If $d$ is large and $\pi(f)$ cannot be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from $\pi$ is impossible, which precludes the use of plain Monte Carlo methods. One popular alternative is Markov Chain Monte Carlo (MCMC), where one looks for a discrete time (possibly non-homogeneous) Markov chain $(X^{x}_{p})_{p\in\mathbb{N}_{0}}$ such that $\pi$ is its unique invariant measure. In this paper we study a class of MCMC algorithms with $(X^{x}_{p})_{p\in\mathbb{N}_{0}}$ satisfying the following recurrence relation:

$X^{x}_{p}=\Phi_{p}(X^{x}_{p-1},\xi_{p}),\quad p=1,2,\ldots,\quad X^{x}_{0}=x$ (2)

for some i.i.d. random vectors $\xi_{p}\in\mathbb{R}^{m}$ with distribution $P_{\xi}$ and some Borel-measurable functions $\Phi_{p}\colon\mathbb{R}^{d}\times\mathbb{R}^{m}\to\mathbb{R}^{d}$. In fact, this is a quite general class of Markov chains (see Douc et al. [2018, Theorem 1.3.6]), and many well-known MCMC algorithms can be represented in the form (2). Let us consider two popular examples.

Example 2.1 (Metropolis-Adjusted Langevin Algorithm).

The Metropolis-Hastings algorithm associated with a target density $\pi$ requires choosing a proposal transition density $q$. The Markov chain is constructed as follows:

1. Given the previous state $X^{x}_{p}$, generate a proposal $Y_{p+1}\sim q(\cdot|X^{x}_{p})$.

2. Accept the proposal, that is, set $X^{x}_{p+1}=Y_{p+1}$, with probability $\alpha(X^{x}_{p},Y_{p+1})$, where

$\alpha(y,y^{\prime})=\min\Bigl\{1,\frac{\pi(y^{\prime})}{\pi(y)}\frac{q(y\mid y^{\prime})}{q(y^{\prime}\mid y)}\Bigr\}.$

Otherwise, set $X^{x}_{p+1}=X^{x}_{p}$.

This transition is reversible with respect to $\pi$ and therefore preserves the stationary density $\pi$; see [Douc et al., 2018, Chapter 2]. If $q$ has a wide enough support to eventually reach any region of the state space with positive mass under $\pi$, then this transition is irreducible and $\pi$ is a maximal irreducibility measure [Mengersen and Tweedie, 1996]. The Metropolis-Adjusted Langevin algorithm (MALA) takes (6) as proposal, that is,

$q(y|x)=\gamma^{-d/2}\bm{\varphi}\bigl([y-x+\gamma\mu(x)]/\sqrt{\gamma}\bigr),$

where $\bm{\varphi}(z):=(\sqrt{2\pi})^{-d}\mathrm{e}^{-\|z\|^{2}/2}$ is the density of the $d$-dimensional standard normal distribution. It is not difficult to see that the MALA chain can be compactly represented in the form

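$X^{x}_{p+1}=X^{x}_{p}+\mathbbm{1}\bigl(U_{p+1}\leq\alpha(X^{x}_{p},Y_{p+1})\bigr)(Y_{p+1}-X^{x}_{p}),\quad Y_{p+1}=X^{x}_{p}-\gamma\mu(X^{x}_{p})+\sqrt{\gamma}Z_{p+1},$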

where $(U_{p})_{p\geq 1}$ is an i.i.d. sequence of random variables uniformly distributed on $[0,1]$, independent of $(Z_{p})_{p\geq 1}$. Thus, we recover (2) with $\xi_{p}=(U_{p},Z_{p})\in\mathbb{R}^{d+1}$ and

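$\Phi_{p}(x,(u,z)^{\top})=x+\mathbbm{1}\bigl(u\leq\alpha(x,x-\gamma\mu(x)+\sqrt{\gamma}z)\bigr)(-\gamma\mu(x)+\sqrt{\gamma}z).$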
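The MALA transition map $\Phi(x,(u,z)^{\top})$ described in this example can be sketched in a few lines. The code below is only an illustration under stated assumptions: the target is a standard Gaussian, the drift is taken as $\mu=-\nabla\log\pi/2$ (this chunk of the paper does not define $\mu$ explicitly), and the names `log_pi`, `mu`, `log_q`, `alpha`, `phi_mala` and `run_mala` are hypothetical.

```python
import numpy as np

def log_pi(x):
    # Assumed toy target: standard Gaussian, up to an additive constant.
    return -0.5 * float(np.dot(x, x))

def mu(x):
    # Assumption: mu = -grad(log pi) / 2, i.e. mu(x) = x / 2 for this target.
    return 0.5 * x

def log_q(y, x, gamma):
    # log q(y | x) up to a constant: Gaussian centred at x - gamma * mu(x)
    # with covariance gamma * I.
    r = y - x + gamma * mu(x)
    return -float(np.dot(r, r)) / (2.0 * gamma)

def alpha(x, y, gamma):
    # Metropolis-Hastings acceptance probability, computed on the log scale.
    log_ratio = (log_pi(y) + log_q(x, y, gamma)) - (log_pi(x) + log_q(y, x, gamma))
    return float(np.exp(min(0.0, log_ratio)))

def phi_mala(x, u, z, gamma):
    # Phi(x, (u, z)): move by the Langevin step if u <= alpha, otherwise stay.
    step = -gamma * mu(x) + np.sqrt(gamma) * z
    return x + step if u <= alpha(x, x + step, gamma) else x

def run_mala(x0, n, gamma, seed=0):
    # Iterate X_p = Phi(X_{p-1}, xi_p) with xi_p = (U_p, Z_p).
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    path = np.empty((n, x.size))
    for p in range(n):
        x = phi_mala(x, rng.uniform(), rng.standard_normal(x.size), gamma)
        path[p] = x
    return path

path = run_mala(np.zeros(2), n=1000, gamma=0.5)
print("empirical mean of the chain:", path.mean(axis=0))
```

Writing the chain as a deterministic map of the state and the noise pair $(u,z)$ is the point of the representation (2): all randomness enters through $\xi_{p}$, which is what the martingale expansion later exploits.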

Example 2.2.

Let $(\mathsf{X}^{x}_{t})_{t\geq 0}$ be the unique strong solution of an SDE of the form

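$\mathrm{d}\mathsf{X}^{x}_{t}=b(\mathsf{X}^{x}_{t})\,\mathrm{d}t+\sigma(\mathsf{X}^{x}_{t})\,\mathrm{d}W_{t},\quad\mathsf{X}^{x}_{0}=x,\quad t\geq 0,$ (3)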

where $b:\mathbb{R}^{d}\to\mathbb{R}^{d}$ and $\sigma:\mathbb{R}^{d}\times\mathbb{R}^{m}\to\mathbb{R}^{d}$ are locally Lipschitz continuous functions with at most linear growth. The process $(\mathsf{X}^{x}_{t})_{t\geq 0}$ is a Markov process; let $L$ denote its infinitesimal generator, defined by

$Lg=b^{\top}\nabla g+\frac{1}{2}\sigma^{\top}D^{2}g\,\sigma$

for any $g\in C_{0}^{2}(\mathbb{R}^{d})$. If there exists a twice continuously differentiable Lyapunov function $V:\mathbb{R}^{d}\to\mathbb{R}_{+}$ such that

$\sup_{x\in\mathbb{R}^{d}}LV(x)<\infty,\quad\limsup_{|x|\to\infty}LV(x)<0,$

then there is an invariant probability measure $\pi$. Invariant measures are crucial in the study of the long-term behaviour of stochastic differential systems (3). Under some additional assumptions, the invariant measure $\pi$ is ergodic, and this property can be exploited to compute the integrals $\pi(f)$ for $f\in L^{2}(\pi)$ by means of ergodic averages. The idea is to replace the diffusion $X$ by a (simulable) discretization scheme of the form (see e.g. [Pagès and Panloup, 2018], [Lamberton and Pagès, 2002])

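$X^{x}_{n+1}=X^{x}_{n}+\gamma_{n+1}b(X^{x}_{n})+\sigma(X^{x}_{n})(W_{\Gamma_{n+1}}-W_{\Gamma_{n}}),\quad n\geq 0,\quad X^{x}_{0}=x,$ (4)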

where $\Gamma_{n}=\gamma_{1}+\ldots+\gamma_{n}$ and $(\gamma_{n})_{n\geq 1}$ is a non-increasing sequence of time steps. Then for a function $f\in L^{2}(\pi)$ we can approximate $\pi(f)$ via

$\pi^{\gamma}_{n}(f)=\frac{1}{\Gamma_{n}}\sum_{i=1}^{n}\gamma_{i}f(X^{x}_{i}).$

Due to typically high correlation between $X^{x}_{1},X^{x}_{2},\ldots$ , variance reduction is of crucial importance here. As a matter of fact, in many cases there is no explicit formula for the invariant measure and this makes the use of the Stein control functions (see e.g. [Mira et al., 2013, Oates et al., 2017]) impossible in this case.

If $b=-\nabla U/2$ for some continuously differentiable function $U$ , and $\sigma=1$ , the Markov chain (4) can be used to approximately sample from the density

$\pi(x)=Z^{-1}\mathrm{e}^{-U(x)/2},\quad Z=\int_{\mathbb{R}^{d}}\mathrm{e}^{-U(x)/2}\,\mathrm{d}x,$

provided that $Z<\infty$. This method is usually referred to as the Unadjusted Langevin Algorithm (ULA). In practice, a constant step-size discretization

$X^{x}_{p+1}=X^{x}_{p}-\gamma\nabla U(X^{x}_{p})/2+\sqrt{\gamma}Z_{p+1},\quad X^{x}_{0}=x,$ (6)

is often considered, where $\left(Z_{p}\right)_{p\geq 1}$ is an i.i.d. sequence of $d$ -dimensional standard Gaussian random vectors. Note that the invariant distribution $\pi_{\gamma}$ of the chain (6) is in general different from $\pi$ and is not available analytically, although $\pi_{\gamma}$ converges to $\pi$ when $\gamma\rightarrow 0$ , see Mattingly et al. [2002], Durmus and Moulines [2017]. Hence the methods based on the Stein control variates will introduce additional bias when applied to (6).
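The discretization bias mentioned above can be seen on a toy case. The sketch below runs the constant step-size chain (6) for the assumed potential $U(x)=|x|^{2}$, so that the drift is $-\gamma x$ and the noise is scaled by $\sqrt{\gamma}$, the standard deviation of a Brownian increment over a step of length $\gamma$. For this linear chain the variance follows the recursion $v\mapsto(1-\gamma)^{2}v+\gamma$, whose fixed point $1/(2-\gamma)$ differs from the small-step limit $1/2$. The potential and all function names are illustrative assumptions.

```python
import numpy as np

def grad_U(x):
    # Assumed potential U(x) = |x|^2, so that -grad U / 2 = -x.
    return 2.0 * x

def ula_step(x, z, gamma):
    # One step of the constant step-size scheme:
    # X_{p+1} = X_p - gamma * grad U(X_p) / 2 + sqrt(gamma) * Z_{p+1}.
    return x - 0.5 * gamma * grad_U(x) + np.sqrt(gamma) * z

def run_ula(x0, n, gamma, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    path = np.empty((n, x.size))
    for p in range(n):
        x = ula_step(x, rng.standard_normal(x.size), gamma)
        path[p] = x
    return path

def stationary_variance(gamma, n_iter=2000):
    # For this linear chain the variance obeys v <- (1 - gamma)^2 * v + gamma.
    v = 0.0
    for _ in range(n_iter):
        v = (1.0 - gamma) ** 2 * v + gamma
    return v

gamma = 0.2
path = run_ula(np.ones(1), n=20_000, gamma=gamma)
print("empirical variance after burn-in:", path[1000:].var())
print("variance recursion fixed point  :", stationary_variance(gamma))
print("closed form 1 / (2 - gamma)     :", 1.0 / (2.0 - gamma))
```

The gap between $1/(2-\gamma)$ and $1/2$ is the bias of $\pi_{\gamma}$ relative to the continuous-time limit in this example; it vanishes as $\gamma\to 0$.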

3 Martingale representation

In this section we provide a general discrete-time martingale representation for Markov chains of type (2) which is used later to construct an efficient variance reduction algorithm. Let $(\phi_{k})_{k\in\mathbb{Z}_{+}}$ be a complete orthonormal system in $\mathrm{L}^{2}(\mathbb{R}^{m},P_{\xi})$ with $\phi_{0}\equiv 1$ . In particular, we have

$\mathsf{E}[\phi_{i}(\xi)\phi_{j}(\xi)]=\delta_{ij},\quad i,j\in\mathbb{Z}_{+}$

with $\xi\sim P_{\xi}$. Notice that this implies that the random variables $\phi_{k}(\xi)$, $k\geq 1$, are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm, and a tensor product of shifted Legendre polynomials for the "uniform part" and Hermite polynomials for the "Gaussian part" of the random variable $\xi=(u,z)^{\top}$ in MALA, as the shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on $[0,1]$.
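For one-dimensional Gaussian noise, the orthonormality condition can be checked numerically. The sketch below builds normalized probabilists' Hermite polynomials $\phi_{k}=He_{k}/\sqrt{k!}$ with NumPy and evaluates the Gram matrix $\mathsf{E}[\phi_{i}(\xi)\phi_{j}(\xi)]$ by Gauss-Hermite quadrature. The helper name `hermite_normalized` and the choices of 20 nodes and $K=5$ are assumptions for illustration.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_normalized(k, x):
    """phi_k(x) = He_k(x) / sqrt(k!), orthonormal under the N(0, 1) law."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0                      # select the k-th probabilists' Hermite polynomial
    return hermeval(x, coeffs) / sqrt(factorial(k))

# Gauss-Hermite nodes and weights for the weight exp(-x^2 / 2); dividing by
# sqrt(2 * pi) turns the rule into an expectation under N(0, 1).
nodes, weights = hermegauss(20)
weights = weights / sqrt(2.0 * pi)

K = 5
gram = np.array([
    [np.sum(weights * hermite_normalized(i, nodes) * hermite_normalized(j, nodes))
     for j in range(K + 1)]
    for i in range(K + 1)
])
print(np.round(gram, 10))
```

The quadrature with 20 nodes is exact for polynomials of degree up to 39, so the Gram matrix of $\phi_{0},\ldots,\phi_{5}$ is recovered up to rounding error.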

Let $(\xi_{p})_{p\in\mathbb{N}}$ be i.i.d. $m$-dimensional random vectors with distribution $\mathsf{P}_{\xi}$. We denote by $(\mathcal{G}_{p})_{p\in\mathbb{N}_{0}}$ the filtration generated by $(\xi_{p})_{p\in\mathbb{N}}$, with the convention $\mathcal{G}_{0}=\mathrm{triv}$. For each $k\in\mathbb{N}$, let $\Phi_{k}:\mathbb{R}^{d}\times\mathbb{R}^{m}\to\mathbb{R}^{d}$ be a measurable function. Set for $l\leq p$ and $x\in\mathbb{R}^{d}$,

$X^{x}_{l,p}:=G_{l,p}(x,\xi_{l+1},\ldots,\xi_{p})$ (7)

with the functions $G_{l,p}:$ $\mathbb{R}^{d+m\times(p-l+1)}\to\mathbb{R}^{d}$ defined as

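$G_{l,p}(x,y_{l+1},\ldots,y_{p}):=\Phi_{p}(\cdot,y_{p})\circ\Phi_{p-1}(\cdot,y_{p-1})\circ\cdots\circ\Phi_{l+1}(x,y_{l+1})$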

with the convention $G_{l,l}(x)=x$ . Note that for any bounded measurable function $f$ , any $x\in\mathbb{R}^{d}$ and $l\leq p,\,l,p\in\mathbb{N}$ , it holds

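$\mathsf{E}[f(X^{x}_{p})\mid\mathcal{G}_{l}]=\int[f\circ G_{l,p}](X^{x}_{l},e_{l+1},\ldots,e_{p})\,P_{\xi}(\mathrm{d}e_{l+1})\cdots P_{\xi}(\mathrm{d}e_{p}).$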

We write $X^{x}_{p}$ and $G_{p}$ as a shorthand notation for $X^{x}_{0,p}$ and $G_{0,p}$ , respectively. We formulate the results below for bounded measurable functions, but these results can be easily extended to unbounded functions at the expense of using classical drift conditions to control the moments. For simplicity and readability, we leave this elementary extension to the reader.
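The composition $G_{l,p}$ is simply "apply the maps $\Phi_{l+1},\ldots,\Phi_{p}$ in order". A minimal sketch, with hypothetical linear maps $\Phi_{p}(x,y)=(1-g_{p})x+\sqrt{g_{p}}\,y$ chosen only for illustration, also checks the flow property $X^{x}_{0,p}=G_{l,p}(X^{x}_{0,l},\xi_{l+1},\ldots,\xi_{p})$ that underlies the conditional expectation formula above.

```python
import numpy as np

def compose(phis, x, ys):
    """G_{l,p}(x, y_{l+1}, ..., y_p): apply Phi_{l+1}, ..., Phi_p in that order."""
    state = x
    for phi, y in zip(phis, ys):
        state = phi(state, y)
    return state                         # empty lists give G_{l,l}(x) = x

# Hypothetical time-inhomogeneous maps Phi_p(x, y) = (1 - g_p) * x + sqrt(g_p) * y.
steps = [0.5, 0.3, 0.2, 0.1]
phis = [lambda x, y, g=g: (1.0 - g) * x + np.sqrt(g) * y for g in steps]

rng = np.random.default_rng(0)
ys = rng.standard_normal(len(steps))
x0, l = 1.0, 2

full = compose(phis, x0, ys)                          # X_{0,p}^x
x_l = compose(phis[:l], x0, ys[:l])                   # X_{0,l}^x
restart = compose(phis[l:], x_l, ys[l:])              # G_{l,p}(X_{0,l}^x, y_{l+1}, ..., y_p)
print(full, restart)
```

Because the state at time $l$ summarizes everything that happened before, restarting the composition from $X^{x}_{0,l}$ reproduces the full trajectory endpoint.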

Theorem 1.

For any $q\in\mathbb{N}$, any $j\in\mathbb{N}$ with $j<q$, any bounded Borel function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and any $x\in\mathbb{R}^{d}$, the following representation holds in $\mathrm{L}^{2}(\mathbb{R}^{mq},P^{\otimes q}_{\xi})$:

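$f(X^{x}_{q})=\mathsf{E}[f(X^{x}_{q})\mid\mathcal{G}_{j}]+\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}a_{q,l,k}(X^{x}_{l-1})\phi_{k}(\xi_{l}),$ (9)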

where $X^{x}_{q}$ is given by (7) and for any $y\in\mathbb{R}^{d}$ ,

$a_{q,l,k}(y)=\mathsf{E}[f(X^{y}_{l-1,q})\phi_{k}(\xi_{l})],\quad q\geq l,\quad k\in\mathbb{N}.$ (10)

Proof.

The proof is postponed to Section 7.2. ∎

Corollary.

Assume that $\Phi_{l}=\Phi$ , for all $l\geq 1$ . Then for any $q\in\mathbb{N}$ , $j<q$ , $f$ a bounded measurable function, and $x\in\mathbb{R}^{d}$ , it holds in $\mathrm{L}^{2}\bigl{(}\mathbb{R}^{mq},P^{\otimes q}_{\xi}\bigr{)}$

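$f(X^{x}_{q})=\mathsf{E}[f(X^{x}_{q})\mid\mathcal{G}_{j}]+\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}\bar{a}_{q-l+1,k}(X^{x}_{l-1})\phi_{k}(\xi_{l}),$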

where for all $y\in\mathbb{R}^{d}$ ,

$\bar{a}_{r,k}(y)=\mathsf{E}[f(X^{y}_{r})\phi_{k}(\xi_{1})],\quad r,k\in\mathbb{N}.$ (11)

Discussion

The representation (9) is remarkable for two reasons. First, it suggests a general way of constructing zero-mean random variables adapted to the filtration $(\mathcal{G}_{p})_{p\geq 0}.$ Indeed any random variable of the form

$\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}\beta_{q,l,k}(X^{x}_{l-1})\phi_{k}(\xi_{l})$

for some measurable functions $(\beta_{q,l,k})$ has zero mean (conditional on $\mathcal{G}_{j}$ ) and is adapted to $\mathcal{G}_{q-1}.$ Second, it shows that for coefficients defined in (10) the representation (9) computes exactly $f(X^{x}_{q}),$ that is, the control variate

$\sum_{k=1}^{\infty}\sum_{l=j+1}^{q}a_{q,l,k}(X^{x}_{l-1})\phi_{k}(\xi_{l})$

is perfect and leads to zero variance when computing $\mathsf{E}\left[\left.f(X^{x}_{q})\,\right|\mathcal{G}_{j}\right]$ by Monte Carlo. Another equivalent representation of the coefficients $a_{p,l,k}$ turns out to be more useful in practice.

Proposition.

Let $q\geq l,k\in\mathbb{N}$ . Then the coefficients $a_{q,l,k}$ in (10) can be alternatively represented as

$a_{q,l,k}(x)=\mathsf{E}[\phi_{k}(\xi)Q_{l,q}(\Phi_{l}(x,\xi))]$

with $Q_{l,q}(y)=\mathsf{E}\left[f(X^{y}_{l,q})\right],$ $q\geq l.$ In the homogeneous case $\Phi_{l}=\Phi$ , the coefficients $\bar{a}_{r,k}$ in (11) are given respectively for all $r\in\mathbb{N},$ by

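$\bar{a}_{r,k}(x)=\mathsf{E}[\phi_{k}(\xi)Q_{r-1}(\Phi(x,\xi))]\quad\text{with}\quad Q_{r}(y)=\mathsf{E}[f(X^{y}_{r})],\quad r\in\mathbb{N}.$ (13)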
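The representation of $\bar{a}_{r,k}$ as a one-step expectation of $\phi_{k}(\xi)Q_{r-1}(\Phi(x,\xi))$ can be evaluated by quadrature once $Q_{r-1}$ is known. The sketch below does this for an assumed linear Gaussian chain $\Phi(x,\xi)=(1-\gamma)x+\sqrt{\gamma}\,\xi$ with $f(x)=x^{2}$, for which $Q_{r}(y)=(1-\gamma)^{2r}y^{2}+\gamma\sum_{i=0}^{r-1}(1-\gamma)^{2i}$. The closed forms used for comparison, $\bar{a}_{r,1}(x)=2\sqrt{\gamma}(1-\gamma)^{2r-1}x$ and $\bar{a}_{r,2}(x)=\sqrt{2}\gamma(1-\gamma)^{2r-2}$, were derived here by expanding the square and are not quoted from the paper. All names and the value of `GAMMA` are hypothetical.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite_e import hermegauss, hermeval

GAMMA = 0.1                              # assumed step size
RHO = 1.0 - GAMMA

def phi_k(k, xi):
    # Normalized probabilists' Hermite polynomial He_k / sqrt(k!).
    c = np.zeros(k + 1)
    c[k] = 1.0
    return hermeval(xi, c) / sqrt(factorial(k))

def Phi(x, xi):
    # Assumed one-step map of the toy chain.
    return RHO * x + sqrt(GAMMA) * xi

def Q(r, y):
    # Q_r(y) = E[f(X_r^y)] for f(x) = x^2: squared mean plus accumulated noise variance.
    rho2r = RHO ** (2 * r)
    return rho2r * y**2 + GAMMA * (1.0 - rho2r) / (1.0 - RHO**2)

def a_bar(r, k, x, n_nodes=20):
    # a_bar_{r,k}(x) = E[phi_k(xi) * Q_{r-1}(Phi(x, xi))], xi ~ N(0, 1),
    # evaluated with a Gauss-Hermite rule normalized to a probability measure.
    nodes, weights = hermegauss(n_nodes)
    weights = weights / sqrt(2.0 * pi)
    return float(np.sum(weights * phi_k(k, nodes) * Q(r - 1, Phi(x, nodes))))

x, r = 1.5, 3
print("k=1:", a_bar(r, 1, x), "closed form:", 2 * sqrt(GAMMA) * RHO ** (2 * r - 1) * x)
print("k=2:", a_bar(r, 2, x), "closed form:", sqrt(2) * GAMMA * RHO ** (2 * r - 2))
print("k=3:", a_bar(r, 3, x), "(vanishes because f is quadratic)")
```

In general $Q_{r-1}$ is not available in closed form and has to be approximated, but the outer expectation over a single noise variable $\xi$ remains a low-dimensional integral.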

We now show how the representation (9) can be used to construct control variates for additive functionals of Markov chains. For the sake of clarity, in the sequel we consider only the time-homogeneous case ($\Phi_{l}=\Phi$ for all $l\in\mathbb{N}$). For $f$ a bounded measurable function, denote $\pi^{x}_{n}(f)=n^{-1}\sum_{p=1}^{n}f(X^{x}_{p})$, where $n\in\mathbb{N}$ is the number of samples. To avoid overloading the notation, the dependence on the initial condition $x$ is dropped when it can be inferred from the context. For any $q\in\mathbb{N}$, $k\in\mathbb{N}$, and $y\in\mathbb{R}^{d}$, set

$A_{q,k}(y)=\sum_{r=1}^{q}\bar{a}_{r,k}(y).$ (14)

The corollary above, applied with $j=0$, implies that for any $x\in\mathbb{R}^{d}$,

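$\pi^{x}_{n}(f)=\frac{1}{n}\sum_{q=1}^{n}\mathsf{E}[f(X^{x}_{q})]+\frac{1}{n}\sum_{k=1}^{\infty}M^{x}_{n,k},\quad\text{with}\quad M^{x}_{n,k}=\sum_{l=1}^{n}A_{n-l+1,k}(X^{x}_{l-1})\phi_{k}(\xi_{l}).$ (15)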

Since $\xi_{l}$ is independent of $\mathcal{G}_{l-1}$ , $X^{x}_{l-1}$ is $\mathcal{G}_{l-1}$ measurable, and $\mathsf{E}[\phi_{k}(\xi_{l})]=0,$ $k\neq 0$ , we get for any measurable function $g$ with $\mathsf{E}\left[g^{2}(X_{l-1}^{x})\right]<\infty$ that $\mathsf{E}[g(X^{x}_{l-1})\phi_{k}\left(\xi_{l}\right)]=$ $\mathsf{E}[g(X^{x}_{l-1})\mathsf{E}\left[\left.\phi_{k}\left(\xi_{l}\right)\,\right|\mathcal{G}_{l-1}\right]]=0$ . This implies that for any $k=1,\dots,K$ , $\bigl{(}M^{x}_{p,k}\bigr{)}_{p=1}^{\infty}$ is a square-integrable martingale sequence with respect to filtration $(\mathcal{G}_{p})_{p\geq 1}$ and hence that, for any $n,k\in\mathbb{N}$ and $x\in\mathbb{R}^{d}$ , $\mathsf{E}[M^{x}_{n,k}]=0$ . In addition, since $\mathsf{E}[\phi_{k}(\xi_{l})\phi_{k^{\prime}}(\xi_{l})]=0$ if $k\neq k^{\prime}$ , we obtain for any $1\leq k<k^{\prime}$ ,

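$\mathsf{Var}(M^{x}_{n,k})=\sum_{l=1}^{n}\mathsf{E}[A^{2}_{n-l+1,k}(X^{x}_{l-1})]\quad\text{and}\quad\mathsf{Cov}(M^{x}_{n,k},M^{x}_{n,k^{\prime}})=0.$ (16)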

Fix some $K\in\mathbb{N}$ and denote

$M^{(x,K)}_{n}=\sum_{k=1}^{K}M^{x}_{n,k}.$ (17)

The expansion (15) suggests considering the following estimator:

$\pi^{(x,K)}_{n}(f)=\pi^{x}_{n}(f)-n^{-1}M^{(x,K)}_{n}.$ (18)

By construction, for any $n\in\mathbb{N}$ and $x\in\mathbb{R}^{d}$ , $\mathsf{E}[\pi^{(x,K)}_{n}(f)]=\mathsf{E}[\pi^{x}_{n}(f)]$ as $\mathsf{E}[M^{(x,K)}_{n}]=0.$ Moreover, we obtain

[TABLE]

Hence we expect $\mathsf{Var}[\pi^{(x,K)}_{n}(f)]$ to be small, provided that $\mathsf{Var}[M^{x}_{n,k}]$ decay fast enough as $k\to\infty$ .

If the empirical mean estimator $\pi^{x}_{n}(f)$ is convergent in quadratic mean, the same is true for $\pi_{n}^{(x,K)}(f)$ . This is formalized in the following result. Denote by $P$ the Markov kernel of the Markov chain (2), defined for any bounded measurable function $f$ by $Pf(x)=\int f\circ\Phi(x,e)P_{\xi}(\mathrm{d}e)$ .

Proposition.

Assume that the Markov kernel $P$ has a unique invariant probability measure $\pi$ and that for any bounded measurable function $f$ , and $x\in\mathbb{R}^{d}$ ,

[TABLE]

Then, for any $K\in\mathbb{N}$ ,

[TABLE]

The proof of this proposition is an elementary consequence of (19) and is left to the reader. A direct consequence is that if the sequence of estimators $\{\pi_{n}^{x}(f)\}_{n=1}^{\infty}$ is consistent in quadratic mean, then $\{\pi_{n}^{(x,K)}(f)\}_{n=1}^{\infty}$ is also consistent in quadratic mean. For any $n\in\mathbb{N}$ and $x\in\mathbb{R}^{d}$, $\mathsf{E}[\pi_{n}^{x}(f)]=\mathsf{E}[\pi_{n}^{(x,K)}(f)]$, and the variance of $\pi_{n}^{(x,K)}(f)$ is never larger than that of $\pi_{n}^{x}(f)$.

Below is a simple illustrative example showing that $\mathsf{Var}[\pi^{(x,K)}_{n}(f)]$ can be much smaller than $\mathsf{Var}[\pi^{x}_{n}(f)]$ even for $K=1$ .

Example 3.1.

Suppose that we aim at sampling from the Gaussian distribution with zero mean and variance $1/2$, with density $\pi(x)=(\sqrt{\pi})^{-1}\mathrm{e}^{-x^{2}}$, using the ULA algorithm (see Example 2.2 and equation (6)). We consider the Markov chain given by

[TABLE]

where $(\xi_{p})_{p\geq 1}$ is an i.i.d. sequence of normally distributed random variables with zero mean and unit variance. The invariant distribution of this Markov chain is Gaussian with zero mean and variance $1/(2-\gamma)$ . As a complete orthogonal system in $\mathrm{L}^{2}(\mathbb{R},P_{\xi})$ , we consider the normalized Hermite polynomials on $\mathbb{R}$ , that is,

[TABLE]

Consider now the problem of estimating $\pi(f)$ for $f(x)=x^{2}$ . Note that

[TABLE]

and by recalling the definition of the Hermite polynomials, we arrive at the martingale representation

[TABLE]

where for $z\in\mathbb{R}$ and $q\in\mathbb{N}$ ,

[TABLE]

We stress that the coefficients $\bar{a}_{q,2}$ do not depend on $z$ in this special example. Due to (14) and (23), we can represent $\pi^{x}_{n}(f)$ as

[TABLE]

where, for any $z\in\mathbb{R}$ and $q\in\mathbb{N}$ ,

[TABLE]

The decomposition (25) implies that

[TABLE]

Hence, given that $\gamma<1$, we estimate

[TABLE]

which does not depend upon $x$ . On the other hand, $\mathsf{Var}\left(\pi^{x}_{n}(f)\right)=n^{-2}\sum_{i,j}\mathsf{Cov}{\bigl{(}f(X_{i}^{x}),f(X_{j}^{x})\bigr{)}}$ , and since $X_{i}^{x}$ and $X_{j}^{x}$ are Gaussian random variables, application of the Isserlis formula yields

[TABLE]

Using the identity $\mathsf{Cov}{\bigl{(}X_{i}^{x},X_{j}^{x}\bigr{)}}=\gamma\sum_{k=1}^{i\wedge j}(1-\gamma)^{i+j-2k}\geq(1/2)\bigl{(}(1-\gamma)^{|i-j|}-(1-\gamma)^{i+j}\bigr{)}$ , we get

[TABLE]

Thus, for $\gamma\in(0,1/2]$ , $\mathsf{Var}\left(\pi^{(x,1)}_{n}(f)\right)\Bigl{/}\mathsf{Var}\left(\pi^{x}_{n}(f)\right)\leq 4\gamma$ , and the variance reduction effect is large when $\gamma\downarrow 0^{+}$ . To make the variance reduction effect clear, we plot the ratio $\mathsf{Var}\left(\pi^{x}_{n}(f)\right)/\mathsf{Var}\bigl{(}\pi_{n}^{(x,1)}(f)\bigr{)}$ , computed according to (26) with $n=10^{4}$ , $x=1$ and different values of the step size $\gamma$ . The corresponding plots are provided in Figure 1. It illustrates first that the gain in variance indeed scales linearly in $\gamma$ for $\gamma$ small enough, and, second, that even for moderate values of $\gamma$ , estimate $\pi^{(x,1)}_{n}(f)$ is preferable in terms of variance. ∎
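The variance reduction in this example can also be checked by simulation. The sketch below assumes the chain $X_{p}=(1-\gamma)X_{p-1}+\sqrt{\gamma}\,\xi_{p}$ (consistent with the invariant variance $1/(2-\gamma)$ stated above) and $f(x)=x^{2}$. The coefficients $A_{q,1}(y)=2\sqrt{\gamma}(1-\gamma)\,s_{q}\,y$ and $A_{q,2}=\sqrt{2}\gamma\,s_{q}$ with $s_{q}=\sum_{r=1}^{q}(1-\gamma)^{2r-2}$ were derived here from (13) and (14) for this chain; they are not copied from the paper's displays. The function name `simulate` and all parameter values are illustrative.

```python
import numpy as np

def simulate(n, gamma, x0, n_rep, seed=0):
    """Plain and K=1 control-variate estimators of the ergodic average of x^2."""
    rng = np.random.default_rng(seed)
    rho = 1.0 - gamma
    q = np.arange(1, n + 1)
    s = (1.0 - rho ** (2 * q)) / (1.0 - rho**2)      # s_q = sum_{r=1}^q rho^(2r-2)
    c1 = 2.0 * np.sqrt(gamma) * rho * s              # A_{q,1}(y) = c1[q-1] * y
    c2 = np.sqrt(2.0) * gamma * s                    # A_{q,2}    = c2[q-1]
    mean_f = rho ** (2 * q) * x0**2 + gamma * s      # E[f(X_q^x)]

    xi = rng.standard_normal((n_rep, n))
    x_prev = np.empty((n_rep, n))                    # column l-1 holds X_{l-1}
    x_curr = np.empty((n_rep, n))                    # column l-1 holds X_l
    x = np.full(n_rep, float(x0))
    for l in range(n):
        x_prev[:, l] = x
        x = rho * x + np.sqrt(gamma) * xi[:, l]
        x_curr[:, l] = x

    plain = (x_curr**2).mean(axis=1)                 # pi_n^x(f)
    # Martingales M_{n,k} = sum_l A_{n-l+1,k}(X_{l-1}) * phi_k(xi_l); reversing
    # c1 and c2 aligns the index n-l+1 with column l-1.
    m1 = (c1[::-1] * x_prev * xi).sum(axis=1)
    m2 = (c2[::-1] * (xi**2 - 1.0) / np.sqrt(2.0)).sum(axis=1)
    reduced = plain - m1 / n                         # pi_n^{(x,1)}(f)
    residual = plain - mean_f.mean() - (m1 + m2) / n # should vanish path by path
    return plain, reduced, residual

plain, reduced, residual = simulate(n=500, gamma=0.1, x0=1.0, n_rep=2000)
print("max |residual| of the expansion:", np.max(np.abs(residual)))
print("variance, plain estimator      :", plain.var())
print("variance, K=1 control variate  :", reduced.var())
```

Since $f$ is quadratic, only $k=1$ and $k=2$ contribute, so the expansion closes exactly and the residual is zero up to rounding. Subtracting only the first martingale already removes the part of the variance that depends on the chain's autocorrelation through $X_{l-1}$.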

4 Martingale Decomposition Control Variate (MAD-CV) algorithm

We now describe an algorithm to estimate the martingale $(M^{(x,K)}_{n})$ introduced in (17). To keep the computational complexity at a reasonable level, we estimate a fixed number of coefficients $\bar{a}_{r,k}$ for $r=1,\ldots,n_{0}$ , where the truncation index $n_{0}$ does not depend on $n$ . The corresponding martingale then is written as $M^{(x,K)}_{n,n_{0}}=\sum_{k=1}^{K}M^{x}_{n,k,n_{0}}$ where

[TABLE]

and we consider the truncated version of the estimator (18):

[TABLE]

The truncation leads to an increase of the variance. More precisely, proceeding as in (19), we obtain

[TABLE]

where $A_{r,k}$ is defined in (14). To illustrate the effect of truncation on the variance reduction, we compare $\mathsf{Var}[\pi^{(x,K)}_{n,n_{0}}(f)]$ and $\mathsf{Var}[\pi^{(x,K)}_{n}(f)]$ using the simple Example 3.1.

Example 3.1 (continued).

We now consider the estimate $\pi^{(x,1)}_{n,n_{0}}(f)=\pi^{x}_{n}(f)-n^{-1}M^{x}_{n,1,n_{0}}$ where $M^{x}_{n,1,n_{0}}$ is the truncated martingale

[TABLE]

where $\bar{a}_{l,i}$ , $i=1,2$ is defined in (24). Then, proceeding as in Example 3.1, we obtain

[TABLE]

where $M_{n,2}^{x}$ is defined in (25). Plugging (24) in the previous identity, we get

[TABLE]

Setting $n_{0}=\lceil\log{\gamma^{-1}}/(4\gamma)\rceil$ , we obtain $R^{x}_{\gamma,n_{0}}\lesssim 1/n+x^{2}/(n^{2}\gamma)$ for $n\gamma>1$ , yielding the same (up to a constant factor) variance reduction factor, that is,

[TABLE]

Here we write $\gtrsim$ and $\lesssim$ for inequality up to a constant not depending on $n,\gamma$ and $x$ .

The last step is to define an estimator of the coefficients $A_{q,k,n_{0}}$ . In the previous example, the calculation of these coefficients is explicit, but this is obviously not the case in general. We propose to use the representation outlined in (13) of the functions $\bar{a}_{r,k}$ . This representation suggests first approximating the $r$ -th step predictor $Q_{r}(y)=\mathsf{E}\left[f(X^{y}_{r})\right]$ , $y\in\mathbb{R}^{d}$ , for $r\in\{0,\dots,n_{0}-1\}$ . For that purpose, we consider a parametric family of functions from $\mathbb{R}^{d}$ to $\mathbb{R}$ , denoted $\{Q_{r,\bm{\beta}},\bm{\beta}\in\mathcal{B},r\in\{0,\dots,n_{0}-1\}\}$ , where $\mathcal{B}\subset\mathbb{R}^{b_{0}}$ . There are many ways to define such a family of functions. The simplest idea is to select a family of functions $\{\psi_{b}\}_{b=1}^{b_{0}}$ , $\psi_{b}:\mathbb{R}^{d}\to\mathbb{R}$ and to set

[TABLE]

However, this is not necessarily the best choice when the prediction functions $Q_{r}$ have a specific structure. For example, for Metropolis-Hastings algorithms, the sampling step uses an accept/reject step. In such case, it is more appropriate to consider predictors of the form

[TABLE]

In this decomposition, $\bar{\alpha}_{r,\bm{\beta}}(y)$ estimates the $r$ -th step rejection probability, i.e. the probability of observing $r$ successive rejections. This probability can be estimated by logistic regression.

The parameter vectors $\bm{\beta}$ are estimated via the least-squares approach. More precisely, for each $r\in\{0,\dots,n_{0}-1\}$ , we solve

[TABLE]
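To make the regression step concrete, here is a minimal sketch of fitting the predictors $Q_{r}$ with a linear basis. It assumes that the least-squares objective pairs the basis evaluated at $X_{p}$ with the response $f(X_{p+r})$ along a single training trajectory, and it uses the one-dimensional chain $X_{p}=(1-\gamma)X_{p-1}+\sqrt{\gamma}Z_{p}$ with $f(x)=x^{2}$ purely as an illustration. The helper names `simulate_chain`, `design` and `fit_predictors` are hypothetical.

```python
import numpy as np


def simulate_chain(n, gamma, x0, rng):
    """Illustrative chain X_p = (1-gamma) X_{p-1} + sqrt(gamma) Z_p (assumed)."""
    xs = np.empty(n + 1)
    xs[0] = x0
    z = rng.standard_normal(n)
    for p in range(1, n + 1):
        xs[p] = (1.0 - gamma) * xs[p - 1] + np.sqrt(gamma) * z[p - 1]
    return xs


def design(x):
    """Basis functions {1, x, x^2} evaluated at the points x."""
    return np.column_stack([np.ones_like(x), x, x ** 2])


def fit_predictors(xs, f, n0):
    """Least-squares fit of Q_r(y) ~ E[f(X_r^y)] for r = 0, ..., n0 - 1.

    Assumed objective: sum over p of (f(X_{p+r}) - <beta_r, psi(X_p)>)^2.
    Returns an array of shape (n0, 3), one coefficient vector per lag r.
    """
    betas = []
    for r in range(n0):
        inputs = xs[: len(xs) - r]   # X_p
        targets = f(xs[r:])          # f(X_{p+r})
        beta, *_ = np.linalg.lstsq(design(inputs), targets, rcond=None)
        betas.append(beta)
    return np.array(betas)


rng = np.random.default_rng(0)
gamma = 0.2
traj = simulate_chain(50_000, gamma, x0=1.0, rng=rng)
betas = fit_predictors(traj, lambda x: x ** 2, n0=3)
```

For this toy chain the exact one-step predictor is $Q_{1}(y)=(1-\gamma)^{2}y^{2}+\gamma$, so the fitted coefficients in `betas[1]` can be compared against $(\gamma,0,(1-\gamma)^{2})$.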

Finally, we compute the estimates $\widehat{a}_{r,k}$ of the functions $\bar{a}_{r,k}$ (see (13)). Namely, for all $y\in\mathbb{R}^{d}$ we define

[TABLE]

where $\Phi$ is defined in (2). Note that in some relevant cases (e.g. for $\Phi$ being linear in $z$ and the regression function being a linear combination of basis functions as in (30)), the expectation in (33) can be computed in closed form. When direct integration is not an option, we use Monte Carlo or Quasi Monte Carlo to compute the integrals $\int\phi_{k}(z)\psi_{b}(\Phi(y,z))P_{\xi}(\mathrm{d}z)$ . The complexity of this parametric integration problem is well studied. In order to increase efficiency, one can also employ the Multilevel Monte Carlo approach (see Heinrich and Sindambiwe [1999]). The estimator obtained by plugging (33) into (27) and (28) is referred to as the MAD-CV (MArtingale Decomposition Control Variate) estimator.
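When the noise is Gaussian and scalar, the integral defining $\widehat{a}_{r,k}$ can be approximated by Gauss-Hermite quadrature, which is exact for polynomial integrands of moderate degree. The sketch below assumes that $\phi_{k}$ is the probabilists' Hermite polynomial normalized by $\sqrt{k!}$, uses $\Phi(y,z)=(1-\gamma)y+\sqrt{\gamma}z$ as an illustrative map that is linear in $z$, and takes $q(x)=x^{2}$ as a stand-in for a fitted predictor. The names `hermite_normalized` and `a_hat` are hypothetical.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def hermite_normalized(k, z):
    """He_k(z) / sqrt(k!), orthonormal under N(0, 1) (assumed normalization)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(z, coeffs) / math.sqrt(math.factorial(k))


def a_hat(k, y, q, phi, n_nodes=20):
    """Approximate E[H_k(Z) q(phi(y, Z))] for scalar Z ~ N(0, 1)."""
    nodes, weights = hermegauss(n_nodes)           # weight function exp(-z^2 / 2)
    weights = weights / math.sqrt(2.0 * math.pi)   # rescale to the N(0, 1) density
    values = hermite_normalized(k, nodes) * q(phi(y, nodes))
    return float(np.sum(weights * values))


gamma = 0.2
phi = lambda y, z: (1.0 - gamma) * y + math.sqrt(gamma) * z  # linear in z
q = lambda x: x ** 2                                          # stand-in predictor
a1 = a_hat(1, 1.0, q, phi)
a2 = a_hat(2, 1.0, q, phi)
```

In this toy setting the values are available in closed form, $2(1-\gamma)\sqrt{\gamma}\,y$ for $k=1$ and $\sqrt{2}\gamma$ for $k=2$; the second does not depend on $y$.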

The resulting estimate

[TABLE]

with

[TABLE]

remains unbiased for $\pi(f)$ (if computed on a new trajectory independent of regression data) and has a variance

[TABLE]

While the first and the second terms will be studied in Section 5 in some special cases, the last one has to be analyzed separately for different approximation schemes, and we leave this analysis for future research. Nevertheless, this decomposition already shows that, under the conditions of Proposition 3, it holds for a fixed $K>0,$

[TABLE]

provided that the expectations $\mathsf{E}[A^{2}_{n-l+1,k}],$ $\mathsf{E}[\widehat{A}^{2}_{n-l+1,k,n_{0}}]$ are uniformly bounded for $l=1,\ldots,n,$ in $n\in\mathbb{N}$ and $k=1,\ldots,K$ . The latter property of the estimates $\widehat{A}_{n-l+1,k,n_{0}}$ can be achieved by using an additional truncation step in regression, see e.g. Györfi et al. [2006] for various truncation schemes.

5 Gaussian noise model

We analyze the MAD-CV algorithm for the Markov chains $(X^{x}_{p})_{p\geq 0}$ driven by a normal noise, that is,

[TABLE]

For a multi-index $\mathbf{k}=(k_{i})\in\mathbb{N}_{0}^{d}$ , we denote by $\mathbf{H}_{\mathbf{k}}(x)$ the normalized Hermite polynomial on $\mathbb{R}^{d}$ , that is, $\mathbf{H}_{\mathbf{k}}(x):=\prod_{i=1}^{d}H_{k_{i}}(x_{i}),\,x=(x_{i})\in\mathbb{R}^{d}$ with $H_{k_{i}}$ defined in (22). The following notations are used in the sequel: $\|\mathbf{k}\|=\max\limits_{i\in\{1,\ldots,d\}}k_{i}$ , $|\mathbf{k}|=\sum_{i=1}^{d}k_{i}$ and $\mathbf{k}!:=k_{1}!\dots k_{d}!$ . In this case $a_{r,\mathbf{k}}(y)=\mathsf{E}\left[f(X^{y}_{r})\mathbf{H}_{\mathbf{k}}\left(Z_{1}\right)\right]$ , $A_{q,\mathbf{k}}(y)=\sum_{r=1}^{q}\bar{a}_{r,\mathbf{k}}(y)$ and the martingale $M^{(x,K)}_{n}$ takes the form

[TABLE]
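The multi-index notation can be illustrated with a short sketch. It assumes that the normalized Hermite polynomial of (22) is the probabilists' polynomial $He_{k}$ divided by $\sqrt{k!}$; the function names are hypothetical.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermeval


def max_norm(k):
    """||k|| = max_i k_i."""
    return max(k)


def abs_sum(k):
    """|k| = sum_i k_i."""
    return sum(k)


def multi_factorial(k):
    """k! = k_1! * ... * k_d!."""
    return math.prod(math.factorial(k_i) for k_i in k)


def hermite_multi(k, x):
    """H_k(x) = prod_i He_{k_i}(x_i) / sqrt(k_i!) (assumed normalization)."""
    value = 1.0
    for k_i, x_i in zip(k, x):
        coeffs = np.zeros(k_i + 1)
        coeffs[k_i] = 1.0
        value *= hermeval(x_i, coeffs) / math.sqrt(math.factorial(k_i))
    return float(value)
```

For instance, with $\mathbf{k}=(1,2)$ and $x=(1,2)$ the product is $He_{1}(1)\cdot He_{2}(2)/\sqrt{2}=3/\sqrt{2}$.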

Recall that $\mathcal{G}_{p}=\sigma(Z_{1},\ldots,Z_{p})$ , $p\in\mathbb{N}$ , and $\mathcal{G}_{0}=\mathrm{triv}$ . For a twice differentiable function $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , we denote by $D^{2}g(x)$ its Hessian at point $x$ . For a smooth function $g\colon\mathbb{R}^{d}\to\mathbb{R}$ and a multi-index $\mathbf{k}\in\mathbb{N}_{0}^{d}$ , we use the notation $g^{(\mathbf{k})}(x)$ for the partial derivative

[TABLE]

For $m\in\mathbb{N}$ , a smooth function $h\colon\mathbb{R}^{d\times m}\to\mathbb{R}$ with arguments being denoted $(z_{1},\ldots,z_{m})$ , $z_{i}\in\mathbb{R}^{d}$ , $i=1,\ldots,m$ , a multi-index $\mathbf{k}=(k_{i})\in\mathbb{N}_{0}^{d}$ , and $j\in\{1,\ldots,m\}$ , we use the notation $\partial^{\mathbf{k}}_{z_{j}}h$ for the multiple derivative of $h$ with respect to the components of $z_{j}$ :

[TABLE]

where $z_{1:m}=(z_{1},\dots,z_{m})$ and $\partial^{k_{s}}_{z_{j,s}}h(z_{1:m})$ stands for the partial derivative of order $k_{s}$ of the function $h$ with respect to the $s$ -th coordinate of the vector $z_{j}.$ For $m\in\mathbb{N}$ and $j\leq m$ , we denote

[TABLE]

By setting $G_{p}=G_{0,p}$ , where $G_{0,p}$ is defined in (8), we obtain $f\left(X_{p}^{x}\right)=[f\circ G_{p}](x,Z_{1:p})$ .

We preface the derivations with an auxiliary lemma which provides us with a representation for the coefficients $\bar{a}_{p,\mathbf{k}}$ defined in (13). Let $K\in\mathbb{N}$ and consider the following assumptions

H 1.

The function $\Phi:\mathbb{R}^{d}\times\mathbb{R}^{m}\to\mathbb{R}^{d}$ is $K\times d$ times continuously differentiable.

H 2.

The function $f:\mathbb{R}^{d}\to\mathbb{R}$ is $K\times d$ times continuously differentiable.

Lemma.

Assume H 1 and H 2 and that (36) holds. Then for any $\mathbf{k},\mathbf{k}^{\prime}\in\mathbb{N}_{0}^{d}$ such that $\mathbf{k}^{\prime}\leq\mathbf{k}$ componentwise and $\|\mathbf{k}^{\prime}\|\leq K$ , any $x\in\mathbb{R}^{d}$ , the following representation holds

[TABLE]

Proof.

The proof is postponed to Section 7.3. ∎

Under some additional smoothness assumptions one can derive a useful bound for the sum of the functions $A^{2}_{q,\mathbf{k}}$ which can be directly used to bound the variance of additive functionals (19).

Proposition.

Assume H 1, H 2, and that (36) holds. Then for any $x\in\mathbb{R}^{d}$ , it holds

[TABLE]

where for a non-empty subset $I\subseteq\{1,\ldots,d\}$ , we denote $\mathbf{K}_{I}=K(\mathbbm{1}_{\{1\in I\}}\,\ldots,\mathbbm{1}_{\{d\in I\}})$ .

Proof.

The proof is postponed to Section 7.4. ∎

We aim at applying our main result to the estimation of expectations under the stationary distribution of ergodic diffusion processes. Let $b(x)=(b_{1}(x),\dots,b_{d}(x))$ be a drift function, $(\mathsf{W}_{t})_{t\geq 0}$ be a $d-$ dimensional Wiener process and assume that the stochastic differential equation

[TABLE]

admits a unique strong solution $(\mathsf{X}^{x}_{t})_{t\geq 0}$ for any $x\in\mathbb{R}^{d}$ . We consider the Euler-Maruyama discretization of the SDE (40), i.e. the homogeneous Markov chain $(X_{k}^{x})_{k\geq 0}$ , starting from $X_{0}^{x}=x\in\mathbb{R}^{d}$ and defined by the following recursion: for any $k\in\mathbb{N}$ ,

[TABLE]

where $\gamma>0$ is a stepsize and $(Z_{k})_{k\in\mathbb{N}}$ is a sequence of i.i.d. $d-$ dimensional Gaussian random variables with zero mean and identity covariance matrix. Note that the recurrence (41) is a particular case of the general scheme (36) with $\Phi(x,z)=x-\gamma b(x)+\sqrt{\gamma}z$ . We impose some standard technical conditions on the drift function $b$ , following Bortoli and Durmus [2020], namely,

H 3.

There exists a constant $L>0$ such that $\|b(x)-b(y)\|\leq L\|x-y\|$ for any $x,y\in\mathbb{R}^{d}$ .

H 4.

There exists a constant $m>0$ such that $\langle b(x)-b(y),x-y\rangle\geq m\|x-y\|^{2}$ for any $x,y\in\mathbb{R}^{d}$ .

Under the assumptions H 3 and H 4, the preceding proposition can be used to bound the variance of additive functionals of the Markov chains of the form (41).
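The recursion (41), written with $\Phi(x,z)=x-\gamma b(x)+\sqrt{\gamma}z$, can be sketched as follows. The drift $b(x)=x$ is only an illustrative choice (it satisfies H 3 with $L=1$ and H 4 with $m=1$), and the function name `euler_maruyama` is hypothetical.

```python
import numpy as np


def euler_maruyama(b, x0, gamma, n, rng):
    """Simulate X_k = X_{k-1} - gamma * b(X_{k-1}) + sqrt(gamma) * Z_k.

    Returns an array of shape (n + 1, d) whose first row is the start x0.
    """
    x0 = np.asarray(x0, dtype=float)
    path = np.empty((n + 1, x0.size))
    path[0] = x0
    for k in range(1, n + 1):
        z = rng.standard_normal(x0.size)  # Z_k ~ N(0, I_d)
        path[k] = path[k - 1] - gamma * b(path[k - 1]) + np.sqrt(gamma) * z
    return path


gamma = 0.1
path = euler_maruyama(lambda x: x, np.zeros(2), gamma, 20_000, np.random.default_rng(1))
```

With this linear drift each coordinate is an autoregressive chain whose stationary variance is $1/(2-\gamma)$, which gives a quick sanity check on the simulated path.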

Theorem 2.

Let $(X_{k}^{x})_{k\geq 0}$ be a Markov chain given by the recurrence (41), and assume that H 2, H 3, and H 4 hold. Let $K\in\mathbb{N}$ . Assume in addition that there exist constants $C_{f}$ and $C_{b}$ , such that for any $x\in\mathbb{R}^{d}$ , any multi-index $\mathbf{k}\in\mathbb{N}_{0}^{d}$ with $0<\|\mathbf{k}\|\leq K$ , and any $u\in\{1,\dots,d\}$ ,

[TABLE]

Then, for $0<\gamma<\min(1/C_{b},m/\mathrm{L}^{2})$ and any $n\in\mathbb{N}$ ,

[TABLE]

Moreover, with the truncation point $n_{0}(\gamma)=\lceil K\log{\gamma^{-1}}/(2m\gamma)\rceil$ , the variance of the truncated estimate $\pi_{n,n_{0}(\gamma)}^{(x,K)}(f)$ can be bounded as

[TABLE]

where $\lesssim$ stands for inequality up to a constant not depending on $\gamma$ and $n$ .

Proof.

The proof is postponed to Section 7.5. ∎
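The truncation point of Theorem 2 is easy to tabulate. The sketch below computes $n_{0}(\gamma)=\lceil K\log\gamma^{-1}/(2m\gamma)\rceil$ and checks the condition $(1-\gamma m)^{2n_{0}-2}\leq\gamma^{K-1}$ that the proof in Section 7.5 imposes on $n_{0}$. The values of $\gamma$, $K$ and $m$ are illustrative, and the function names are hypothetical.

```python
import math


def truncation_point(gamma, K, m):
    """n0(gamma) = ceil(K * log(1 / gamma) / (2 * m * gamma))."""
    return math.ceil(K * math.log(1.0 / gamma) / (2.0 * m * gamma))


def satisfies_decay(gamma, K, m):
    """Check (1 - gamma * m)^(2 * n0 - 2) <= gamma^(K - 1) for n0 = n0(gamma)."""
    n0 = truncation_point(gamma, K, m)
    return (1.0 - gamma * m) ** (2 * n0 - 2) <= gamma ** (K - 1)


# Small table of truncation points for m = 1 (illustrative values only).
table = {
    (gamma, K): truncation_point(gamma, K, m=1.0)
    for gamma in (0.01, 0.05, 0.1, 0.2)
    for K in (1, 2, 3)
}
```

The truncation point grows like $\gamma^{-1}\log\gamma^{-1}$, so smaller step sizes require more regression problems to be solved.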

Theorem 2 shows that under some conditions the variance of the estimate $\pi^{(x,K)}_{n}(f)$ in the diffusion case (41) satisfies

[TABLE]

At the same time, the variance of the standard Monte Carlo estimate $\pi^{x}_{n}(f)$ is of order $1/(n\gamma)$ and this order cannot be reduced in general, see Example 3.1. Thus, for $K\geq 2$ and $\gamma$ small enough we have a clear variance reduction effect.

Remark.

In the particular case of the Unadjusted Langevin algorithm (Example 2.2), the assumptions of Proposition 2 can be verified for a smooth and strongly convex potential $U$ , that is, for $U\in C^{2}(\mathbb{R}^{d})$ and

[TABLE]

for some $m_{U}>0,\,M_{U}>0$ , and any $x,y\in\mathbb{R}^{d}$ .

6 Numerical experiments

In this section we evaluate our MAD-CV control variates on different model examples. Code to reproduce the experiments is available at https://github.com/svsamsonov/MAD-CV.

6.1 Example 3.1 (continued)

In this subsection we complete Example 3.1 by evaluating the estimator $\hat{\pi}^{(x,1)}_{n,n_{0}}(f)=\pi^{x}_{n}(f)-n^{-1}\hat{M}^{x}_{n,1,n_{0}}$ . Recall that we consider samples from the Gaussian distribution with density $\pi(x)=(\sqrt{\pi})^{-1}\mathrm{e}^{-x^{2}}$ using the ULA algorithm (see (21)) and take $f(x)=x^{2}$ . We use different step sizes $\gamma\in[0.05,0.5]$ and sample a training trajectory of length $5\times 10^{4}$ for each step size. We solve the least squares problems (32) with basis $\{1,x,x^{2}\}$ and the truncation points $n_{0}=5+\lceil\log{\gamma^{-1}}/(4\gamma)\rceil$ . Then we construct the control variate $\widehat{M}^{(x,1)}_{n,n_{0}}$ defined in (35). Our goal is to compare the variance of the truncated estimator $\hat{\pi}^{(x,1)}_{n,n_{0}}(f)$ to the variance of the "perfect" (with exact coefficients $\bar{a}$ ) estimator $\pi^{(x,1)}_{n,n_{0}}(f)$ . To this end, we report two quantities. The first one is the ratio $\mathsf{Var}\left[\pi^{x}_{n}(f)\right]\Bigl{/}\mathsf{Var}\left[\pi^{(x,1)}_{n,n_{0}}(f)\right]$ , which can be computed analytically for different $\gamma\in[0.05,0.5]$ . The second one is the sample counterpart of the ratio $\mathsf{Var}\left[\pi^{x}_{n}(f)\right]\Bigl{/}\mathsf{Var}\left[\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\right]$ , computed over $100$ independent replications of $100$ test trajectories, each of length $n=1\times 10^{4}$ . The left panel of Figure 2 contains error bars for $\mathsf{Var}\left[\pi^{x}_{n}(f)\right]\Bigl{/}\mathsf{Var}\left[\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\right]$ and indicates that the use of regression does not lead to a significant drop in the algorithm's performance.

Next we aim to illustrate the results of Theorem 2 by comparing $\mathsf{Var}\bigl{[}\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\bigr{]}$ to $\mathsf{Var}\bigl{[}\hat{\pi}^{(x,2)}_{n,n_{0}}(f)\bigr{]}$ . We fix $f(y)=\sin{y}$ and use different step sizes $\gamma\in[0.1,0.5]$ . The least squares problems (32) are solved with the regressors $\{1,x,x^{2},x^{3},x^{4}\}$ . Following Theorem 2, we set the truncation points $n_{0}=10+\lceil\log{\gamma^{-1}}/(2\gamma)\rceil$ for $\hat{\pi}^{(x,1)}_{n,n_{0}}(f)$ and $n_{0}=10+\lceil\log{\gamma^{-1}}/\gamma\rceil$ for $\hat{\pi}^{(x,2)}_{n,n_{0}}(f)$ , respectively. We compute the sample variance reduction factors $\mathsf{Var}\bigl{[}\pi^{x}_{n}(f)\bigr{]}\Bigl{/}\mathsf{Var}\bigl{[}\hat{\pi}^{(x,K)}_{n,n_{0}}(f)\bigr{]},K=1,2$ over $100$ independent trajectories of length $2\times 10^{3}$ , repeat this procedure $100$ times and report the averaged variance reduction factors in the upper right panel of Figure 2. To highlight the gain of the estimate $\hat{\pi}^{(x,2)}_{n,n_{0}}(f)$ , we report on the lower right panel of Figure 2 the averaged ratios $\mathsf{Var}\bigl{[}\hat{\pi}^{(x,1)}_{n,n_{0}}(f)\bigr{]}/\mathsf{Var}\bigl{[}\hat{\pi}^{(x,2)}_{n,n_{0}}(f)\bigr{]}.$ Note that they scale approximately as $1/\gamma$ , as predicted by Theorem 2.

6.2 Comparison with vanilla ULA

We compare the variance reduction versus cost achieved by MAD-CV against plain Monte Carlo for the ULA algorithm. We consider samples generated by ULA, where $\pi$ is either the standard normal distribution in dimension $d$ or the mixture of two $d-$ dimensional standard Gaussian distributions of the form

[TABLE]

We fix $d=2$ and $\mu=(0.5,0.5)$ . In both examples, our goal is to estimate $\pi(f)$ with $f(x)=x_{1}+x_{2}$ and $f(x)=x_{1}^{2}+x_{2}^{2}$ . We use a constant step size $\gamma=0.2$ and sample a training trajectory of length $5\times 10^{4}$ with the starting point $X_{0}=x=(1,1)$ . Then we solve the least squares problems (32) with the class of regressors $\{x_{1},x_{2},x_{1}^{2},x_{1}x_{2},x_{2}^{2}\}$ for the different choices of truncation point $n_{0}\in[2,20]$ . We construct the control variate $M_{n,n_{0}}^{(x,K)}$ , defined in (37). We finally estimate the cost-to-variance ratio (degree of variance reduction relative to costs) as follows

[TABLE]

by its empirical counterpart, computed over $100$ independent trajectories, each of length $n=5\times 10^{4}$ . Note that for $2-$ dimensional standard Gaussian vector $Z=(Z_{1},Z_{2})$ , for multi-indices $\mathbf{k}=(k_{1},k_{2})\in\{(2,1),(1,2),(2,2)\}$ it holds that

[TABLE]

for any $\psi(x)\in\{x_{1},x_{2},x_{1}^{2},x_{1}x_{2},x_{2}^{2}\}$ . This implies that the coefficients $\bar{a}_{r,\mathbf{k}}(x)=0$ , for any $x\in\mathbb{R}^{d}$ and $\mathbf{k}=(k_{1},k_{2})\in\{(2,1),(1,2),(2,2)\}$ . Since for a fixed $K$ the cost of computing $\pi^{x}_{n}(f)$ is proportional to the cost of computing function $f$ , we set for $K=1$

[TABLE]

since for the fixed $r$ , each coefficient $\bar{a}_{r,\mathbf{k}}(y)$ is a polynomial, which can be computed at the same cost as $f$ . Similarly, for $K=2$ we set

[TABLE]

since we need to evaluate $5n_{0}$ coefficients in addition to each evaluation of $f$ . Variance reduction costs for different truncation points $n_{0}$ are summarized in Figure 3 for the Gaussian distribution and in Figure 4 for the Gaussian mixture. Note that for both examples MAD-CV allows us to obtain a significant gain in terms of cost-to-variance ratios.
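The vanishing of the coefficients with multi-indices $(2,1)$, $(1,2)$ and $(2,2)$ for quadratic regressors can be verified by tensor Gauss-Hermite quadrature. The sketch below assumes that the expectation in question is $\mathsf{E}[\mathbf{H}_{\mathbf{k}}(Z)\psi(y+\sqrt{\gamma}Z)]$, where $y$ stands for the drift-shifted point $x-\gamma b(x)$, and that the Hermite polynomials are normalized by $\sqrt{k!}$. The function names are hypothetical.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def hermite_normalized(k, z):
    """He_k(z) / sqrt(k!) evaluated elementwise (assumed normalization)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(z, coeffs) / math.sqrt(math.factorial(k))


def coefficient(k, psi, y, gamma, n_nodes=10):
    """E[H_{k1}(Z1) H_{k2}(Z2) psi(y + sqrt(gamma) Z)] for Z ~ N(0, I_2)."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)       # N(0, 1) weights
    z1, z2 = np.meshgrid(nodes, nodes, indexing="ij")  # tensor grid
    w2 = np.outer(weights, weights)
    x1 = y[0] + math.sqrt(gamma) * z1
    x2 = y[1] + math.sqrt(gamma) * z2
    basis = hermite_normalized(k[0], z1) * hermite_normalized(k[1], z2)
    return float(np.sum(w2 * basis * psi(x1, x2)))


regressors = [
    lambda a, b: a,
    lambda a, b: b,
    lambda a, b: a ** 2,
    lambda a, b: a * b,
    lambda a, b: b ** 2,
]
mixed = [(2, 1), (1, 2), (2, 2)]
largest = max(
    abs(coefficient(k, psi, (1.0, 1.0), 0.2)) for k in mixed for psi in regressors
)
```

The reason is orthogonality: each regressor is a polynomial of total degree at most two in $Z$, while the listed Hermite products have total degree at least three.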

6.3 Random Walk Metropolis (RWM) example

We illustrate the application of MAD-CV to the RWM algorithm. RWM is an MCMC algorithm using a random walk proposal. Let $\{U_{p}\}_{p=1}^{\infty}$ and $\{Z_{p}\}_{p=1}^{\infty}$ be independent i.i.d. sequences, with $U_{p}\sim\mathrm{Unif}[0,1]$ and $Z_{p}\sim\mathcal{N}(0,\mathrm{I}_{d})$ . Then the $p$ -th RWM iterate writes as

[TABLE]

where $\alpha(x,y)=\min\bigl{\{}1,\pi(y)/\pi(x)\bigr{\}}$ is the acceptance ratio. In this experiment $\pi$ is set to be the standard normal distribution in dimension $d=2$ . We aim to estimate $\pi(f)$ with $f(x)=x_{1}^{2}+x_{2}^{2}$ , using RWM. The variance of the incremental distribution is determined by $\gamma=1.0$ , which leads to an acceptance rate in stationarity of approximately $0.55$ . We sample a training trajectory of length $N=10^{6}$ , and solve the regression problem (32) with the polynomial regressors $x_{1}^{d_{1}}x_{2}^{d_{2}}$ , $d_{1}+d_{2}\leq 4$ . We illustrate that MAD-CV can benefit when taking into account both randomness in $U_{p}$ and $Z_{p}$ . Namely, for a multi-index $\mathbf{k}=(k_{1},k_{2},k_{3})$ , $z\in\mathbb{R}^{2}$ and $u\in\mathbb{R}$ , we consider basis functions

[TABLE]

with $(P_{k})_{k\in\mathbb{N}}$ being shifted Legendre polynomials on $[0,1]$ . We use QMC to evaluate the corresponding functions $\widehat{a}_{r+1,\mathbf{k}}$ in (33). We write $\widehat{\pi}_{n,n_{0}}^{(x,\mathbf{K})}(f),\mathbf{K}=(K_{1},K_{2},K_{3})$ for the version of estimator (34) based on the coefficients $\widehat{a}_{r+1,\mathbf{k}}$ , $\mathbf{k}\leq\mathbf{K}$ .
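A single RWM move driven by the noise pair $(Z_{p},U_{p})$ can be sketched as follows. The sketch assumes that the proposal is $X_{p-1}+\sqrt{\gamma}Z_{p}$ and that it is accepted when $U_{p}\leq\alpha(X_{p-1},\text{proposal})$; the target is the standard normal distribution in dimension $d=2$ as in the experiment. The function names are hypothetical.

```python
import numpy as np


def log_target(x):
    """Log-density of N(0, I_d), up to an additive constant."""
    return -0.5 * np.sum(x ** 2)


def rwm_step(x, z, u, gamma):
    """One random walk Metropolis move driven by the noise pair (z, u)."""
    proposal = x + np.sqrt(gamma) * z                      # assumed proposal scale
    log_alpha = min(0.0, log_target(proposal) - log_target(x))
    accepted = bool(u <= np.exp(log_alpha))                # alpha = min(1, pi(y) / pi(x))
    return (proposal if accepted else x), accepted


def run_rwm(x0, gamma, n, rng):
    """Run n RWM steps; return the path and the empirical acceptance rate."""
    x = np.asarray(x0, dtype=float)
    path = np.empty((n + 1, x.size))
    path[0] = x
    n_accepted = 0
    for p in range(1, n + 1):
        x, accepted = rwm_step(x, rng.standard_normal(x.size), rng.uniform(), gamma)
        path[p] = x
        n_accepted += int(accepted)
    return path, n_accepted / n


path, rate = run_rwm([0.0, 0.0], gamma=1.0, n=20_000, rng=np.random.default_rng(2))
```

Writing the step as a deterministic function of $(x,z,u)$ mirrors the representation $X_{p}=\Phi(X_{p-1},\xi_{p})$ with $\xi_{p}=(Z_{p},U_{p})$ on which the martingale decomposition relies.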

To test our variance reduction algorithm, we generate $100$ independent trajectories of length $n=1\times 10^{3}$ . We use $4$ Hermite polynomials in each coordinate (that is, $k_{1},k_{2}\in\{1,\dots,4\}$ ). We compare two cases: MAD-CV applied only to the proposal (44) (that is, $\mathbf{K}=(4,4,1)$ ), and MAD-CV applied jointly to the proposal and the acceptance step ( $\mathbf{K}=(4,4,20)$ ). Figure 5 displays the boxplots of the estimates together with the estimated standard deviations of the corresponding estimates $\widehat{\pi}^{(x,\mathbf{k})}_{n,n_{0}}$ for different truncations $n_{0}\in\{2,\dots,20\}$ . Note that combining Legendre and Hermite polynomials achieves better variance reduction than using Hermite polynomials alone.
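A product basis combining Hermite polynomials in the Gaussian noise with shifted Legendre polynomials in the uniform noise can be sketched as below. This is an assumption-laden illustration: it takes the basis to be $H_{k_{1}}(z_{1})H_{k_{2}}(z_{2})P_{k_{3}}(u)$, normalizes the shifted Legendre polynomials as $\sqrt{2k+1}\,P_{k}(2u-1)$ so that they are orthonormal on $[0,1]$, and indexes all polynomials from degree $0$, which may differ from the indexing convention of the paper. The function names are hypothetical.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermeval
from numpy.polynomial.legendre import leggauss, legval


def hermite_normalized(k, z):
    """He_k(z) / sqrt(k!), orthonormal under N(0, 1)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(z, coeffs) / math.sqrt(math.factorial(k))


def shifted_legendre(k, u):
    """sqrt(2k + 1) * P_k(2u - 1), orthonormal under Unif[0, 1]."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return math.sqrt(2 * k + 1) * legval(2.0 * np.asarray(u) - 1.0, coeffs)


def product_basis(k, z, u):
    """Basis function indexed by k = (k1, k2, k3) at noise values (z, u)."""
    return (
        hermite_normalized(k[0], z[0])
        * hermite_normalized(k[1], z[1])
        * shifted_legendre(k[2], u)
    )


# Gram matrix of the first five shifted Legendre polynomials on [0, 1].
t, w = leggauss(12)
u_nodes, u_weights = (t + 1.0) / 2.0, w / 2.0
gram = np.array(
    [
        [np.sum(u_weights * shifted_legendre(i, u_nodes) * shifted_legendre(j, u_nodes))
         for j in range(5)]
        for i in range(5)
    ]
)
```

Under these assumptions `gram` is numerically the identity matrix, which is the orthonormality that the expansion in the uniform component relies on.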

6.4 Euler scheme for discretized diffusion

We consider the $d-$ dimensional stochastic differential equation

[TABLE]

with the drift function

[TABLE]

We aim at estimating $\pi(f)$ for the functions $f(x)=\sum\limits_{i=1}^{d}x_{i}$ and $f(x)=\sum\limits_{i=1}^{d}x_{i}^{2}$ , where $\pi$ is an ergodic distribution of (46). We fix $d=5$ , $a=0.5$ , and consider the Euler-Maruyama discretization of (46) with constant stepsize $\gamma=0.1$ , and approximate $\pi(f)$ by $\pi^{x}_{n}(f)$ and its MAD-CV counterparts. Note that the assumptions of Theorem 2 are satisfied. We consider estimators $\pi_{n,n_{0}}^{(K)}(f)$ with $n_{0}=20$ and $K=1$ or $K=2$ . We refer to them as MAD-CV-1 and MAD-CV-2, respectively. First we sample a training trajectory of length $N=10^{4}$ , and solve the regression problem (32) with the class of regressors $\{x_{i},x_{j}x_{k}\}$ for $i,j,k\in\{1,\dots,d\}$ . To test our variance reduction algorithm, we generate $100$ independent test trajectories of length $5\times 10^{3}$ and compute $\pi^{x}_{n}(f)$ and $\pi^{(x,K)}_{n,n_{0}}(f)$ . The corresponding boxplots are presented in Figure 6.

6.5 Lotka-Volterra model with feedback control

We consider the stochastic Lotka-Volterra predator-prey model with feedback control, following Liu and Zhao [2019]:

[TABLE]

where $\mathsf{W}_{i,t}\,(i=1,2)$ denote independent Wiener processes, parameters $a_{i,i}>0$ correspond to intraspecific competition rates, $a_{i,j},i\neq j$ stand for capturing rates of the prey and predator, $r_{i}>0$ represent the intrinsic growth rate of the population and $\sigma_{i}^{2}>0$ . We consider Euler-Maruyama discretisation of the equation (47) with step size $\gamma=0.1$ , and fix the hyperparameters

[TABLE]

Note that the assumptions $(A_{1})$ and $(A_{2})$ from Liu and Zhao [2019], namely $a_{1,1}a_{2,2}>a_{2,1}a_{1,2}$ and $r_{1}/r_{2}>a_{1,1}/(a_{2,2}+ch/e)$ are satisfied, and the system (47) oscillates around its equilibrium point. We fix $f(x)=x_{1}$ or $f(x)=x_{2}$ , and aim at approximating $\pi(f)$ by $\pi^{x}_{n}(f)$ and its variance-reduced counterparts. We sample a training trajectory of length $N=5\times 10^{3}$ , and solve the regression problem (32) with the class of regressors $\{x_{i},x_{j}x_{k}\}$ for $i,j,k\in\{1,\dots,d\}$ . We set the truncation level $n_{0}=50$ and generate $100$ independent test trajectories of length $n=5\times 10^{3}$ . For each trajectory we compute $\pi^{x}_{n}(f)$ and its variance-reduced counterpart $\pi_{n,n_{0}}^{(x,1)}(f)$ . We show the simulated trajectories of the system (47) in Figure 8a, and the boxplots corresponding to $f(x)=x_{1}$ in Figure 8b.

6.6 Multidimensional stochastic Lotka-Volterra model

Following Mao et al. [2003], we consider Lotka–Volterra model for a system with $d$ interacting components, corresponding to the case of facultative mutualism, namely

[TABLE]

where

[TABLE]

Stochastically perturbing parameters $a_{i,j}$ , we come up with the system

[TABLE]

where $(\mathsf{W}_{i,t})_{t\geq 0}\,(i=1,\dots,d)$ denote independent Wiener processes and $\Sigma=(\sigma_{i,j})\in\mathbb{R}^{d\times d}$ is a matrix with $\sigma_{i,i}>0$ and $\sigma_{i,j}\geq 0,i\neq j$ . We consider the Euler-Maruyama discretisation of the equation (48) with constant step size $\gamma=0.02$ , and fix the hyperparameters

[TABLE]

We also set $\sigma_{i,i}=0.1,\,\sigma_{i,j}=0,\,i\neq j$ . Note that the conditions of [Mao et al., 2003, Theorem 1] are satisfied, and the system (48) has a unique positive solution. We fix $f(x)=x_{1}$ , and aim at approximating $\pi(f)$ by $\pi^{x}_{n}(f)$ and its variance-reduced counterparts. We sample a training trajectory of length $N=1\times 10^{4}$ , and solve the regression problem (32) with the class of regressors $\{x_{i},x_{j}x_{k}\}$ for $i,j,k\in\{1,\dots,d\}$ . We set the truncation level $n_{0}=50$ and generate $100$ independent test trajectories of length $n=5\times 10^{3}$ . For each trajectory we compute $\pi^{x}_{n}(f)$ and its variance-reduced counterpart $\pi_{n,n_{0}}^{(x,1)}(f)$ . We show the simulated trajectories of the system (48) in Figure 8a, and the boxplots corresponding to $f(x)=x_{1}$ in Figure 8b.

7 Proofs

7.1 Notations.

For multi-indices $\mathbf{k}=(k_{1},\dots,k_{d})$ and $\bm{\ell}=(l_{1},\dots,l_{d})\in\mathbb{N}_{0}^{d}$ , such that $|\mathbf{k}|>0$ , $|\bm{\ell}|>0$ , for $m\in\mathbb{N}$ and smooth function $f:\mathbb{R}^{d\times m}\rightarrow\mathbb{R}^{d}$ , such that

[TABLE]

we write

[TABLE]

We write $\operatorname{J}_{f}(z_{1:m})\in\mathbb{R}^{dm\times d}$ for the Jacobian of $f$ at point $z_{1:m}$ , and $\nabla_{z_{i}}f\in\mathbb{R}^{d\times d}$ for the matrix of partial derivatives $(\operatorname{J}_{f}^{(z_{i})})_{u,v}=\partial_{z_{i,u}}f_{v}(z_{1:m})$ . For the multi-indices $\mathbf{k},\bm{\ell}\in\mathbb{N}_{0}^{d}$ , we write that $\mathbf{k}\prec\bm{\ell}$ , if one of the following holds:

•

$|\mathbf{k}|<|\bm{\ell}|$ ;

•

$|\mathbf{k}|=|\bm{\ell}|$ and $\mathbf{k}_{1}<\bm{\ell}_{1}$ , or

•

$|\mathbf{k}|=|\bm{\ell}|$ , $\mathbf{k}_{1}=\bm{\ell}_{1}$ , …, $\mathbf{k}_{m}=\bm{\ell}_{m}$ , and $\mathbf{k}_{m+1}<\bm{\ell}_{m+1}$ for some $1\leq m<d$ .
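The ordering $\prec$ can be sketched as a comparison function. The sketch reads the second and third conditions as strict inequalities at the first coordinate where the two multi-indices differ, following the ordering of Constantine and Savits [1996]; the function name `precedes` is hypothetical.

```python
def precedes(k, l):
    """Return True if the multi-index k strictly precedes l.

    Compare total degrees first; on a tie, compare coordinates
    lexicographically from the first one.
    """
    if len(k) != len(l):
        raise ValueError("multi-indices must have the same dimension")
    if sum(k) != sum(l):
        return sum(k) < sum(l)
    for k_i, l_i in zip(k, l):
        if k_i != l_i:
            return k_i < l_i
    return False  # equal multi-indices are not strictly ordered


# Sorting by (total degree, coordinates) reproduces the same order.
indices = [(1, 1), (0, 2), (2, 0), (1, 0), (0, 1)]
ordered = sorted(indices, key=lambda k: (sum(k), k))
```

Because the relation is a total order on $\mathbb{N}_{0}^{d}$, a chain $0\prec\bm{\ell}_{1}\prec\dots\prec\bm{\ell}_{s}$ lists each set of distinct nonzero multi-indices exactly once.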

For the multi-indices $\mathbf{q},\mathbf{r}\in\mathbb{N}_{0}^{d}$ we define $P(\mathbf{q},\mathbf{r})$ the set of multi-indices $\mathbf{k}_{i},\bm{\ell}_{i}\in\mathbb{N}_{0}^{d}$ , such that for some $1\leq s\leq|\mathbf{q}|$ , $\mathbf{k}_{i}=0$ and $\bm{\ell}_{i}=0$ for $1\leq i\leq|\mathbf{q}|-s$ , $|\mathbf{k}_{i}|>0$ for $|\mathbf{q}|-s+1\leq i\leq|\mathbf{q}|$ and $0\prec\ell_{|\mathbf{q}|-s+1}\prec\dots\prec\ell_{|\mathbf{q}|}$ are such that

[TABLE]

7.2 Proof of Theorem 1

The expansion obviously holds for $q=1$ and $j=0$ . Indeed, since $\left(\phi_{k}\right)_{k\geq 0}$ is a complete orthonormal system in $\mathrm{L}^{2}(\mathbb{R}^{m},P_{\xi})$ , it holds in $\mathrm{L}^{2}(\mathbb{R}^{m},P_{\xi_{1}})$ that

[TABLE]

for any bounded $f$ with $a_{1,1,k}(x)=\mathsf{E}[f(X_{1}^{x})\phi_{k}(\xi_{1})]$ . Assume now that (9) holds for any $q\leq v$ , $j<q$ and bounded measurable functions $f$ . Let us prove that the induction assumption holds for $q=v+1$ and any $j<v+1$ . Denote for $n,k\in\mathbb{N}$ and $y\in\mathbb{R}^{d}$ ,

[TABLE]

The orthonormality and completeness of the system $\left(\phi_{k}\right)_{k=0}^{\infty}$ implies that

[TABLE]

The Parseval inequality implies that

[TABLE]

By construction, $X_{v+1}^{x}=X_{v,v+1}^{X_{v}^{x}}$ and $\int f(\Phi_{v}(X_{v}^{x},e_{v+1}))P_{\xi}(\mathrm{d}e_{v+1})=\mathsf{E}\left[\left.f(X_{v+1}^{x})\,\right|\mathcal{G}_{v}\right]$ $\mathsf{P}$ -a.s. Hence, using (53) and (54), we get that

[TABLE]

or equivalently

[TABLE]

in $\mathrm{L}^{2}(\mathbb{R}^{mq},P^{\otimes q}_{\xi})$ which is the required statement in the case $q=v+1$ and $j=v$ . Consider now the case $q=v+1$ and $j<v$ . Set $g(y)=\int f\circ\Phi_{v}(y,e_{v+1})P_{\xi}(\mathrm{d}e_{v+1})$ . Note that $\mathsf{P}$ -a.s. it holds $g(X_{v}^{x})=\mathsf{E}\left[\left.f(X_{v+1}^{x})\,\right|\mathcal{G}_{v}\right]$ and $g$ is bounded by construction. Hence, we may apply the induction hypothesis to the bounded measurable function $g$ , which implies

[TABLE]

with $a_{v+1,l,k}(y)=\mathsf{E}[g(X_{l-1,v}^{y})\phi_{k}(\xi_{l})]$ . Using that $g(X_{l-1,v}^{y})=\int f\circ\Phi_{v}(X_{l-1,v}^{y},e_{v+1})P_{\xi}(\mathrm{d}e_{v+1})$ , and $\Phi_{v}(G_{l-1,v}(y,e_{l},\dots,e_{v}),e_{v+1})=G_{l-1,v+1}(y,e_{l},\dots,e_{v+1})$ , we get

[TABLE]

Eqs. (55) and (56) conclude the induction step for $q=v+1$ and all $j<v+1$ and hence the proof.

7.3 Proof of the lemma of Section 5

Applying the integration by parts in vector form (below $\prod_{j=l+1}^{p}:=1$ whenever $l\geq p$ ),

[TABLE]

The last expression yields the result.

7.4 Proof of the proposition of Section 5

For multi-indices $\mathbf{k},\mathbf{k}^{\prime}\in\mathbb{N}_{0}^{d}$ with $\mathbf{k}^{\prime}\leq\mathbf{k}$ componentwise and $\mathbf{k}^{\prime}\neq\mathbf{k}$ , $\|\mathbf{k}^{\prime}\|\leq K$ , we obtain from Lemma 5, that for $q\in\mathbb{N}$ ,

[TABLE]

where $A_{q,\mathbf{k}}$ is defined in (14). Given $\mathbf{k}\in\mathbb{N}_{0}^{d}$ , by taking $\mathbf{k}^{\prime}=\mathbf{k}^{\prime}(\mathbf{k})=K(\mathbbm{1}_{\{k_{1}>K\}}\,\ldots,\mathbbm{1}_{\{k_{d}>K\}})$ , we get

[TABLE]

where for any two multi-indices $\mathbf{r},$ $\mathbf{q}$ from $\mathbb{N}_{0}^{d}$ we have defined

[TABLE]

In (57) the first sum runs over all nonempty subsets $I$ of the set $\{1,\ldots,d\}.$ For any subset $I,$ $\mathbb{N}_{I}^{d}$ stands for a set of multi-indices $\mathbf{m}_{I}$ with elements $m_{i}=0,$ $i\not\in I,$ and $m_{i}\in\mathbb{N},$ $i\in I.$ Moreover, $I^{c}=\{1,\ldots,d\}\setminus I$ and $\mathbb{N}^{d}_{0,I^{c}}$ stands for a set of multi-indices $\mathbf{m}_{I^{c}}$ with elements $m_{i}=0,$ $i\in I,$ and $m_{i}\in\mathbb{N}_{0},$ $i\not\in I$ . Applying the estimate

[TABLE]

we get

[TABLE]

The Parseval identity implies that for any function $\varphi:\mathbb{R}^{d}\to\mathbb{R}$ satisfying $\mathsf{E}[\varphi^{2}(Z_{1})]<\infty$ ,

[TABLE]

Using this identity in (58) implies

[TABLE]

The sum $\sum_{p=1}^{q}\partial_{z_{1}}^{\mathbf{K}_{I}}f(X_{p}^{x})$ is a function of $x,Z_{1:q}$ : $\sum_{p=1}^{q}\partial_{z_{1}}^{\mathbf{K}_{I}}f(X_{p}^{x})=F(x,Z_{1:q})$ . By the Gaussian Poincaré inequality Boucheron et al. [2013], we have

[TABLE]

where $\nabla_{z}F=(\nabla_{z_{1}}F,\ldots,\nabla_{z_{q}}F)$ and $\nabla_{z_{j}}$ is defined in (38). Hence,

[TABLE]

Note that $\nabla_{z_{j}}\partial_{z_{1}}^{\mathbf{K}_{I}}f(X_{p}^{x})=0$ for $p<j$ . Together with (59) this implies the statement (39).

7.5 Proof of Theorem 2

Recall that, due to (19), for $K\in\mathbb{N}$ ,

[TABLE]

By the proposition of Section 5, for fixed $q\in\mathbb{N}$ , and any $x\in\mathbb{R}^{d}$ ,

[TABLE]

Now we fix $p$ and $j$ in $\{1,\ldots,q\}$ , such that $p\geq j$ , and a non-empty subset $I\subseteq\{1,\ldots,d\}$ . By the multivariate Faà di Bruno’s formula Constantine and Savits [1996],

[TABLE]

where the set $P(\mathbf{K}_{I},\mathbf{r})$ is defined in Section 7.1. Hence, we obtain

[TABLE]

Using the bounds of the lemmas of Section 7.6, we obtain

[TABLE]

with a suitable constant $A_{K|I|}$ . Substituting into (61), we obtain

[TABLE]

with a constant $B_{K}$ not depending on $\gamma$ and $n$ . Hence, due to (60), we obtain

[TABLE]

For the truncated estimate $\pi_{n,n_{0}}^{(K)}(f)$ , we obtain using (29) that

[TABLE]

Let us bound the quantity $\sum_{1\leq\|\mathbf{k}\|\leq K}\bigl{[}\sum_{r=n_{0}+1}^{n-l}\bar{a}_{r,\mathbf{k}}(y)\bigr{]}^{2}$ for $y\in\mathbb{R}$ . First, let us show that for any $r\in\mathbb{N}$ , any $x,y\in\mathbb{R}$ ,

[TABLE]

Indeed, it holds

[TABLE]

provided that $\gamma\in(0,m/\mathrm{L}^{2})$ . Then

[TABLE]

and (64) follows. Note that, by the definition of $\bar{a}_{r,\mathbf{k}}$ ,

[TABLE]

The Parseval identity implies that

[TABLE]

Hence, selecting $n_{0}=n_{0}(\gamma)$ such that $(1-\gamma m)^{2n_{0}-2}\leq\gamma^{K-1}$ for $\gamma\searrow 0_{+}$ , we obtain using (63) that

[TABLE]

7.6 Auxiliary lemmas

Lemma.

Under the assumptions of Theorem 2, for any multi-index $\mathbf{k}\in\mathbb{N}_{0}^{d}$ with $|\mathbf{k}|\geq 1$ , any $j\geq 1$ , and $p>j$ , it holds

[TABLE]

with constant $C_{|\mathbf{k}|}$ not depending on $\gamma,j,$ and $p$ . For $p\leq j$ , it holds $\nabla_{z_{j}}\partial_{z_{1}}^{\mathbf{k}}X^{x}_{p}=0$ .

Proof.

In the proof we use the notation

[TABLE]

For $v\in\{1,\dots,d\}$ , we also write $\mathbf{e}_{v}$ for $v-$ th coordinate basis vector, that is, $\mathbf{e}_{v}\in\mathbb{N}_{0}^{d}$ and $(\mathbf{e}_{v})_{i}=\mathbbm{1}_{\{i\}}(v)$ for $i\in\{1,\dots,d\}$ .

We preface the proof with some elementary but useful identities. For any multi-index $\mathbf{k}$ with $|\mathbf{k}|=1$ , any $i<p$ , it holds

[TABLE]

Since $X_{p}^{x}=G_{p}(x,Z_{1:p})$ , obviously $\nabla_{z_{j}}\partial_{z_{1}}^{\mathbf{k}}X^{x}_{p}=0$ for $p<j$ . For $p=j$ , the statement of the lemma follows from (68). Now we consider the case $p>j$ . Fix $j\in\mathbb{N}$ and prove (65) for all $p>j$ by induction in $|\mathbf{k}|$ . We start from $|\mathbf{k}|=1$ . Note that for a given index $j$ and $u,v\in\{1,\dots,d\}$ , the relation (67) implies

[TABLE]

Hence, we can write that

[TABLE]

where the matrix $H(X^{x}_{p-1})\in\mathbb{R}^{d\times d}$ , with the entries

[TABLE]

The recurrence (69) implies that

[TABLE]

To bound $\left\|H(X^{x}_{p-1})\right\|_{\mathrm{F}}$ , we observe

[TABLE]

Hence, using (68) and (70), we obtain

[TABLE]

Now we can apply Section 7.6 to bound $\left\|\nabla_{z_{j}}\partial^{\mathbf{k}}_{z_{1}}X^{x}_{p}\right\|_{\mathrm{F}}$ with $\bar{C}_{1}=\gamma^{2}d^{2}C_{b}\prod_{k=2}^{j}\alpha_{k}$ , and, using Section 7.6, we obtain for all $\gamma\in(0,m/\mathrm{L}^{2})$ , that

[TABLE]

which imply (65) for any multi-index $\mathbf{k}$ with $|\mathbf{k}|=1$ with the constant $C_{1}=2C_{b}d^{2}/m$ .

The induction hypothesis is therefore that the inequality

[TABLE]

holds for all multi-indices $\mathbf{q}$ with $|\mathbf{q}|<r\leq Kd$ and $p>j$ . We need to show (71) for all multi-indices $\mathbf{q}$ with $|\mathbf{q}|=r$ . The multivariate Faà di Bruno’s formula Constantine and Savits [1996] implies for $|\mathbf{q}|\geq 2$ , $p>1$ and $u\in\{1,\dots,d\}$ , that

[TABLE]

Here $\bigl[\partial_{z_{1}}^{\bm{\ell}_{i}}X_{p-1}^{x}\bigr]^{\mathbf{k}_{i}}$ is defined in (49), and the summation is taken over the set $P(\mathbf{q},\mathbf{r})$ of multi-indices $\mathbf{k}_{i},\bm{\ell}_{i}\in\mathbb{N}_{0}^{d}$ such that, for some $1\leq s\leq|\mathbf{q}|$, $\mathbf{k}_{i}=0$ and $\bm{\ell}_{i}=0$ for $1\leq i\leq|\mathbf{q}|-s$, $|\mathbf{k}_{i}|>0$ for $|\mathbf{q}|-s+1\leq i\leq|\mathbf{q}|$, and $0\prec\bm{\ell}_{|\mathbf{q}|-s+1}\prec\dots\prec\bm{\ell}_{|\mathbf{q}|}$ are such that

[TABLE]

From the equation (72), taking the terms with $|\mathbf{r}|=1$ out and using the fact that $(X_{p-1}^{x})^{(\mathbf{r})}=0$ for any multi-index $\mathbf{r}$ with $|\mathbf{r}|\geq 2$ , we have

[TABLE]

For $p>j$ and fixed $v\in\{1,\dots,d\}$ , we then have

[TABLE]

with

[TABLE]

Furthermore,

[TABLE]

Note that the condition (73) implies that

[TABLE]

With the induction hypothesis (71), we bound $|\epsilon_{j,p}|$ as follows

[TABLE]

Due to [Constantine and Savits, 1996, Corollary 2.9],

[TABLE]

where $S_{|\mathbf{q}|}^{l}$ is a Stirling number of the second kind (see Constantine [1987]). Hence, we can bound

[TABLE]

with some constant $\mathrm{const}$ depending on $d,C_{b},|\mathbf{q}|,C_{1},\ldots,C_{|\mathbf{q}|-1}$ . Thus, (74) and (75) imply

[TABLE]

We can again apply the lemma on recursive inequalities and the corollary of Section 7.6 to bound $\left\|\nabla_{z_{j}}\partial_{z_{1}}^{\mathbf{q}}X^{x}_{p}\right\|_{\mathrm{F}}$, and obtain (71) for all multi-indices $\mathbf{q}$ with $|\mathbf{q}|=r$. This concludes the proof. ∎

Lemma.

Let $(x_{p})_{p\in\mathbb{N}_{0}}$ and $(\epsilon_{p})_{p\in\mathbb{N}}$ be sequences of nonnegative real numbers with $x_{0}=0$ , satisfying

[TABLE]

for any $p\in\mathbb{N}$, where $\bar{C}_{1}$ is some nonnegative constant. Then

[TABLE]

Proof.

Applying (76) recursively, we get $x_{p}\leq\sum_{r=1}^{p}\epsilon_{r}\prod_{k=r+1}^{p}\alpha_{k}$ where we use the convention $\prod_{k=p+1}^{p}:=1$ . The proof is completed by using an upper bound on $\epsilon_{p}$ . ∎
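The unrolling step in this proof can be checked numerically. The sketch below assumes that (76) has the form $x_{p}\leq\alpha_{p}x_{p-1}+\epsilon_{p}$ (the display is not reproduced above, so this form is an assumption inferred from the unrolled bound), takes the extremal case of equality, and compares the recursion with the closed-form sum. The function names and the random test data are illustrative only.

```python
import random


def recursion(alpha, eps):
    """Iterate x_p = alpha_p * x_{p-1} + eps_p from x_0 = 0.

    This is the equality case of the assumed form of (76).
    """
    x = 0.0
    for a, e in zip(alpha, eps):
        x = a * x + e
    return x


def unrolled(alpha, eps):
    """Closed form sum_{r=1}^p eps_r * prod_{k=r+1}^p alpha_k (empty product = 1)."""
    p = len(eps)
    total = 0.0
    for r in range(p):              # r is the 0-based index of eps_{r+1}
        prod = 1.0
        for k in range(r + 1, p):   # 0-based indices of alpha_{r+2}, ..., alpha_p
            prod *= alpha[k]
        total += eps[r] * prod
    return total


random.seed(0)
alpha = [random.uniform(0.0, 1.0) for _ in range(20)]
eps = [random.uniform(0.0, 1.0) for _ in range(20)]

# Difference between the iterated recursion and the unrolled sum.
gap = abs(recursion(alpha, eps) - unrolled(alpha, eps))
```

In the equality case the two quantities coincide up to rounding; under the inequality (76) with nonnegative $\alpha_{k}$, the unrolled sum is therefore an upper bound on $x_{p}$.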

Lemma.

Assume that there exists $m>0$ such that for any $x\in\mathbb{R}^{d}$, $x^{\top}Ax\geq m\|x\|^{2}$. Then for any $\gamma\in(0,m/\|A\|^{2})$, it holds

[TABLE]

Proof.

Note that for $\gamma\in(0,m/\|A\|^{2})$ ,

[TABLE]

∎
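The mechanism behind this lemma is the expansion $\|(\mathrm{I}-\gamma A)x\|^{2}=\|x\|^{2}-2\gamma x^{\top}Ax+\gamma^{2}\|Ax\|^{2}\leq(1-2\gamma m+\gamma^{2}\|A\|^{2})\|x\|^{2}$, which is at most $(1-\gamma m)\|x\|^{2}$ whenever $\gamma\|A\|^{2}\leq m$. The displayed bound of the lemma is not reproduced above, so the inequality checked below is a consequence of the hypotheses rather than a quotation of the statement. The sketch uses a hypothetical non-symmetric $2\times 2$ matrix and replaces the operator norm by the larger Frobenius norm, which only shrinks the admissible range of $\gamma$.

```python
import math
import random


def matvec(A, x):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(xi * xi for xi in x)


# Hypothetical example: diagonal part plus a skew-symmetric part, so A is not symmetric.
a, c, b = 2.0, 3.0, 1.5
A = [[a, b], [-b, c]]
m = min(a, c)                                    # x^T A x = a*x1^2 + c*x2^2 >= m*||x||^2
frob_sq = sum(v * v for row in A for v in row)   # ||A||_F^2 >= ||A||^2 (operator norm)
gamma = 0.9 * m / frob_sq                        # hence gamma < m / ||A||^2

random.seed(0)
worst = -math.inf
for _ in range(1000):
    x = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
    Ax = matvec(A, x)
    y = [xi - gamma * axi for xi, axi in zip(x, Ax)]  # (I - gamma*A) x
    # Track the largest observed ratio ||(I - gamma*A)x||^2 / ||x||^2.
    worst = max(worst, sq_norm(y) / sq_norm(x))

bound = 1.0 - gamma * m
```

Over the sampled directions, `worst` stays below `bound`, in line with the contraction that the recursion for $x_{p}$ relies on.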

Corollary.

Under the assumptions of Section 5, for all $\gamma\in(0,m/\mathrm{L}^{2})$ , it holds

[TABLE]

Acknowledgement

The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139.

Bibliography

Assaraf and Caffarel [1999] R. Assaraf and M. Caffarel. Zero-variance principle for Monte Carlo algorithms. Physical Review Letters, 83(23):4682, 1999.
Belomestny et al. [2018] D. Belomestny, S. Häfner, and M. Urusov. Variance reduction for discretised diffusions via regression. Journal of Mathematical Analysis and Applications, 458:393–418, 2018.
Belomestny et al. [2020] D. Belomestny, L. Iosipoi, E. Moulines, A. Naumov, and S. Samsonov. Variance reduction for Markov chains with application to MCMC. Statistics and Computing, 30(4):973–997, 2020. doi: 10.1007/s11222-020-09931-z. URL https://doi.org/10.1007/s11222-020-09931-z.
Ben Zineb and Gobet [2013] T. Ben Zineb and E. Gobet. Preliminary control variates to improve empirical regression methods. Monte Carlo Methods Appl., 19(4):331–354, 2013. ISSN 0929-9629. doi: 10.1515/mcma-2013-0015. URL https://doi.org/10.1515/mcma-2013-0015.
Bortoli and Durmus [2020] V. D. Bortoli and A. Durmus. Convergence of diffusions and their discretizations: from continuous to discrete processes and back. 2020.
Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
Brosse et al. [2018] N. Brosse, A. Durmus, S. Meyn, and E. Moulines. Diffusion approximations and control variates for MCMC. arXiv preprint arXiv:1808.01665, 2018.
Constantine [1987] G. M. Constantine. Combinatorial Theory and Statistical Design. Wiley, New York, 1987.
