Unbiased Multi-index Monte Carlo
Dan Crisan, Pierre Del Moral, Jeremie Houssineau, Ajay Jasra

TL;DR
This paper presents a novel unbiased multi-index Monte Carlo method that reduces computational effort and removes bias in expectation approximations of discretized random variables, with applications in PDE solutions and Bayesian inference.
Contribution
It introduces an unbiased multi-index Monte Carlo approach that improves efficiency and removes bias compared to traditional discretization sampling methods.
Findings
Reduced computational effort for a given error level.
Successfully applied to PDE solutions with random coefficients.
Effective in Bayesian inference for stochastic PDEs.
Abstract
We introduce a new class of Monte Carlo based approximations of expectations of random variables such that their laws are only available via certain discretizations. Sampling from the discretized versions of these laws can typically introduce a bias. In this paper, we show how to remove that bias, by introducing a new version of multi-index Monte Carlo (MIMC) that has the added advantage of reducing the computational effort, relative to i.i.d. sampling from the most precise discretization, for a given level of error. We cover extensions of results regarding variance and optimality criteria for the new approach. We apply the methodology to the problem of computing an unbiased mollified version of the solution of a partial differential equation with random coefficients. A second application concerns the Bayesian inference (the smoothing problem) of an infinite dimensional signal modelled…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsProbabilistic and Robust Engineering Design · Mathematical Approximation and Integration · Statistical Methods and Inference
\newsiamremark
exampleExample \newsiamremarkremarkRemark
\headersUnbiased Multi-index Monte CarloD. Crisan, P. Del Moral, J. Houssineau and A. Jasra
Unbiased Multi-index Monte Carlo
Dan Crisan Department of Mathematics, Imperial College London, London, SW7 2AZ, UK.
Pierre Del Moral INRIA Bordeaux, 200 Avenue de la Vieille Tour, 33405 Talence, FR.
Jeremie Houssineau Department of Statistics and Applied Probability, National University of Singapore. Email: [email protected]
Ajay Jasra Department of Statistics and Applied Probability, National University of Singapore.
Abstract
We introduce a new class of Monte Carlo based approximations of expectations of random variables such that their laws are only available via certain discretizations. Sampling from the discretized versions of these laws can typically introduce a bias. In this paper, we show how to remove that bias, by introducing a new version of multi-index Monte Carlo (MIMC) that has the added advantage of reducing the computational effort, relative to i.i.d. sampling from the most precise discretization, for a given level of error. We cover extensions of results regarding variance and optimality criteria for the new approach. We apply the methodology to the problem of computing an unbiased mollified version of the solution of a partial differential equation with random coefficients. A second application concerns the Bayesian inference (the smoothing problem) of an infinite dimensional signal modelled by the solution of a stochastic partial differential equation that is observed on a discrete space grid and at discrete times. Both applications are complemented by numerical simulations.
keywords:
Monte Carlo, Multi-index discretization, Unbiasedness, Smoothing.
1 Introduction
Following Kalman’s seminal paper [11], the theory and applications of filtering, smoothing and prediction has expanded rapidly and massively and is as strong today as it was seventy years ago. The area has evolved far beyond the original linear/Gaussian framework, many current applications involving nonlinear signals (possibly high dimensional or even infinite dimensional) and nonlinear observations (see [4] for some of the recent developments). General filtering, smoothing and prediction problems no longer have explicit solutions: inference is achieved mostly through numerical methods. Methods based on Monte Carlo integration are very popular, particularly for high dimensional problems.
Monte Carlo methods have proved to be sensible in a variety of applications. However, in certain contexts, multilevel Monte Carlo (MLMC) integration has been introduced with success, see e.g. [5, 6]. This method is useful when one has a probability law associated to a discretization in a single dimension, for instance time. It can reduce the computational effort, relative to i.i.d. sampling (Monte Carlo) from the most precise discretization, for a given level of error. This has been extended in a subtle manner to discretizations in a multiple dimensions, for instance space and time, by [8]. To be more precise let us assume that we are given a probability measure defined on a measure space and a collection of bounded-measurable real-valued functions . We seek to compute for
[TABLE]
where is a random variable with law . We assume that the random variable is associated to a -dimensional multi-index continuum system, such as the one described in [8]. For instance, may be associated to the solution of a stochastic partial differential equation (SPDE). Such systems are found in a wide variety of applications, see [1] for examples. Alternatively can be the solution of a filtering/smoothing problem as we will explain later on.
Whilst the probability is not available to us, we have access to a class of biased approximations where is the set on multi-indices of length with integer non-negative entries. For instance, by using Monte Carlo integration, one can then compute where being a random variable with law and even though we have a bias, i.e., we will assume that
[TABLE]
where the limit is understood as and, naturally, that the computational cost associated with increases as the values of increase.
As an example of the above general context, consider discretely observing data associated to a signal modelled by the solution of a stochastic partial differential equation (a concrete example can be found in Section 4.3). Suppose that data is obtained at unit time interval and is denoted by . We assume that, conditional upon , are conditionally independent with density on a finite dimensional space with the solution of the SPDE at time . Moreover we assume is well defined for any , even if is discretized. Let be the transition density of the SPDE under a time and space discretization corresponding to a multi-index . Given observed data our objective is to compute expectations with respect to the following distribution (known as a smoother or smoothing distribution for ):
[TABLE]
where we assume that is given and we used the notation and . If the class of discretizations is chosen so that Eq. 1 holds, then the methodology developed in this paper can be applied to solve this problem. It is worth noting that, integrating with respect to is non-trivial and well-known to be challenging. See, for instance, the work of [3, 12] and the references therein for more details.
1.1 Contribution and Structure
In the context described above, it is well-known that the Monte Carlo approximation of , which we assume is required, can be significantly enhanced through the use of the MIMC method of [8]. This idea is intrinsically linked to the popular MLMC approach. In this latter approach the dimensionality of the index is 1. Briefly writing the indices , ( being the finest discretization and 1 the coarsest and the discretization becomes more and more fine from 1) we have
[TABLE]
that is, introducing a collapsing sum representation of the expectation w.r.t. the finest discretization. The idea is then to dependently sample a pair of approximations (i.e to dependently couple them) independently for . Note the case is just i.i.d. sampling from . If the dependence in the coupling is sufficiently positively correlated, then it is possible to sample fewer simulations at the fine discretizations (which are expensive to sample) and more at the coarse discretizations in such a way that the cost associated to obtain a prescribed mean square error of the MLMC approximation is less than that of i.i.d. sampling from the finest discretization. In the MIMC context, it is a more challenging procedure, but similar reductions in computational cost can also be possible.
A randomized version of the MLMC approach has been developed in [13], which removed the discretization bias. In this work, we show how one can extend this idea in the context where the discretization parameters are in multiple dimensions. This extension allows for a judicious allocation of the computational effort in order to take into account the variance of the target distribution discretization in separate dimensions. In particular, Monte Carlo approximations are constructed to entirely remove the discretization bias, that is, to approximate directly. We also analyze the variance of the methodology and propose several original optimality criteria for its implementation. Several simulated examples are considered.
This article is structured as follows. In Section 2 we give some notations, the approach and some preliminary results. In Section 3 our main theoretical results and the corresponding proofs are given. In Section 4 the new methodology is illustrated by numerical examples. Section 5 summarizes our work, with a discussion of future work.
2 Notation and Preliminary Results
Throughout the article, a complete probability space is considered with denoting the expectation with respect to and denoting the random variable on defined as the indicator of the event .
We work on the lattice for some equipped with the natural partial order which is defined as if and only if for all . Note that different total orders can also be defined on this -dimensional lattice, such as the lexicographical order, but these total orders are to some extent arbitrary and will not be directly useful in the context we consider in this paper. For the sake of simplicity, and will denote the random variables and respectively, for any . Let such that and consider
[TABLE]
By a usual abuse of notation, is used instead of if the superscript verifies for all , and similarly for the subscript. It holds that .
An estimator of is defined as
[TABLE]
where is a random variable on independent of which guarantees that the estimator is unbiased, i.e. it ensures that holds, and where with
[TABLE]
for any where is the element of such that and if . The order of the application of the operators in the definition of is irrelevant since these operators can be easily seen to commute. The choice of a vector-valued random variable in Eq. 2 is justified by the fact that there might be interest in calculating the sum up to a non-diagonal index. For instance, relying on the increments with some very high and some very low components might yield an estimator with low variance at a reasonable computational cost. The following lemma will be useful to prove that the estimator is unbiased.
Lemma 2.1**.**
The increment can be rewritten for any as
[TABLE]
*with defined as the 1-norm on . *
Proof 2.2**.**
This result can be proved by recurrence on the dimension . The case is obvious. If Eq. 3 is assumed to hold for a given then for any ,
[TABLE]
*where , hence showing that the relation is true for the dimension . *
It follows from Lemma 2.1 that a given term is going to appear exactly once in each of the increments with , that is times, negatively in of the increments and positively in the other increments, therefore cancelling in . However, this does not take into account cases where the condition is not satisfied and is only valid for terms for which all the are included in the considered sum.
Lemma 2.3**.**
For any such that , it holds that
[TABLE]
*where is the number of components of equal to . *
Proof 2.4**.**
From Lemma 2.1, it holds that
[TABLE]
The inner sum on the r.h.s. can be written as
[TABLE]
for any . Denoting the cardinality of a set , it follows that the terms corresponding to a non-empty subset of appear times, times positively and times negatively, therefore cancelling out. The remaining terms are the ones corresponding to since they only appear time. It follows that
[TABLE]
where is characterised by for any and any . The result of the lemma follows by a change of variable in the sum on the r.h.s. and by verifying that
[TABLE]
holds for any , so that \big{|}t^{-1}_{k,n}(\bm{\alpha})\big{|}=\ell_{k}(\bm{\alpha}).
3 Main Theoretical Results
We now consider several theoretical results for our approach, which justify its practical implementation.
3.1 Unbiasedness
Lemma 2.3 still does not apply to sums of increments containing indices that do not verify . Removing this last restriction leads in the following theorem.
Theorem 3.1**.**
*The estimator is unbiased. *
Proof 3.2**.**
Let be a partial version of the estimator defined as
[TABLE]
Because of the independence between and the , the estimator satisfies
[TABLE]
which can be further expressed as
[TABLE]
where, defining and denoting , the operator is defined as . In the case , the inner sum is equal to . Using Lemma 2.3, it follows that
[TABLE]
where, considering as a function from to , denotes the restriction of to the set . Therefore, for any , the term appears once in the inner sum whenever the support of is included in . Denoting , it holds that if then appears times, positively if is even and negatively if is odd. It follows that
[TABLE]
*since the binomial formula in the first line differs from zero only when verifies , that is when . The desired result follows by taking the limit under the condition stated in Eq. 1. *
3.2 Variance of the unbiased estimator
Being able to determine the variance of the unbiased estimator will be important when looking for an optimal distribution for the random variable . We give expressions of the variance of as well as for useful special cases. An additional notation is required for the statement of the following proposition: denotes the component-wise maximum of any and in , that is
[TABLE]
Proposition 3.3**.**
Assuming that
[TABLE]
the second moment of exists and is found to be
[TABLE]
*where . *
Remark 3.4**.**
*The condition stated in Eq. 6 will hold if the probability decreases sufficiently slowly when compared with the discretization error . For instance, when solving a partial differential equation, the tail of the distribution of should not be smaller than the decay of the error associated with the refinement of the mesh. *
Proof 3.5**.**
In order to study the variance of the estimator, consider
[TABLE]
where is the -norm. For the same reasons as before, it can be verified that holds by adding and subtracting times the random variable . It follows that
[TABLE]
where the inequality , which holds for any , has been used. Assuming that
[TABLE]
*it follows that can be made arbitrarily small by considering large enough, so that is a Cauchy sequence which therefore converges in the Hilbert space , so that the second moment of is finite. This completes the proof of the proposition. *
Remark 3.6**.**
*By considering a total order on such as the lexicographical order, dual sums over and in some given subset of could be split into diagonal and non-diagonal elements, the latter being simplified to terms verifying . However, this would not allow for simplifying the indicator function since does not imply that in general. *
Remark 3.7**.**
The case has been studied in [13, Theorem 1] and yields an expression of the second moment which simplifies drastically and which can be expressed as a single sum as
[TABLE]
where and are now integers. The condition for the existence of reduces to
[TABLE]
*for , which is much simpler to verify than Eq. 6. *
The random variable can be chosen in such a way as to simplify Eq. 7: If the components of are assumed to be independent random variables, then
[TABLE]
However, this expression of the second moment is still more complicated than in the case detailed in Remark 3.7 as it involves a double sum. Yet, another special case of the estimator can be obtained by assuming that is a constant function, i.e. that almost surely for any . This estimator will be denoted and is expressed as follows
[TABLE]
where is the integer-valued random variable induced by and where is the supremum norm. Since is defined as the estimator for a special choice for , it is also unbiased.
Proposition 3.8**.**
If it exists, the second moment of the estimator takes the form
[TABLE]
*with \nu^{\prime}_{\bm{\alpha}}\doteq\mathbb{E}\big{[}\Delta S_{\bm{\alpha}}\big{(}(S-S_{|\bm{\alpha}|_{\infty}-1})+(S-S_{|\bm{\alpha}|_{\infty}})\big{)}\big{]}. *
Remark 3.9**.**
The expression of the second moment is closer to the one obtained in Remark 3.7 for the case . This is natural since the simplification from to amounts to making single-variate, so that only the terms retain their multi-index nature. The expression of for can be recovered easily as
[TABLE]
Proof 3.10**.**
As in the proof of Proposition 3.3, a partial version of can be introduced as
[TABLE]
It holds that
[TABLE]
where the following relations have been used with :
[TABLE]
and
[TABLE]
*The desired result is obtained by rearranging the terms and taking the limit . *
We note that if one produces independent realizations of then Proposition 3.8 can be used to obtain a specific variance. That is, for specific models (see e.g. [8]) one has expressions for , appropriately centered, in terms of a function which goes to zero as such that . Then, for some , one can choose the number of samples to make the variance as in the MLMC/MIMC literature [5, 8].
Following [13], a variant of the estimator can be introduced as follows
[TABLE]
where the random variable is defined for any as
[TABLE]
with the joint random variable having the same marginal distributions as the joint , making unbiased. The estimators and can be respectively referred to as the coupled-sum estimator and the independent-sum estimator. A simpler version of the estimator can be introduced as previously for by assuming that realisations of are constant on almost surely:
[TABLE]
Proposition 3.11**.**
If it exists, the second-moment of the estimator is found to be
[TABLE]
*with \tilde{\nu}^{\prime}_{\bm{\alpha}}\doteq\operatorname{\textbf{var}}(\Delta S_{\bm{\alpha}})+\mathbb{E}\Delta S_{\bm{\alpha}}\big{(}(\mathbb{E}S-\mathbb{E}S_{|\bm{\alpha}|_{\infty}-1})+(\mathbb{E}S-\mathbb{E}S_{|\bm{\alpha}|_{\infty}})\big{)}. *
Proof 3.12**.**
The partial version of the estimator is introduced as before with and verifies
[TABLE]
[TABLE]
*from which the result of the proposition follows. *
3.3 Optimal distribution for
Since is a design random variable, its distribution can be chosen in a way that maximises the performance of the corresponding MIMC method in a sense to be defined. The objective is to optimise jointly the computational effort and the accuracy of the algorithm. The former is quantified by the time necessary to compute one instance of while the latter is represented by the variance of . Consider an arbitrary total order on that is compatible with , i.e. such that implies for any , and define as the time to compute the terms in that have not already been computed for previous with . Let be the time necessary to compute , then
[TABLE]
For a given duration , let be the number of copies of that can be generated in this amount of time. The sample average defined as
[TABLE]
can then be used to formulate a CLT when and are finite as [7]
[TABLE]
where , where is the standard Gaussian distribution and where denotes the convergence in probability when .
Remark 3.13**.**
For instance, the computational time for the term can be assumed to be equal to . This assumption makes sense in many cases including the ones considered here in the numerical results where partial differential equations are solved on meshes with elements. If we consider the independent sum-estimator then is the computational time for the whole of which verifies
[TABLE]
Then, the expected computational time will be finite if the probability is of order with . However, for Eq. 6 to hold, the tail of the distribution of also needs to be sufficiently large. For instance, if for some related to the considered numerical scheme, then
[TABLE]
*and also have to verify since for any . The condition Eq. 6 can be weakened for special cases to allow for more freedom in the choice of . *
Equation 13 indicates that the distribution of the random variable can be chosen in a way to make the product between the expected computational time and the variance as small as possible. The following problem is therefore considered:
[TABLE]
The solution to Eq. 14 is difficult to formulate in general, however, the special case of the estimator yields the simpler problem:
[TABLE]
subject to Eqs. 14b to 14d and
[TABLE]
By a direct generalisation of [13, Proposition 1], it holds that if the net , defined as and for any , is non-negative then the following inequality holds
[TABLE]
with characterised by
[TABLE]
for any . However, might not be feasible and the solution over the feasible region is denoted ; by [13, Proposition 2], this minimum is achieved if is positive and is bounded below by a positive constant.
Due to the constraint Eq. 15, the probability mass function induced by can be characterised by its values on the diagonal of and we consider the sequence of indices such that and
[TABLE]
where is a shorthand notation for with . Denoting, for any strictly increasing integer-valued sequence ,
[TABLE]
it follows that
[TABLE]
Extending the results of [13, Theorem 3] to the considered setting, it holds that if is a positive net and is non-decreasing w.r.t. , then there exists an optimiser inducing a sequence such that
[TABLE]
where is the unique integer verifying . It follows that
[TABLE]
These expressions are the same for unbiased MLMC and the considered instance of unbiased MIMC so that [13, Algorithm 1] can be used to find the desired optimal sequence .
4 Numerical Results
In order to evaluate the performance of the proposed unbiased MIMC (UMIMC) method, a comparison with the MIMC algorithm of [8] is performed on two different problems. The first is covered in Section 4.2 and comprises computing a mollified version of the solution of a partial differential equation with random coefficients. The second application is an inference problem for a partially observed signal modelled by an SPDE on a 1-dimensional domain in Section 4.3. We begin by giving some implementation details for the UMIMC method.
4.1 Implementation
In this section as well as in the numerical results, we make use of the simplified version of the independent-sum estimator . Since, in practice, realisations of can only be computed up to a certain level, the partial estimator defined as
[TABLE]
has to be considered instead, for a given . In order to accurately calculate the optimal sequence yielding the optimal distribution for as described in Section 3.3, a high number of realisations of has to be computed for any . This fact implies that the computational effort required to start estimating from is high. To bypass this limitation and reduce the number of realisations of computed before calculating , the latter is updated frequently. One of the consequences of this solution is that the probabilities will vary through the iterations. This drawback can be compensated for by dividing the increment by the number of times it has been sampled, instead of the probability of it being sampled. In spite of their difference, these two normalisations are equivalent asymptotically, by an easy application of the strong law of large numbers. Note that with the adaptations, one must do some more work to verify the consistency of the estimator, but this is left for future work. The estimation thus takes the following form in practice:
[TABLE]
where is the total number of iterations, where is the sampled increment and where is a sample from at the th iteration.
4.2 A Partial Differential Equation with random coefficients
We consider here a partial differential equation with random coefficients of the form
[TABLE]
Similarly to [8], the diffusion coefficient is defined as
[TABLE]
and the random variable of interest is
[TABLE]
with and . The objective is to compute using the proposed method with . For a each realisation of and , the partial differential equation Eq. 16 is solved by finite element method on a linear and uniform meshing defined on with elements in the th dimension for a multi-index . The terms corresponding to any index such that are not computed in order to avoid numerical issues with degenerated elements.
To better understand the accuracy associated with each index, some of the produced meshes at different levels are shown in Fig. 1. Figure 2 shows the RMSE as a function of the computational effort for the unbiased MIMC when the greatest multi-index available is either , or , compared with the MIMC algorithm described in [8, Sec. 3.2.2]. Three scalar parameters have to be set in the MIMC algorithm, the accuracy , a splitting parameter and a confidence level defined such that
[TABLE]
where is the MIMC estimator. The values , and are considered here. The maximum computational effort considered is increased with the value of to take into account the corresponding computational overhead.
Comparing Figs. 2(a) to 2(c), it appears that the proposed implementation of the UMIMC and the considered version of MIMC behave very differently in time: the UMIMC has a higher error at the start since it relies on all levels at all times and requires more computational effort to compensate for the randomness in the coefficient of the considered PDE. This effect is more pronounced when the value of increases. However, the error in the MIMC increases at the times when it starts to perform computations for higher indices, since the effect of the random coefficients has to be averaged again, whereas the error of the UMIMC decreases monotonically. These remarks about each method do not allow to conclude that one is better than the other, but they highlights the differences in terms of implementation: the considered version of the MIMC attempts to reach a given level of precision determined by some parameters while the UMIMC requires the setting of fewer parameters but offers less control on its behaviour.
4.3 Inference of partially observed solutions of SPDE
Following [14], consider an unknown signal modelled by the solution of the SPDE
[TABLE]
where the eigenfunction has corresponding eigenvalue , and with a cylindrical Brownian motion
[TABLE]
where the terms , , are independent Brownian motions. The final time is set to and the variance of the Brownian motion is set to . The SPDE is observed at times for and at the locations for under an additive Gaussian noise with standard deviation , for some integers and . The observation vector at time is denoted and is made of the scalar observations made at the locations , . The corresponding likelihood function is
[TABLE]
where is the th component of the vector and where is the value of the SPDE at time and location , for any . An example of the set of observations obtained on one solution of Eq. 18 is given in Fig. 3. To ease the estimation procedure, the standard deviation of the observation noise is taken times bigger in the UMIMC and MIMC recursions.
At index , this SPDE is solved using the exponential Euler scheme of [14] with the first eigenfunctions and time steps. The MIMC is used in the same way as in the previous section, but with a tolerance . The quantity to estimate is the integral of the true path at the last time step, so that
[TABLE]
where the expectation is w.r.t. to the path , where is the integral of the path and where is the vector containing the values of the path at the locations and at time for some . The experiments are run with and . The results displayed in Fig. 4 show the same type of behaviour as in the simulation study of Section 4.2 in spite of the differences between the two considered problems, e.g. in the nature of the approximation represented by the indices (spatial discretization in 2 dimensions vs. time discretization plus number of basis functions).
5 Summary
In this article, we have considered exact approximation of expectations associated to probability laws with discretizations in multiple dimensions. We have developed several optimality results and implemented the methodology to a couple of numerical examples.
Future work associated to this methodology, includes combining our method in scenarios for which independent sampling from the (discretized) multi-index target is not possible. For instance, where one has to use Markov chain or sequential Monte Carlo methods (e.g. [2] in the case of a single index). The analysis in such a scenario is of interest as is its application, to enhance the range of examples where our approach can be implemented. This is being conducted in [10].
Acknowledgements
JH & AJ were supported by an AcRF tier 2 grant: R-155-000-143-112. AJ is affiliated with the CQF, RMI and OR cluster, NUS. DC was partially supported by the EPSRC grant: EP/N023781/1.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Beskos , A., Roberts , G. O., Stuart , A. M., & Voss , J. (2008). MCMC for diffusion bridges. Stoch. Dynam. , 8 , 319–350.
- 2[2] Beskos , A., Jasra, A., Law , K. J. H, Tempone , R. & Zhou , Y. (2017). Multilevel sequential Monte Carlo samplers. Stoch. Proc. Appl. , 127 , 1417-1440.
- 3[3] Beskos , A., Crisan , D., Jasra , A., Kamatani , K., & Zhou , Y. (2017). A stable particle filter in high-dimensions. Adv. Appl. Probab. , 49 , 24-48.
- 4[4] Crisan , D., Rozovskii , B. (2011) The Oxford handbook of nonlinear filtering, Oxford Univ. Press . Oxford,
- 5[5] Giles , M. B. (2008). Multilevel Monte Carlo path simulation. Op. Res. , 56 , 607-617.
- 6[6] Giles , M. B (2015). Multilevel Monte Carlo methods. Acta Numerica , 24 , 259-328.
- 7[7] Glynn , P. W. & Whitt , W. (1992). The asymptotic efficiency of simulation estimators. Op. Res. , 40 , 505–520.
- 8[8] Haji-Ali, A. L., Nobile , F. & Tempone , R. (2016). Multi-Index Monte Carlo: When sparsity meets sampling. Numerische Mathematik , 132 , 767–806.
