Unbiased Estimation of the Reciprocal Mean for Non-negative Random Variables
Sarat Moka, Dirk P. Kroese, Sandeep Juneja

TL;DR
This paper analyzes unbiased Monte Carlo estimators for the reciprocal of the expectation of a non-negative random variable, focusing on their theoretical properties, optimal parameter choices, and practical confidence intervals.
Contribution
It provides a detailed theoretical investigation of the properties of unbiased estimators for 1/ E[Z], including optimal parameter selection and asymptotic behavior.
Findings
Expected time variance product decreases as w decreases.
Estimator is asymptotically equivalent to the maximum likelihood biased ratio estimator.
Practical confidence intervals are established.
Abstract
Many simulation problems require the estimation of a ratio of two expectations. In recent years Monte Carlo estimators have been proposed that can estimate such ratios without bias. We investigate the theoretical properties of such estimators for the estimation of , where . The estimator, , is of the form , where and is any random variable with probability mass function on the positive integers. For a fixed , the optimal choice for is well understood, but less so the choice of . We study the properties of as a function of~ and show that its expected time variance product decreases as decreases, even though the cost of constructing the estimator increases with . We also show that the estimator is asymptotically equivalent to the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Unbiased Estimation of the Reciprocal Mean for Non-negative Random Variables
Sarat Moka
School of Mathematics and Physics, The University of Queensland, Brisbane
Dirk P. Kroese
School of Mathematics and Physics, The University of Queensland, Brisbane
Sandeep Juneja
School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai
Abstract
Many simulation problems require the estimation of a ratio of two expectations. In recent years Monte Carlo estimators have been proposed that can estimate such ratios without bias. We investigate the theoretical properties of such estimators for the estimation of , where . The estimator, , is of the form , where and is any random variable with probability mass function on the positive integers. For a fixed , the optimal choice for is well understood, but less so the choice of . We study the properties of as a function of and show that its expected time variance product decreases as decreases, even though the cost of constructing the estimator increases with . We also show that the estimator is asymptotically equivalent to the maximum likelihood (biased) ratio estimator and establish practical confidence intervals.
1 Introduction
Over the past few years, unbiased Monte Carlo estimation methods have received significant attention, due to both theoretical interest and practical applications; see, for example, Rhee and Glynn, (2015); Blanchet et al., (2015); Blanchet and Glynn, (2015); Jacob and Thiery, (2015); McLeish, (2011). Efficient unbiased estimation of non-linear functions of expectations of random variables is challenging and has several applications; see, for example, Blanchet et al., (2015); Jacob and Thiery, (2015). An important “canonical” case is the unbiased estimation of for a non-negative random variable . Applications include regenerative simulation, estimating parameters involving densities with unknown normalizing constants, and Bayesian inference.
Motivated by these applications, we study properties of an unbiased estimator of proposed by Blanchet et al., (2015) (which is in turn based on the ideas proposed by Rhee and Glynn, (2015) in the context of stochastic differential equations). The estimator is obtained as follows. Write for ; here the condition guarantees the convergence of the geometric series . Further, let be a sequence of iid copies of , and let be a non-negative integer-valued random variable with , for all . Then
[TABLE]
Define,
[TABLE]
Clearly, and thus is an unbiased estimator of . Note that if almost surely for a constant , then with the choice , becomes non-negative. In this paper, the goal is to study optimal choices for and that make efficient. In particular, a brief description of our contributions is as follows:
- •
When is the variance-minimizing distribution for a fixed , we show that as , the expected cost to construct increases to , while both the variance and the expected time variance product of decease.
- •
As a consequence, we argue that for any , instead of approximating with a sample mean of iid copies of , it is optimal to approximate it by just one outcome of , where is such that and the expected cost of constructing is the same as that of the sample mean.
- •
We study the asymptotic distribution of as (i.e., as the expected computational cost for the estimator goes to ). We establish a central limit theorem type convergence result that is useful for finding asymptotically valid confidence intervals.
- •
We compare the asymptotic performance of the unbiased estimator with that of the maximum likelihood (biased) ratio estimator, where is estimated using the reciprocal of a sample mean of iid copies of .
- •
The above results are studied under the assumption that has the variance-minimizing distribution. Generating samples from this distribution is impossible as it involves unknown parameters. Since is unbiased even for a different distribution of , we develop a method to implement the estimator by proposing a distribution for (using samples of ) that closely resembles to the variance-minimizing distribution .
Background: Several applications of Monte Carlo simulation involve the estimation of for a non-negative random variable . In some applications it is a desirable property to have an unbiased estimator of when the magnitudes of the available biased estimators are unknown a priori. Examples include the estimation of a steady-state parameter for a regenerative stochastic process, where denotes the length of a regenerative cycle and denotes the cumulative reward obtained over the regenerative cycle; see, e.g, Glynn, (2006); Asmussen and Glynn, (2007); Moka and Juneja, (2015). It is evident that we have an unbiased estimator of if we have an unbiased estimator of . A similar case is where parameters can be expressed as for some real-valued function and probability density , where is known up to the normalizing constant . Such densities occur, for example, in Gibbs point processes (Møller and Waagepetersen, (2003)); and a standard method to estimate such parameters is by using Markov Chain Monte Carlo (MCMC) methods, see Asmussen and Glynn, (2007); Rubinstein and Kroese, (2017). However, in many situations it is difficult to bound the bias of the MCMC estimator, as the mixing time of the Markov chain is unknown. An alternative approach is to use a ratio estimator, where is approximated by ratio of the sample means of the numerator and the denominator. However, this still returns a biased estimator and the bias decreases at a rate that is inversely proportional to the sample size; see Remark 2 and also Asmussen and Glynn, (2007). Therefore, it is desirable to have an unbiased estimator for (and equally for ) that has the same order of complexity as that of the ratio estimator.
Most importantly, in some applications, having an unbiased estimator of is essential. For example, in the study of doubly intractable models in Bayesian inference, it is assumed that the observations follow a distribution with a density of the form where can be evaluated point-wise up to the normalizing constant ; see, for example, Lyne et al., (2015); Walker, (2011); Jacob and Thiery, (2015). Standard Metropolis–Hastings algorithms to obtain posterior estimates are not applicable due to the intractability of the normalizing constant. However, an exact inference method called pseudo-marginal Metropolis–Hastings proposed by Andrieu and Roberts, (2009) can be implemented if a non-negative unbiased estimator of is available; also see Beaumont, (2003); Jacob and Thiery, (2015); Walker, (2011). In particular, Jacob and Thiery, (2015) highlight the importance of the estimators of the form (1).
A standard method called Russian roulette truncation can be used for unbiased estimation of . This method is first proposed in the physics literature Carter and Cashwell, (1975); Lux and Koblinger, (1991) and further studied by McLeish, (2011); Glynn and Rhee, (2014); Lyne et al., (2015); Wei and Murray, (2016). The key drawback of these estimators is that they can take negative values with positive probability even when is bounded.
Organization of the paper: In Section 2, we study the properties of the estimator as a function of , under the assumption that has the variance minimizing estimator. An implementable method is proposed in Section 3. A conclusion of the paper is given in Section 4. All the results are proved in Appendix A.
2 Properties of the Estimator
Without loss of generality, assume that is non-degenerate. As the estimator in (1) is unbiased, a sample mean of independent copies of is an unbiased estimator of as well. It is well known that the sample mean has square-root convergence rate if ; see, e.g., Asmussen and Glynn, (2007). Thus, a simple strategy is to seek values of and that minimize . Using the Cauchy–Schwarz inequality, Blanchet et al., (2015) show that for any , is finite and is minimal if has a geometric distribution on the non-negative integers with success probability
[TABLE]
that is, if
[TABLE]
where the assumption guarantees that . Unfortunately, depends on and , which are unknown. However, is unbiased even when has a different distribution. Therefore, in the implementation of , we can replace with an estimate of ; see Section 3. In this section, we study the properties of under the assumption that has the distribution (2), because it offers an understanding of what is the best that can be expected from the estimator.
Note that the variance of is given by,
[TABLE]
for all . Further, observe that . Now we can ask what is the value of that minimizes (3). This question is not addressed in the existing literature. In addition to the variance, it is often important to include the running time to construct the estimator to determine its efficiency; see Glynn and Whitt, (1992). In that case, we need to select for which the expected time variance product, is minimal, where is the time required to construct . From (1) (since are iid), it is reasonable to assume that is proportional to the number of ’s used for constructing . Since samples of are used in the construction of , we assume that the expected time variance product is .
Theorem 1**.**
Suppose that has the geometric distribution given in (2). Then the following hold true.
- (i)
The success probability is a strictly concave function of with a maximum value of attained at , and
[TABLE]
- (ii)
The variance is a strictly increasing convex function of , with
[TABLE]
- (iii)
The expected time variance product is a strictly increasing function of , with
[TABLE]
Remark 1**.**
To understand the implications of Theorem 1, suppose we select a , giving an expected computational cost of to obtain . Further, let be the sample mean of iid copies of . Then the expected time variance product for is
[TABLE]
Now suppose that is selected so that the average cost to generate one outcome of is equal to the average cost to construct ; that is, . Then, from Theorem 1 (iii), for the same computational effort, has a smaller variance than for any feasible selected as above, and is therefore a better estimator.
To illustrate the results of Theorem 1, consider an example where for an event with probability . Since , the relative variance . By substituting the values of and , we can calculate , and for any . Figure 1 illustrate the effect of on the efficacy of the estimator . As expected, both and are decreasing as with the limits [math] and , respectively. ∎
Remark 1 motivates us to study the asymptotic distributional properties of as , when has the geometric distribution given in (2). Theorem 2 is crucial for establishing confidence intervals that are asymptotically valid as .
Theorem 2**.**
Suppose that has the distribution given by (2) and is bounded. Then, as ,
[TABLE]
*where denotes convergence in distribution, and and are independent random variables from respectively the standard (mean 1) exponential and standard normal distributions.
We show later in Section A.1 that as (see (8)). Since and , as . That means, an alternative expression for Theorem 2 (ii) is
[TABLE]
where . The above expression has more resemblance to the standard central limit theorem, since is the computational cost of the estimator. It is not difficult to show that is a random variable with density , which is the density of a mean zero Laplace (or double exponential) distribution with scale . These observations are useful for constructing asymptotically valid confidence intervals as follows. For any given , by solving for , we get . Then using Theorem 2, we can say that the interval
[TABLE]
is an asymptotic confidence interval for .
Remark 2** (Comparison with the ratio estimator).**
A standard (biased) estimator of is , where is the sample mean of iid copies of . Using Taylor’s theorem for the function about , we can easily show that the bias of is approximately for large , while, on the other hand, has zero bias. Furthermore, using the same Taylor’s theorem, we can show that the asymptotic time variance product of is as . From Theorem 1 (iii), the unbiased estimator has the same asymptotic expected time variance product. However, unbiasedness of comes at cost. As an application of the delta method, we can show that the ratio estimator satisfies the central limit theorem: That is, the ratio estimator is asymptotically normal. On the other hand, the asymptotic distribution of the unbiased estimator is Laplace, which has more slowly decaying tails than a normal distribution. In conclusion, the ratio estimator can have narrower confidence intervals than the unbiased estimator.
Remark 3** (Importance sampling).**
Just like in the case of the ratio estimator, from Theorems 1 and 2, the relative variance of is the key factor influencing the asymptotic properties of the unbiased estimator. The smaller the value is of the relative variance of , the better is the reliability of the unbiased estimator. One of the most effective technique of variance reduction is importance sampling. We can improve the performance of the estimator if we can implement an importance sampling technique for the random variable .
Remark 4** (The time variance product minimizing distribution for ).**
We have assumed that for a given the random variable has the variance minimizing distribution given by (2). However, when the criteria for the optimality of is the minimization of the expected time variance product, we need to seek a distribution that minimizes . It is shown in Blanchet et al., (2015) that the distribution that minimizes the expected time variance is given by
[TABLE]
where is the unique (positive) number satisfying When compared to the distribution (2), drawing samples from (4) has an extra difficulty of finding by solving an equation that contains an infinite sum. Even if we overcome this difficulty, the reduction in the expected time variance product is typically insignificant, because for small values of ,
[TABLE]
when has the variance minimizing distribution (2); see Section A.3 for a proof of (5). ∎
3 An Implementation
Recall that the success probability of is a function of unknown quantities and . However, fortunately, in (1) is still an unbiased estimator of for any distribution of . In particular, instead of taking as in (2), we can take , where is defined below. Under the proposed implementation, when the given budget is sufficiently large, half of the budget is used for estimating and then is chosen such that the remaining half the budget is used for generating a sample of the unbiased estimator.
To simplify the discussion, assume that there is a known constant ; for example, if for a constant , we can take . Let be a sequence of iid copies of , independent of the sequence , which is used in the construction of the unbiased estimator in (1). Define the first two sample moments: and . If , define,
[TABLE]
Otherwise, take and . The condition guarantees that . Further, whether or not, and hence it guarantees that the estmator (defined by (1)) with is an unbiased estimator of . Note that given and , the expected cost to construct to is (including the cost to construct ), since \mathbb{E}\left[N(w_{k})\big{|}M_{1}(k),M_{2}(k)\right]=1/P_{k}-1. Theorem 3 states that for large values of , this total expected cost is approximately , and conditioned on and , the expected time variance product goes to a random variable with mean . See Section A.4 for a proof Theorem 3.
Theorem 3**.**
Under the above construction, , and as , , and
[TABLE]
where is the square of a standard normal random variable.
To understand Theorem 3, consider the example given in Remark 1. We estimated the expected total cost and using samples of and , respectivley, with . Our simulation results show that the estimated expected time relative variance product is , which is approximately equal to , as expected.
4 Conclusion
We investigated the theoretical properties of a parametrized family of unbiased estimators of for a non-negative random variable . We studied the variance and the expected time variance product as functions of and established several asymptotic results. We showed that with an optimal choice of , the asymptotic performance of the unbiased estimator is comparable to the maximum likelihood (biased) ratio estimator. We further proposed an implementable unbiased estimation based on our results. Similar to Theorem 2, our ongoing research establishes a central limit theorem type convergence result for defined in Section 3, by taking the budget parameter .
Appendix A Appendix
To simplify the notation in this section, we use and . We also write for the derivative and for the second derivative. and denote respectively the mean exponential distribution and the geometric distribution on non-negative integers and the success probability .
A.1 Proof of Theorem 1
From the definition, It follows that where the strict inequality holds because , which follows from the assumption that is non-degenerative. Therefore, is strictly concave over and it achieves its maximum value at . From the definition of , it is evident that .
Recall from (3) that the variance of the estimator is equal to . Its derivative can be written as
[TABLE]
and the second derivative as
[TABLE]
Using Jensen’s inequality, where the strict inequality holds again because is non-degenerative. On the other hand, by Bernoulli’s inequality,
[TABLE]
is maximized by Thus,
[TABLE]
Using (8), we have and , and hence for all , and which establishes the convexity of over .
We now prove that the expected time variance product is a strictly increasing over . Let , and Then,
[TABLE]
and hence, we write
[TABLE]
where the inequality holds because , and . Therefore, is strictly increasing over .
The claims that and hold trivially because . To complete the proof of the theorem, we can write, by Taylor’s theorem, for any : for some Consequently,
[TABLE]
where for some . Since as and , we have . Further simplification yields that and thus, Since ,
[TABLE]
We conclude that both and go to their respective minima as .
A.2 Proof of Theorem 2
Statement (i) follows directly from Theorem 1 and Chebyshev’s inequality:
[TABLE]
for every . To prove (ii), consider a decreasing sequence such that and . We construct an almost surely increasing sequence such that and
[TABLE]
for a random variable . To do so we invoke Theorem 3.1 of Moka and Juneja, (2015). Let and . Then, Moka and Juneja, (2015) says that for each , there exist a random variable with cumulative distribution function
[TABLE]
such that is independent of , and has the same distribution as . Therefore, without loss of generality we assume that there is sequence of independent random variables such that for all .
Consider the natural filtration . Since ,
[TABLE]
where the last inequality holds because . We have since is independent of . Thus, is a supermartingale (with respect to ). In fact the sequence is bounded in , because , making it uniformly integrable submartingale. Thus, exists (see Theorem 2 in Section 4 of Chapter VII of Shiryaev, (1996)). Since This implies that .
Let . Then for all we have and . From (8), and hence . From the convergence of the sequence , we have . Therefore, (10) holds.
Now define
[TABLE]
From the definitions, is identical to in distribution. We now conclude the proof Theorem 2 by establishing lower and upper bounds on separately. Let be an upper bound on . From the construction of given by (11), for all such that , we have using (8) that
[TABLE]
Using Taylor’s theorem, for any , and thus,
[TABLE]
On the other hand, from (11) and (8),
[TABLE]
Using the strong law of large numbers and Theorem 1 of Richter, (1965),
[TABLE]
Further, using (8) and (10), we have that almost surely. From Taylor’s theorem with a Cauchy remainder term, we have almost surely
[TABLE]
for a random variable that takes values between and . Therefore, to complete the proof of (ii), it is enough to show that From (12),
[TABLE]
From (10) and because , the second term on the right hand side of the expression goes to zero. Again using (10), we conclude that the right hand side of (14) goes to in distribution. From (13),
[TABLE]
We complete the proof because from (8), as ,
[TABLE]
A.3 Proof of (5)
First, observe from the definitions that Further using the fact that and (8),
[TABLE]
where the last inequality holds from Jensen’s inequality, because is a convex function of for any constant and is a probability distribution. Furthermore, using , we write Since the distribution (4) is not the variance minimizing distribution, we have
[TABLE]
We establish (5) by showing that . From the definition of ,
[TABLE]
and thus, using , we write that Consequently, . Hence, using (8), , that is, and thus, because . This concludes that and hence establishes (5).
A.4 Proof of Theorem 3
From the assumption, we have . Therefore, similar to (8), we can obtain that
[TABLE]
From the definitions of and , it is easy to see that . Using the upper bound in (15), we show that is lower bounded by . Since and , for every realization of , there exists a such that for all , and hence \mathbb{V}\mathrm{ar}\,\left(\widehat{\beta}(w_{k})\big{|}M_{1}(k),M_{2}(k)\right) is finite and equal to . It is now enough to show that
[TABLE]
Write , where . Using (9) and , we have . By simplifying the terms in , we have Since , we can write . Therefore, using and the fact that asymptotically has a zero-mean normal distribution with variance , we complete the proof with the observation that as .
Acknowledgements
The work of the first and the second authors has been supported by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), under grant number CE140100049.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Andrieu and Roberts, (2009) Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. , 37(2):697–725.
- 2Asmussen and Glynn, (2007) Asmussen, S. and Glynn, P. W. (2007). Stochastic simulation: algorithms and analysis , volume 57 of Stochastic Modelling and Applied Probability . Springer, New York.
- 3Beaumont, (2003) Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics , 164(3):1139–1160.
- 4Blanchet et al., (2015) Blanchet, J. H., Chen, N., and Glynn, P. W. (2015). Unbiased Monte Carlo computation of smooth functions of expectations via taylor expansions. In Proceedings of the 2015 Winter Simulation Conference , WSC ’15, pages 360–367, Piscataway, NJ, USA. IEEE Press.
- 5Blanchet and Glynn, (2015) Blanchet, J. H. and Glynn, P. W. (2015). Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In Proceedings of the 2015 Winter Simulation Conference , WSC ’15, pages 3656–3667, Piscataway, NJ, USA. IEEE Press.
- 6Carter and Cashwell, (1975) Carter, L. and Cashwell, E. (1975). Particle-transport simulation with the monte carlo method. Technical Report .
- 7Glynn, (2006) Glynn, P. W. (2006). Chapter 16 simulation algorithms for regenerative processes. In Henderson, S. G. and Nelson, B. L., editors, Simulation , volume 13 of Handbooks in Operations Research and Management Science , pages 477–500. Elsevier.
- 8Glynn and Rhee, (2014) Glynn, P. W. and Rhee, C.-H. (2014). Exact estimation for markov chain equilibrium expectations. J. Appl. Probab. , 51A:377–389.
