TL;DR
This paper introduces an energy statistic-based ABC algorithm that improves likelihood-free Bayesian inference by providing asymptotic guarantees and demonstrating competitive performance across various models.
Contribution
It proposes a novel ABC method using the energy statistic, with new asymptotic results and a consistent estimator, enhancing the robustness of likelihood-free inference.
Findings
The energy statistic-based ABC converges to the true pseudo-posterior.
The method performs well compared to alternative discrepancy measures.
Asymptotic results hold for increasing sample sizes in both observed and simulated data.
Abstract
Approximate Bayesian computation (ABC) has become an essential part of the Bayesian toolbox for addressing problems in which the likelihood is prohibitively expensive or entirely unknown, making it intractable. ABC defines a pseudo-posterior by comparing observed data with simulated data, traditionally based on some summary statistics, the elicitation of which is regarded as a key difficulty. Recently, using data discrepancy measures has been proposed in order to bypass the construction of summary statistics. Here we propose to use the importance-sampling ABC (IS-ABC) algorithm relying on the so-called two-sample energy statistic. We establish a new asymptotic result for the case where both the observed sample size and the simulated data sample size increase to infinity, which highlights to what extent the data discrepancy measure impacts the asymptotic pseudo-posterior. The resultâŚ
| MAE | RMSE | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| ES | 0.594 | 0.045 | 0.607 | 0.063 | 0.215 | 0.030 | 0.283 | 0.055 | |
| KL | 0.648 | 0.039 | 0.666 | 0.048 | 0.165 | 0.016 | 0.205 | 0.026 | |
| WA | 0.675 | 0.035 | 0.682 | 0.043 | 0.152 | 0.020 | 0.181 | 0.021 | |
| MMD | 0.564 | 0.079 | 0.582 | 0.076 | 0.234 | 0.054 | 0.311 | 0.101 | |
| ES | 0.587 | 0.063 | 0.613 | 0.059 | 0.215 | 0.038 | 0.282 | 0.069 | |
| KL | 0.651 | 0.042 | 0.667 | 0.061 | 0.169 | 0.022 | 0.210 | 0.027 | |
| WA | 0.655 | 0.050 | 0.669 | 0.047 | 0.152 | 0.015 | 0.187 | 0.019 | |
| MMD | 0.559 | 0.076 | 0.598 | 0.075 | 0.235 | 0.049 | 0.313 | 0.092 | |
| ES | -0.699 | 0.046 | -0.716 | 0.040 | 1.401 | 0.043 | 1.412 | 0.039 | |
| KL | -0.709 | 0.029 | -0.712 | 0.035 | 1.409 | 0.029 | 1.415 | 0.029 | |
| WA | -0.699 | 0.030 | -0.704 | 0.037 | 1.399 | 0.030 | 1.404 | 0.030 | |
| MMD | -0.709 | 0.054 | -0.731 | 0.036 | 1.411 | 0.051 | 1.422 | 0.038 | |
| ES | -0.696 | 0.058 | -0.712 | 0.043 | 1.396 | 0.058 | 1.407 | 0.049 | |
| KL | -0.711 | 0.047 | -0.704 | 0.057 | 1.411 | 0.047 | 1.416 | 0.047 | |
| WA | -0.695 | 0.043 | -0.695 | 0.053 | 1.395 | 0.043 | 1.401 | 0.043 | |
| MMD | -0.711 | 0.066 | -0.726 | 0.046 | 1.411 | 0.066 | 1.424 | 0.052 |
| MAE | RMSE | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| ES | 0.569 | 0.042 | 0.570 | 0.045 | 0.083 | 0.015 | 0.100 | 0.017 | |
| KL | 0.664 | 0.028 | 0.658 | 0.031 | 0.106 | 0.017 | 0.132 | 0.019 | |
| WA | 0.509 | 0.033 | 0.505 | 0.038 | 0.112 | 0.022 | 0.133 | 0.026 | |
| MMD | 0.583 | 0.044 | 0.586 | 0.048 | 0.079 | 0.013 | 0.096 | 0.015 | |
| ES | 0.215 | 0.035 | 0.219 | 0.035 | 0.111 | 0.015 | 0.135 | 0.019 | |
| KL | 0.274 | 0.023 | 0.280 | 0.027 | 0.110 | 0.014 | 0.134 | 0.014 | |
| WA | 0.205 | 0.025 | 0.207 | 0.030 | 0.090 | 0.029 | 0.112 | 0.034 | |
| MMD | 0.220 | 0.037 | 0.220 | 0.036 | 0.108 | 0.010 | 0.132 | 0.012 |
| MAE | RMSE | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| ES | 1.299 | 0.223 | 1.189 | 0.264 | 0.713 | 0.130 | 0.885 | 0.165 | |
| KL | 1.389 | 0.190 | 1.333 | 0.165 | 0.696 | 0.151 | 0.877 | 0.205 | |
| WA | 1.286 | 0.220 | 1.193 | 0.265 | 0.672 | 0.128 | 0.828 | 0.153 | |
| MMD | 1.229 | 0.188 | 1.143 | 0.241 | 0.676 | 0.092 | 0.836 | 0.121 | |
| ES | 1.362 | 0.185 | 1.290 | 0.237 | 0.716 | 0.118 | 0.904 | 0.131 | |
| KL | 1.235 | 0.152 | 1.153 | 0.170 | 0.588 | 0.070 | 0.745 | 0.097 | |
| WA | 1.292 | 0.196 | 1.240 | 0.241 | 0.657 | 0.114 | 0.817 | 0.139 | |
| MMD | 1.268 | 0.173 | 1.170 | 0.171 | 0.669 | 0.103 | 0.841 | 0.131 | |
| ES | 1.170 | 0.132 | 1.183 | 0.157 | 0.459 | 0.045 | 0.552 | 0.049 | |
| KL | 1.083 | 0.100 | 1.077 | 0.088 | 0.394 | 0.034 | 0.496 | 0.045 | |
| WA | 1.229 | 0.118 | 1.216 | 0.132 | 0.426 | 0.054 | 0.521 | 0.059 | |
| MMD | 1.181 | 0.116 | 1.182 | 0.143 | 0.456 | 0.051 | 0.548 | 0.061 | |
| ES | 1.128 | 0.112 | 1.113 | 0.138 | 0.435 | 0.032 | 0.534 | 0.045 | |
| KL | 1.133 | 0.111 | 1.086 | 0.135 | 0.390 | 0.038 | 0.498 | 0.051 | |
| WA | 1.218 | 0.110 | 1.196 | 0.108 | 0.409 | 0.049 | 0.514 | 0.066 | |
| MMD | 1.150 | 0.098 | 1.133 | 0.130 | 0.423 | 0.041 | 0.518 | 0.049 | |
| ES | 1.343 | 0.096 | 1.360 | 0.104 | 0.428 | 0.052 | 0.514 | 0.059 | |
| KL | 1.300 | 0.087 | 1.250 | 0.065 | 0.384 | 0.040 | 0.491 | 0.061 | |
| WA | 1.300 | 0.101 | 1.298 | 0.105 | 0.370 | 0.058 | 0.446 | 0.066 | |
| MMD | 1.258 | 0.115 | 1.232 | 0.120 | 0.375 | 0.055 | 0.454 | 0.063 |
| MAE | RMSE | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| ES | 3.024 | 0.044 | 3.009 | 0.047 | 0.133 | 0.016 | 0.170 | 0.018 | |
| KL | 2.955 | 0.030 | 2.948 | 0.033 | 0.105 | 0.013 | 0.128 | 0.013 | |
| WA | 3.043 | 0.045 | 3.052 | 0.067 | 0.232 | 0.020 | 0.277 | 0.020 | |
| MMD | 3.081 | 0.061 | 3.062 | 0.065 | 0.177 | 0.029 | 0.221 | 0.036 | |
| ES | 1.046 | 0.062 | 1.027 | 0.079 | 0.268 | 0.024 | 0.322 | 0.029 | |
| KL | 0.918 | 0.071 | 0.885 | 0.068 | 0.313 | 0.026 | 0.375 | 0.029 | |
| WA | 0.894 | 0.127 | 0.869 | 0.136 | 0.277 | 0.044 | 0.334 | 0.045 | |
| MMD | 0.899 | 0.069 | 0.855 | 0.079 | 0.374 | 0.029 | 0.440 | 0.030 | |
| ES | 2.289 | 0.101 | 2.264 | 0.210 | 0.872 | 0.098 | 1.026 | 0.091 | |
| KL | 2.993 | 0.080 | 3.046 | 0.121 | 1.043 | 0.070 | 1.193 | 0.066 | |
| WA | 2.581 | 0.101 | 2.599 | 0.147 | 0.858 | 0.078 | 1.025 | 0.075 | |
| MMD | 2.184 | 0.128 | 2.227 | 0.190 | 0.904 | 0.103 | 1.052 | 0.100 | |
| ES | 0.476 | 0.046 | 0.444 | 0.067 | 0.225 | 0.014 | 0.270 | 0.015 | |
| KL | 0.550 | 0.059 | 0.498 | 0.064 | 0.252 | 0.029 | 0.317 | 0.045 | |
| WA | 0.544 | 0.095 | 0.526 | 0.094 | 0.189 | 0.035 | 0.238 | 0.046 | |
| MMD | 0.691 | 0.056 | 0.621 | 0.072 | 0.380 | 0.041 | 0.502 | 0.070 | |
| ES | -0.163 | 0.047 | -0.178 | 0.069 | 0.197 | 0.032 | 0.246 | 0.034 | |
| KL | -0.291 | 0.034 | -0.324 | 0.037 | 0.117 | 0.014 | 0.144 | 0.020 | |
| WA | -0.288 | 0.026 | -0.314 | 0.035 | 0.125 | 0.016 | 0.152 | 0.020 | |
| MMD | -0.194 | 0.047 | -0.210 | 0.063 | 0.174 | 0.030 | 0.218 | 0.035 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Approximate Bayesian computation via the energy statistic
HIEN D. NGUYEN1
1Department of Mathematics and Statistics, La Trobe University, Bundoora Melbourne 3066, Victoria Australia. (e-mail: [email protected]) 2Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
JULYAN ARBEL2
1Department of Mathematics and Statistics, La Trobe University, Bundoora Melbourne 3066, Victoria Australia. (e-mail: [email protected]) 2Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
HONGLIANG LĂ2
1Department of Mathematics and Statistics, La Trobe University, Bundoora Melbourne 3066, Victoria Australia. (e-mail: [email protected]) 2Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
FLORENCE FORBES2
1Department of Mathematics and Statistics, La Trobe University, Bundoora Melbourne 3066, Victoria Australia. (e-mail: [email protected]) 2Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Abstract
Approximate Bayesian computation (ABC) has become an essential part of the Bayesian toolbox for addressing problems in which the likelihood is prohibitively expensive or entirely unknown, making it intractable. ABC defines a pseudo-posterior by comparing observed data with simulated data, traditionally based on some summary statistics, the elicitation of which is regarded as a key difficulty. Recently, using data discrepancy measures has been proposed in order to bypass the construction of summary statistics. Here we propose to use the importance-sampling ABC (IS-ABC) algorithm relying on the so-called two-sample energy statistic. We establish a new asymptotic result for the case where both the observed sample size and the simulated data sample size increase to infinity, which highlights to what extent the data discrepancy measure impacts the asymptotic pseudo-posterior. The result holds in the broad setting of IS-ABC methodologies, thus generalizing previous results that have been established only for rejection ABC algorithms. Furthermore, we propose a consistent V-statistic estimator of the energy statistic, under which we show that the large sample result holds, and prove that the rejection ABC algorithm, based on the energy statistic, generates pseudo-posterior distributions that achieves convergence to the correct limits, when implemented with rejection thresholds that converge to zero, in the finite sample setting. Our proposed energy statistic based ABC algorithm is demonstrated on a variety of models, including a Gaussian mixture, a moving-average model of order two, a bivariate beta and a multivariate -and- distribution. We find that our proposed method compares well with alternative discrepancy measures.
1 Introduction
In recent years, Bayesian inference has become a popular paradigm for machine learning and statistical analysis. Good introductions and references to the primary methods and philosophies of Bayesian inference can be found in texts such as Press, (2003), Ghosh et al., (2006), Koch, (2007), Koop et al., (2007), Robert, (2007), Barber, (2012), Murphy, (2012). When conducting parametric Bayesian inference, we observe some realization of the data that are generated from some data generating process (DGP), which can be characterized by a parametric likelihood, given by a probability density function (PDF) , determined entirely via the parameter vector . Using expert knowledge, or based on computational considerations such as conjugacy, we endow the parameter with some prior PDF . The goal of Bayesian inference is then to characterize the posterior distribution
[TABLE]
where the prior predictive distribution is defined by
[TABLE]
In very simple cases, such as cases when the prior PDF is a conjugate of the likelihood (cf. Sec. 3.3 of Robert, (2007)), the posterior PDF (1) can be expressed explicitly. In the case of more complex but still tractable pairs of likelihood and prior PDFs, one can sample from (1) via a variety of Monte Carlo methods, such as those reported in Ch. 6 of Press, (2003).
In cases where the likelihood function is known but not tractable, or when the likelihood function has entirely unknown form, one cannot exactly sample from (1) in an inexpensive manner, or at all. In such situations, a sample from an approximation of (1) may suffice in order to conduct the userâs desired inference. Such a sample can be drawn via the method of approximate Bayesian computation (ABC).
It is generally agreed that the ABC paradigm originated from the works of Rubin, (1984), Pritchard et al., (1999); see TavarÊ, (2019) for details. Stemming from the initial listed works, there are now numerous variants of ABC methods. Some good reviews of the current ABC literature can be found in the expositions of Marin et al., (2012), Voss, (2014), Lintusaari et al., (2017), Karabatsos and Leisen, (2018). The volume Sisson et al., (2019) provides a comprehensive treatment regarding ABC methodologies.
The core philosophy of ABC is to define a pseudo-posterior by comparing data with plausibly simulated replicates. The comparison is traditionally based on some summary statistics, the choice of which being regarded as a key challenge of the approach.
In recent years, data discrepancy measures bypassing the construction of summary statistics have been proposed by viewing data sets as empirical measures. Recent examples of such an approach include the use of the maximum mean discrepancy (MMD) (Park et al.,, 2016), KullbackâLeibler divergence (Jiang et al.,, 2018), and the Wasserstein distance (Bernton et al.,, 2019). Furthermore, Jiang et al., (2018) also considered the use of the classification accuracy method of Gutmann et al., (2018), and the indirect inference method of Drovandi et al., (2015), in the data discrepancy context.
In this article, we develop upon the discrepancy measurement approach of Jiang et al., (2018), via the importance sampling ABC (IS-ABC) approach, which makes use of a weight function; see e.g. Karabatsos and Leisen, (2018). In particular, we report on a class of ABC algorithms that utilize the two-sample energy statistic (ES) of Szekely and Rizzo, (2004) (see also Baringhaus and Franz, (2004), Szekely and Rizzo, (2013, 2017), Mak and Joseph, (2018)). Our approach is related to the MMD ABC algorithms that were implemented in Park et al., (2016), Jiang et al., (2018), Bernton et al., (2019). The MMD is a discrepancy measurement that is closely related to the ES, cf. Sejdinovic et al., (2013).
We establish new asymptotic results that have not been proved in these previous papers. In the IS-ABC setting and in the regime where both the observation sample size and the simulated data sample size increase to infinity, our theoretical result highlights how the data discrepancy measure impacts the asymptotic pseudo-posterior. More specifically, we make the assumption that the data discrepancy measure converges to some asymptotic value , where stands for the âtrueâ parameter value associated to the DGP that generates observations . We then show that the pseudo-posterior distribution converges almost surely to a distribution depending on the prior and on the limiting value . In addition to our asymptotic results regarding large sample scenarios, we also provide corollaries regarding the performance of our ES-based ABC method, due to the general finite sample theoretical results of Bernton et al., (2019). Our asymptotic results provide useful approximations and guarantees for the practical application of our method.
The last decade has seen an active development in research on asymptotic properties of ABC. Early works revolved around the impact of the acceptance threshold on the ABC bias and the Monte Carlo error (Blum, 2010a, , Barber et al.,, 2015, Biau et al.,, 2015), and on the choice of summary statistics (Blum, 2010a, , Fearnhead and Prangle,, 2012, Prangle et al.,, 2014). Further works focused on large sample size properties such as consistency for model choice (Marin et al.,, 2014), asymptotic efficiency (Li and Fearnhead,, 2018), posterior consistency, and contraction rates (Frazier et al.,, 2018). It is with these results, where our article fits. Although devised in settings where likelihoods are assumed intractable, ABC can also be cast in the setting of robustness with respect to misspecification (Frazier et al.,, 2020). In particular, the ABC posterior distribution can be viewed as a special case of a coarsened posterior distribution (Miller and Dunson,, 2018).
The remainder of the article proceeds as follows. In Section 2, we introduce the general IS-ABC framework. In Section 3, we introduce the two-sample ES and demonstrate how it can be incorporated into the IS-ABC framework. Theoretical results regarding the IS-ABC framework and the two-sample ES are presented in Section 4. Illustrations of the IS-ABC framework are presented in Section 5. Conclusions are drawn in Section 6.
2 Importance sampling ABC
Assume that we observe independent and identically distributed (IID) replicates of from some DGP, which we put into . We suppose that the DGP that generates is dependent on some parameter vector from space , which is random and has prior PDF .
Denote to be the PDF of , given , and write
[TABLE]
where is a realization of , and each is a realization of ().
If were known, then we could use (1) to write the posterior PDF
[TABLE]
where is a constant that makes . When evaluating is prohibitive and ABC is required, then operating with is similarly difficult. We suppose that given any , we at least have the capability of sampling from the DGP with PDF . That is, we have a simulation method that allows us to feasibly sample the IID vector , for any , for a DGP with PDF
[TABLE]
Typically, one should choose , as it fulfils the hypotheses of all of our proved theoretical results. This choice is made throughout all of our numerical demonstrations. However, we anticipate that there may be practical or computational scenarios, where it may be advantageous to be able to choose , which is permissible in our methodological framework.
Using the simulation mechanism that generates samples and the prior distribution that generates parameters , we can simulate a set of simulations , where and is the transposition operator. Here, for each , is an observation from the DGP with joint PDF , hence each is composed of a parameter value and a datum conditional on the parameter value. We now consider how and can be combined in order to construct an approximation of (2).
Following the approach of Jiang et al., (2018), we define to be some non-negative real-valued function that outputs a small value if and are similar, and outputs a large value if and are different, in some sense. We call the data discrepancy measurement between and , and we say that is the data discrepancy function.
Next, we let be a non-negative, decreasing (in ), and bounded (importance sampling) weight function (cf. Section 3 of Karabatsos and Leisen, (2018)), which takes as inputs a data discrepancy measurement and a calibration parameter . Using the weight and discrepancy functions, we can propose the following approximation for (2).
In the language of Jiang et al., (2018), we call
[TABLE]
the pseudo-posterior PDF, where
[TABLE]
is the approximate likelihood function, and
[TABLE]
is a normalization constant. We can use (3) to approximate (2) in the following way. For any functional of the parameter vector of interest, say, we may approximate the posterior mean Bayesian estimator of via the expression
[TABLE]
where the right-hand side of (4) can be unbiasedly estimated using via
[TABLE]
We call the process of constructing (5), to approximate (4), the IS-ABC procedure. The general form of the IS-ABC procedure is provided in Algorithm 1.
Algorithm 1**.**
IS-ABC procedure for approximating .
Input: a data discrepancy function , a weight function , and a calibration parameter .
For ;
sample from PDF ;
generate from the DGP with PDF ;
put into .
Output: and construct the estimator .
3 The energy statistic (ES)
Let define a metric and let and be two random variables that are in a space endowed with a semi-metric , where (cf. Sejdinovic et al., (2013)). Furthermore, let and be two random variables that have the same distributions as and , respectively. Here, , , , and are all independent of one another.
Upon writing
[TABLE]
we can define the original ES of Baringhaus and Franz, (2004) and Szekely and Rizzo, (2004), as a function of and , via the expression , where is the power of the metric corresponding to the (; cf. (Szekely and Rizzo,, 2013, Prop. 2)). Thus, the original ES statistic, which we shall also denote as , is defined using the Euclidean metric .
The original ES has numerous useful mathematical properties. For instance, under the assumption that , it was shown that
[TABLE]
in Proposition 1 of Szekely and Rizzo, (2013), where is the gamma function and (respectively, ) is the characteristic function of (respectively, ). Thus, we have the fact that for any , and if and only if and are identically distributed.
The result above is generalized in Proposition 3 of Szekely and Rizzo, (2013), where we have the following statement. If is a continuous function and are independent random variables, then it is necessary and sufficient that is strictly negative definite (see Szekely and Rizzo, (2013) for the precise definition) for the following conclusion to hold: for any , and if and only if and are identically distributed.
We observe that there is thus an infinite variety of functions from which we can construct energy statistics. We shall concentrate on the use of the original ES, based on , since it is the most well known and popular of the varieties.
3.1 The V-statistic estimator
Suppose that we observe and , where the former is a sample containing IID replicates of , and the latter is a sample containing IID replicates of , respectively, with and being independent. In Gretton et al., (2012), it was shown that for any , upon assuming that , the so-called V-statistic estimator (cf. (Serfling,, 1980, Ch. 5) and Koroljuk and Borovskich, (1994))
[TABLE]
can be proved to converge in probability to , as and , under the condition that , for some constant (see also Gretton et al., (2007)). Here, the proof was provided in the context of MMDs (see definition in Section 3.3) but is easily portable to the ES setting.
We note that the assumption of this result is rather restrictive, since it either requires the bounding of the space or the function . In the sequel, we will present a result for the almost sure convergence of the V-statistic that depends on the satisfaction of a more realistic hypothesis.
It is noteworthy that if the ES is non-negative, then the V-statistic retains the non-negativity property of its corresponding ES (cf. Gretton et al., (2012)). That is, for any continuous and negative definite function , we have .
3.2 The ES-based IS-ABC algorithm
From Algorithm 1, we observe that an IS-ABC algorithm requires three components. A data discrepancy measurement , a weighting function , and a tuning parameter . We propose the use of the ES in the place of the data discrepancy measurement , in combination with various weight functions that have been used in the literature. That is we set
[TABLE]
in Algorithm 1.
In particular, we consider original ES, where . We name our framework the ES-ABC algorithm. In Section 4, we shall demonstrate that the proposed algorithm possesses desirable large sample qualities that guarantees its performance in practice, as illustrated in Section 5.
3.3 Related methods
The ES-ABC algorithm that we have presented here is closely related to ABC algorithms based on the maximum mean discrepancy (MMD) that were implemented in Park et al., (2016), Jiang et al., (2018), and Bernton et al., (2019). For each Mercer kernel function (, the corresponding MMD is defined via the equation
[TABLE]
where are random variable such that and are identically distributed to and , respectively.
The MMD as a statistic for testing goodness-of-fit was studied prominently in articles such as Gretton et al., (2007), Gretton et al., (2009), and Gretton et al., (2012). More details regarding the relationship between the two classes of statistics can be found in Sejdinovic et al., (2013).
We note two shortcomings with respect to the applications of the MMD as a basis for an ABC algorithm in the previous literature. Firstly, no theoretical results regarding the consistency of the MMD-based methods have been proved. And secondly, in the application by Park et al., (2016) and Jiang et al., (2018), the MMD was implemented using the unbiased U-statistic estimator, rather than the biased V-statistic estimator. Although both estimators are consistent, in the sense that they can be proved to be convergent to the desired limiting MMD value, the U-statistic estimator has the property of not being bounded from below by zero (cf. Gretton et al., (2012)). As such, it does not meet the strict definition of a data discrepancy measurement.
For a sufficiently large sample size, the U-statistic will have low probability of having a value less than zero, and thus the difference between the U-statistic and V-statistic becomes immaterial for large . One may also consider a truncation of the U-statistic, which causes no issues, asymptotically, as the U-statistic and V-statistic have the same limit, which is guaranteed to be non-negative.
4 Theoretical results
4.1 Behavior as and
4.1.1 Analysis with a generic discrepancy
We now establish a consistency result for the pseudo-posterior density (3), when and approach infinity. Our result generalizes the main result of Jiang et al., (2018) (i.e., Theorem 1), which is the specific case when the weight function is restricted to the form
[TABLE]
where \left\llbracket\cdot\right\rrbracket is the Iverson bracket notation, which equals 1 when the internal statement is true, and 0, otherwise (cf. Graham et al., (1994)).
The weighting function of form (8), when implemented within the IS-ABC framework, produces the common rejection ABC algorithms, that were suggested by TavarÊ et al., (1997), and Pritchard et al., (1999). We extended upon the result of Jiang et al., (2018) so that we may provide theoretical guarantees for more exotic ABC procedures, such as the kernel-smoothed ABC procedure of Park et al., (2016), which implements weights of the form
[TABLE]
for . See Karabatsos and Leisen, (2018) for further discussion and examples.
In order to prove our asymptotic result, we require Huntâs lemma, which is reported in Dellacherie and Meyer, (1980), as Theorem 45 of Section V.5. For convenience to the reader, we present the result, below.
Theorem 1**.**
Let be a probability space with increasing and let . Suppose that is a sequence of random variables that is bounded from above in absolute value by some integrable random variable , and further suppose that converges almost surely to the random variable . Then, almost surely, and in mean, as .
Define the continuity set of a function as
[TABLE]
Using Theorem 1, we can now prove the following result regarding the asymptotic behavior of the pseudo-posterior density function (3).
Theorem 2**.**
Let and be IID samples from DGPs that can be characterized by PDFs and , respectively, with corresponding parameter vectors and . Suppose that the data discrepancy converges to some , which is a function of and , almost surely as , for some . If is piecewise continuous and decreasing in and for all and any , and if
[TABLE]
then we have
[TABLE]
almost surely, as .
Proof.
Using the notation of Theorem 1, we set . Since , for any , we have the existence of a such that is integrable, since we can take . Since converges almost surely to , and is continuous at , we have with probability one by the extended continuous mapping theorem (cf. (DasGupta,, 2011, Thm. 7.10)).
Now, let be the generated by the sequence . Thus, is an increasing , which approaches . We are in a position to directly apply Theorem 1. This yields
[TABLE]
almost surely, as , where the right-hand side equals .
Notice that the left-hand side has the form
[TABLE]
and therefore , almost surely, as . Thus, the numerator of (3) converges to
[TABLE]
almost surely.
To complete the proof, it suffices to show that the denominator of (3) converges almost surely to
[TABLE]
Since and , we obtain our desired convergence via the dominated convergence theorem, because . An application of a continuous mapping theorem (cf. (DasGupta,, 2011, Thm. 7.8)) yields the almost sure convergence of the ratio between (11) and (12) to the right-hand side of (10), as . â
The following result and proof guarantees the applicability of Theorem 2 to rejection ABC procedures, and to kernel-smoothed ABC procedures, as used in Jiang et al., (2018) and Park et al., (2016), respectively.
Proposition 1**.**
The result of Theorem 2 applies to rejection ABC and importance sampling ABC, with weight functions of respective forms (8) and (9).
Proof.
For weights of form (8), we note that w\left(d,\epsilon\right)=\left\llbracket d<\epsilon\right\rrbracket is continuous in at all points, other than when . Furthermore, and is hence non-negative and bounded. Thus, under the condition that , we have the desired conclusion of Theorem 2.
For weights of form (9), we note that for fixed , is continuous and positive in . Since is uniformly bounded by 1, differentiating with respect to , we obtain , which is negative for any and . Thus, (9) constitutes a weight function and satisfies the conditions of Theorem 2. â
4.1.2 Analysis with the energy statistic
We write . From Szekely and Rizzo, (2004) we have the fact that for arbitrary ,
[TABLE]
where
[TABLE]
is the kernel of the V-statistic that is based on the function . The following result is a direct consequence of Theorem 1 of Sen, (1977), when applied to V-statistics constructed from functionals that satisfy the hypothesis of (Szekely and Rizzo,, 2013, Prop. 3).
Lemma 1**.**
Make the same assumptions regarding and as in Theorem 2. Let be a continuous and strictly negative definite function. If
[TABLE]
then converges almost surely to , as , where and are arbitrary elements of and , respectively.
We may apply the result of Lemma 1 directly to the case of in order to provide an almost sure convergence result regarding the V-statistic .
Corollary 1**.**
Make the same assumptions regarding and as in Theorem 2. If and are arbitrary elements of and , respectively, and
[TABLE]
and if , then converges almost surely to
[TABLE]
where is the characteristic function corresponding to the PDF .
Proof.
By the law of total expectation, we apply Lemma 1 by considering the two cases of (13): when and when , separately, to write
[TABLE]
where and . The first term on the right-hand side of (16) is equal to zero, since , whenever . Thus, we need only be concerned with bounding the second term.
For , , thus
[TABLE]
The condition that is thus fulfilled if , which is equivalent to
[TABLE]
by virtue of the integrability of implying the existence of
[TABLE]
since it is defined on a bounded support.
Next, by the triangle inequality,
[TABLE]
and hence
[TABLE]
Since are all pairwise independent, and and are identically distributed to and , respectively, we have
[TABLE]
which concludes the proof since is satisfied by the hypothesis and implies . â
We note that condition (14) is stronger than a direct application of condition (13), which may be preferable in some situations. However, condition (14) is somewhat more intuitive and verifiable since it is concerned with the polynomial moments of norms and does not involve the piecewise function . It is also suggested in Zygmund, (1951) that one may replace by if it is more convenient to do so. We further note that (14) is required for establishing almost sure convergence, and is stronger than what is needed to ensure convergence in probability, as is established in Szekely and Rizzo, (2004) and Gretton et al., (2012).
Combining the result of Theorem 2 with Corollary 1 and the conclusion from Proposition 1 of Szekely and Rizzo, (2013) provided in Equation (15) yields the key result below. This result justifies the use of the V-statistic estimator for the energy distance within the IS-ABC framework, and is comparable to Corollaries 1â3 of Jiang et al., (2018) regarding the large sample asymptotics of other discrepancy measurements.
Corollary 2**.**
Under the assumptions of Corollary 1. If , then the conclusion of Theorem 2 follows with
[TABLE]
almost surely, as , where and , if and only if .
4.2 Behavior as
Let be the set of probability distributions on . From (Sejdinovic et al.,, 2013, Thm. 22), we have the fact that is a metric on , where and have data generating process that are characterized by and , respectively. As such, when we take and arising from two empirical distributions with an equal number of masses (defined on and , for instance), then we obtain the fact that if and only if the two empirical distributions are the same. In other words, and are equal, in the sense that the elements of and are equal up to a permutation. Proposition 2 of Bernton et al., (2019) then provides the following result in the case when is fixed.
Proposition 2**.**
Assume that has form (8), where , and that is a continuous and exchangeable PDF. Furthermore, assume that
[TABLE]
and suppose that there exists some , where
[TABLE]
Then, for fixed , the pseudo-posterior PDF (3) converges strongly to the posterior PDF (2), as .
Let us suppose that the empirical distribution of is denoted and that each observation of is generated from a process that can be characterized by the distribution (we write the joint distribution of as ). We shall also write as the probability distribution corresponding to the PDF , and as the empirical distribution obtained from a sample with data generating process that is characterized by .
Next, we let the probability distribution corresponding to the prior and pseudo-posterior PDFs of the ES-based ABC process with rejection weights (i.e. and (3)) as and , respectively. And finally, let us denote the probability of the set with respect to the probability distribution as . In order to state our next result, we require the following assumptions.
A1
The data generating process of is such that, for every ,
[TABLE]
A2
For every ,
[TABLE]
where is a sequence of functions that is strictly decreasing in for all , and as , for fixed . Here: is a positive function that is integrable with respect to and satisfies for some , for all such that, for some , , where .
A3
There exists an and a such that, for each sufficiently small ,
[TABLE]
Upon making Assumptions A1âA3, we may apply the proof process of (Bernton et al.,, 2019, Prop. 3) directly, replacing the Wasserstein metric with the energy metric , where appropriate. Such a process yields the following result.
Proposition 3**.**
Along with A1âA3, assume that there exists a sequence , such that , and , as . Then, the ES-based ABC algorithm with , discrepancy , and rejection weights using satisfies the inequality
[TABLE]
for some , with probability going to 1 as (with respect to ).
The hypotheses of Proposition 2 are straight forward and the conclusion implies that pseudo-posterior PDF of the ES-based ABC procedure can approximate the posterior PDF, based on the likelihood of the data generating process of , to an arbitrary level of accuracy, when is made sufficiently small. This however does not mean that one should make too small in practice, as the effort required to simulate data will become more difficult and the process becomes more computationally intensive in such cases. Note that the value of is often chosen in a pragmatic way as a quantile (of a small order, usually less than 5%) of all the distances that are obtained in the ABC sample, thus deciding how many samples are kept as a fraction of the entire ABC replications. This procedure was used in the ABC algorithms of Beaumont et al., (2002), Blum, 2010b , and Jabot et al., (2013).
Next, we note that the assumptions (other than A1, which is generally true for stationary and ergodic data; cf. (Szekely and Rizzo,, 2013, Secs. 7 and 8)) and the conclusion of Proposition 3 are more complex. Due to the lack of ease by which A2 and A3 may be validated, the proposition is more useful as an existence result regarding what can be expected in theory, with respect to how quickly the ES-based ABC algorithm converges in , rather than providing any practical guidance. A suggestion by Bernton et al., (2019) is that one may potentially apply the theory of Fournier and Guillin, (2015) and Weed et al., (2019) in order to validate assumption A2.
Under further assumptions, the concentration with respect to the discrepancy in distributions can be transferred to a concentration result, with respect to parameter vector in the space (cf. (Bernton et al.,, 2019, Cor. 1)).
4.3 Illustration on a simple example
We use to denote that the random variable has probability law . Furthermore, we denote the normal law by , where states that the DGP of is multivariate normal distribution with mean vector and covariance matrix .
For illustrating the theoretical results, we investigate the pseudo-posterior limit on a simple univariate Gaussian location model (with known variance ) with conjugate Gaussian prior (with variance fixed). We have IID observations , and IID replicates . The posterior is , where
[TABLE]
In this simple model, the limiting data discrepancy takes the form (up to a proportionality constant) of for the energy distance and KullbackâLeibler divergence, and for the MMD and the (second order) Wasserstein distance.
Theorem 2 establishes that the large and limit of the pseudo-posterior is the distribution that we denote here by . For illustrative purposes, let us focus on the case when , and consider rejection ABC with w\left(d,\epsilon\right)=\left\llbracket d<\epsilon\right\rrbracket and IS-ABC with . The limiting pseudo-posterior can then be obtained in closed-form as
[TABLE]
a truncated Gaussian for rejection ABC and
[TABLE]
for IS-ABC, where and . See Figure 1 for an illustration, for various values of .
5 Illustrations
We illustrate the use of the ES on some standard models. The standard rejection ABC algorithm is employed (that is, we use Algorithm 1 with weight function of form (8)) for constructing estimators (5). The proposed ES is compared to the KullbackâLeibler divergence (KL), the Wasserstein distance (WA), and the maximum mean discrepancy (MMD). Here, the ES is applied using the Euclidean metric , the Wasserstein distance using the exponent and the approximation by the swapping distance (Bernton et al.,, 2019) and the MMD using a Gaussian kernel . The Gaussian kernel is commonly used in the MMD literature, and was also considered for ABC in Park et al., (2016) and Jiang et al., (2018). Details regarding the use of the KullbackâLeibler divergence as a discrepancy function for ABC algorithms can be found in Sec. 2 of Jiang et al., (2018). With respect to the theoretical results of Section 4, the chosen examples can be shown to be sufficiently regular as to validate the hypotheses of Corollary 2 and Proposition 2. However, we believe that it would be difficult to validate Assumptions A2 and A3 of Proposition 3, without further theoretical development.
We consider examples explored in (Jiang et al.,, 2018, Sec. 4.1). For each illustration below, we sample synthetic data of the same size as the observed data size, , whose value is specified for each model below. The ABC procedure is sensitive to the choice of the prior; we follow the benchmark examples of Jiang et al., (2018) by employing the same uniform priors, as specified in each example. The number of ABC iterations in Algorithm 1 is set to . The tuning parameter is set so that only the smallest discrepancies are kept to form ABC posterior sample. We postpone the discussion of the results of our simulation experiments to Section 5.5
The experiments were implemented in R, using in particular the winference package (Bernton et al.,, 2019) and the FNN package (Beygelzimer et al.,, 2013). The KullbackâLeibler divergence between two PDFs is computed within the -nearest neighbor framework (Boltz et al.,, 2009). Moreover, the -d trees is adopted for implementing the nearest neighbor search, which is the same as the method of Jiang et al., (2018). For estimating the -Wasserstein distance between two multivariate empirical measures, we propose to employ the swapping algorithm (Puccetti,, 2017), which is simple to implement, and is more accurate and less computationally expensive than other algorithms commonly used in the literature (Bernton et al.,, 2019). Regarding the MMD, the same unbiased U-statistic estimator is adopted as given in Jiang et al., (2018) and Park et al., (2016). For reproduction of the the experimental results, the original source code can be accessed at https://github.com/hiendn/Energy_Statistics_ABC.
5.1 Bivariate Gaussian mixture model
Let be a sequence of IID random variables, such that each has a mixture of bivariate Gaussian probability law
[TABLE]
with known covariance matrices
[TABLE]
We aim to estimate the generative parameters consisting of the mixing probability and the population means and . We denote the uniform law, in the interval , for , by . The priors on the model parameters are uniform; that is, , and . We perform ABC using observations, sampled from model (19) with , and . A kernel density estimate (KDE) of the ABC posterior distribution (bivariate marginals of and ) is presented in Figure 2.
5.2 Moving-average model of order 2
The moving-average model of order , MA(), is a stochastic process defined as
[TABLE]
with being a sequence of unobserved noise error terms. Jiang et al., (2018) used a MA model for their benchmarking; namely . Each observation corresponds to a time series of length . Here, we use the same model as that proposed in Jiang et al., (2018), where follows the Student- distribution with degrees of freedom, and . The priors on the model parameters and are taken to be uniform, that is, and . We performed ABC using samples generated from a model with the true parameter values . A KDE of the ABC joint posterior distribution of is displayed in Figure 3.
5.3 Bivariate beta model
The bivariate beta model proposed by Crackel and Flegal, (2017) is defined with five positive parameters by letting
[TABLE]
where , for , and setting and . The bivariate random variable has marginal laws and . We performed ABC using samples of size , which are generated from a DGP with true parameter values . The prior on each of the model parameters is taken to be independent . KDEs of the marginal ABC posterior distributions of parameters and are displayed in Figure 4.
5.4 Multivariate g-and-k distribution
A univariate -and- distribution can be defined via its quantile function (Drovandi and Pettitt,, 2011):
[TABLE]
where parameters respectively relate to location, scale, skewness, and kurtosis. Here, is the th quantile of the standard normal distribution. Given a set of parameters , it is easy to simulate observations of a DGP with quantile function (21), by generating a sequence of IID sample , where , for .
A so-called -dimensional -and- DGP can instead be defined by applying the quantile function (21) to each of the elements of a multivariate normal vector , where is a covariance matrix. In our experiment, we use a 5-dimensional -and- model with the same covariance matrix and parameter values for as that considered by Jiang et al., (2018). That is, we generate samples of size from a -and- DGP with the true parameter values and the covariance matrix
[TABLE]
where . The prior on the model parameters is taken to be independent , while is independently assigned a prior. KDEs of the marginal ABC posterior distributions of parameters and are displayed in Figure 5.
5.5 Discussion of the results and performance
For each of the four experiments and each parameter, we computed the posterior mean , posterior median , mean absolute error and mean squared error defined by
[TABLE]
where denotes the pseudo-posterior sample and denotes the true parameter. Here since and is chosen as to retain of the samples. Each experiment was replicated ten times by keeping the same fixed (true) values for the parameters and by sampling new observed data each of the ten times. The estimated quantities , , and errors MAE and were then averaged over the ten replications, and are reported along with standard deviations in columns associated with each estimator and true values for each parameter in Tables 1, 2, 3 and 5.
Upon inspection, Tables 1, 2, 3 and 5 showed some advantage in performance from WA on the bivariate Gaussian mixtures, some advantage from the MMD on the bivariate beta model, and some advantage from the ES on the -and- model, while multiple methods are required to make the best inference in the case of the MA experiment. When we further take into account the standard deviations of the estimators, we observe that all four data discrepancy measures essentially perform comparatively well across the four experimental models. Thus, we may conclude that there is no universally best performing discrepancy measure. Some considerations are therefore necessary when choosing between discrepancies. The first point of consideration is whether the data are random variables arising from continuous or discrete measures. In the case that the data arises from a discrete measure, the KL discrepancy measure is not applicable, since it is not defined on a set of measure greater than zero. Another consideration regarding the choice of discrepancy measures is the computational complexity of each discrepancy measure, as is summarized in Table 4.
From Table 4, we firstly note that in the case of univariate data, all methods have the same computational complexity, as all of the discrepancy measures amount to comparisons between the order statistics of the observed and simulated data. Computational complexity becomes a greater separating criterion when considering the multivariate setting. In the multivariate case, the KL divergence is clearly faster than the other methods, but as mentioned before, is not applicable for discrete data. The ES and MMD methods share the same order of complexity, , due to their theoretical equivalence (cf. Sejdinovic et al., (2013)). It is notable that, in general, the computational complexity of the WA discrepancy is of order , which greater than that of the ES and MMD discrepancies, and is thus a significantly slower method when and get large. However, in our numerical results, we have used the swapping distance approximation of the WA method, as was considered in Bernton et al., (2019). Although this approximation is faster than the exact WA discrepancy, it does not converge to the same value, in general, and thus theoretical results regarding the WA discrepancy cannot be directly applied to the approximation (although some theoretical statements are still available). Thus, there is a trade-off regarding theoretical outcomes when using the swap distance approximation.
We note that in the case when the MMD discrepancy measure is estimated by the V-statistic estimator, much of our theoretical results from Section 4, are applicable with minor modifications, due to the results of Sejdinovic et al., (2013). Thus, the choice between the ES and the MMD method comes down to a preference for the use of kernels or metrics. A consideration regarding the choice of the MMD discrepancy versus the ES discrepancy is that, to the best of our knowledge, a comparable result to (6) does not exist for any common kernel choice.
As an alternative to choosing one of the assessed discrepancy measures, one may also consider some kind of averaging over the results of the different discrepancy measures. We have not committed to an investigation of such methodologies and leave it as a future research direction.
Running times (on a MacBook Pro 3,1 GHz) for the ES, KL, MMD and WA distance computations for ABC replications, in the four models considered in the simulations, for varying sample sizes , and with , are reported in Figure 6. ES is uniformly much faster than the other approaches for small samples sizes, up to the value of , where it is performing as fast as KL. For sample sizes larger than , KL is fastest. Overall, MMD and WA are slower than ES and KL.
6 Conclusion
We have introduced a novel importance-sampling ABC algorithm that is based on the so-called two-sample energy statistic. Along with other data discrepancy measures that view data sets as empirical measures, such as the KullbackâLeibler divergence, the Wasserstein distance and maximum mean discrepancies, our proposed approach bypasses the cumbersome use of summary statistics.
We have shown that the V-statistic estimator of the ES is consistent under mild moment conditions. Furthermore, we have established a new asymptotic result for cases when the observed sample and simulated sample sizes increasing to infinity, that shows a kind of consistency of the pseudo-posterior in the infinite data scenario. This is in concordance with previous results in such cases (see for instance Jiang et al.,, 2018, Bernton et al.,, 2019) and extends upon existing theory for the application in the general IS-ABC framework. That is, we largely extend the main result of Jiang et al., (2018), regarding the large sample properties of the pseudo-posterior PDF, to the IS-ABC cases that are considered in Karabatsos and Leisen, (2018) and Park et al., (2016). Thus, we provide further theoretical justification for the usage of such algorithms.
Illustrations of the proposed ES-ABC algorithm on four experimental models have shown that it performs comparatively well to alternative discrepancy measures.
Considering computing costs, the ES, KL, MMD, and WA estimators in univariate settings are all equal in terms of order of complexity, with a linearithmic computational time of (see Huo and SzÊkely, (2016), Chaudhuri and Hu, (2019), regarding the complexity of the ES and MMD estimators). In multivariate settings, KL complexity is unchanged; ES and MMD have quadratic time , while the Wasserstein distance has complexity . The latter can be reduced to quadratic complexity if one is targetting the swapping distance, an approximation of the actual Wasserstein distance (Bernton et al.,, 2019). We note that linear time estimators are also available for the MMD and the ES, if one is willing to forgo precision in the estimates (see Gretton et al., (2012)). See Table 4 for a summary.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Barber, (2012) Barber, D. (2012). Bayesian Reasoning and Machine Learning . Cambridge University Press, Cambridge.
- 2Barber et al., (2015) Barber, S., Voss, J., and Webster, M. (2015). The rate of convergence for approximate Bayesian computation. Electronic Journal of Statistics , 9(1):80â105.
- 3Baringhaus and Franz, (2004) Baringhaus, L. and Franz, C. (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis , 88:190â206.
- 4Beaumont et al., (2002) Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate bayesian computation in population genetics. Genetics , 162:2025â2035.
- 5Bernton et al., (2019) Bernton, E., Jacob, P. E., Gerber, M., and Robert, C. P. (2019). Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 81:235â269.
- 6Beygelzimer et al., (2013) Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. (2013). FNN: Fast Nearest Neighbor Search miller and Applications . R package version 1.1.3.
- 7Biau et al., (2015) Biau, G., CĂŠrou, F., and Guyader, A. (2015). New insights into approximate Bayesian computation. Annales de lâIHP ProbabilitĂŠs et statistiques , 51(1):376â403.
- 8(8) Blum, M. G. (2010 a). Approximate Bayesian computation: a nonparametric perspective. Journal of the American Statistical Association , 105(491):1178â1187.
