Approximate Bayesian computation via the energy statistic

Hien D. Nguyen; Julyan Arbel; Hongliang Lü; Florence Forbes

arXiv:1905.05884·stat.ME·July 14, 2020·IEEE Access

TL;DR

This paper introduces an energy statistic-based ABC algorithm that improves likelihood-free Bayesian inference by providing asymptotic guarantees and demonstrating competitive performance across various models.

Contribution

It proposes a novel ABC method using the energy statistic, with new asymptotic results and a consistent estimator, enhancing the robustness of likelihood-free inference.

Findings

01 The energy statistic-based ABC converges to the true pseudo-posterior.
02 The method performs well compared to alternative discrepancy measures.
03 Asymptotic results hold for increasing sample sizes in both observed and simulated data.

Abstract

Approximate Bayesian computation (ABC) has become an essential part of the Bayesian toolbox for addressing problems in which the likelihood is prohibitively expensive or entirely unknown, making it intractable. ABC defines a pseudo-posterior by comparing observed data with simulated data, traditionally based on some summary statistics, the elicitation of which is regarded as a key difficulty. Recently, using data discrepancy measures has been proposed in order to bypass the construction of summary statistics. Here we propose to use the importance-sampling ABC (IS-ABC) algorithm relying on the so-called two-sample energy statistic. We establish a new asymptotic result for the case where both the observed sample size and the simulated data sample size increase to infinity, which highlights to what extent the data discrepancy measure impacts the asymptotic pseudo-posterior. The result…

Tables (5)

Table 1: Estimation performance for bivariate Gaussian mixtures (Section 5.1). The best result in each column is highlighted in boldface.

		${\hat{θ}}_{mean}$	$σ ({\hat{θ}}_{mean})$	${\hat{θ}}_{med}$	$σ ({\hat{θ}}_{med})$	MAE	$σ (MAE)$	RMSE	$σ (RMSE)$
$μ_{00} = 0.7$	ES	0.594	0.045	0.607	0.063	0.215	0.030	0.283	0.055
	KL	0.648	0.039	0.666	0.048	0.165	0.016	0.205	0.026
	WA	0.675	0.035	0.682	0.043	0.152	0.020	0.181	0.021
	MMD	0.564	0.079	0.582	0.076	0.234	0.054	0.311	0.101
$μ_{01} = 0.7$	ES	0.587	0.063	0.613	0.059	0.215	0.038	0.282	0.069
	KL	0.651	0.042	0.667	0.061	0.169	0.022	0.210	0.027
	WA	0.655	0.050	0.669	0.047	0.152	0.015	0.187	0.019
	MMD	0.559	0.076	0.598	0.075	0.235	0.049	0.313	0.092
$μ_{10} = - 0.7$	ES	-0.699	0.046	-0.716	0.040	1.401	0.043	1.412	0.039
	KL	-0.709	0.029	-0.712	0.035	1.409	0.029	1.415	0.029
	WA	-0.699	0.030	-0.704	0.037	1.399	0.030	1.404	0.030
	MMD	-0.709	0.054	-0.731	0.036	1.411	0.051	1.422	0.038
$μ_{11} = - 0.7$	ES	-0.696	0.058	-0.712	0.043	1.396	0.058	1.407	0.049
	KL	-0.711	0.047	-0.704	0.057	1.411	0.047	1.416	0.047
	WA	-0.695	0.043	-0.695	0.053	1.395	0.043	1.401	0.043
	MMD	-0.711	0.066	-0.726	0.046	1.411	0.066	1.424	0.052

Table 2: Estimation performance for the MA(2) model (Section 5.2). The best result in each column is highlighted in boldface.

		${\hat{θ}}_{mean}$	$σ ({\hat{θ}}_{mean})$	${\hat{θ}}_{med}$	$σ ({\hat{θ}}_{med})$	MAE	$σ (MAE)$	RMSE	$σ (RMSE)$
$θ_{1} = 0.6$	ES	0.569	0.042	0.570	0.045	0.083	0.015	0.100	0.017
	KL	0.664	0.028	0.658	0.031	0.106	0.017	0.132	0.019
	WA	0.509	0.033	0.505	0.038	0.112	0.022	0.133	0.026
	MMD	0.583	0.044	0.586	0.048	0.079	0.013	0.096	0.015
$θ_{2} = 0.2$	ES	0.215	0.035	0.219	0.035	0.111	0.015	0.135	0.019
	KL	0.274	0.023	0.280	0.027	0.110	0.014	0.134	0.014
	WA	0.205	0.025	0.207	0.030	0.090	0.029	0.112	0.034
	MMD	0.220	0.037	0.220	0.036	0.108	0.010	0.132	0.012

Table 3: Estimation performance for the bivariate beta model (Section 5.3). The best result in each column is highlighted in boldface.

		${\hat{θ}}_{mean}$	$σ ({\hat{θ}}_{mean})$	${\hat{θ}}_{med}$	$σ ({\hat{θ}}_{med})$	MAE	$σ (MAE)$	RMSE	$σ (RMSE)$
$θ_{1} = 1.0$	ES	1.299	0.223	1.189	0.264	0.713	0.130	0.885	0.165
	KL	1.389	0.190	1.333	0.165	0.696	0.151	0.877	0.205
	WA	1.286	0.220	1.193	0.265	0.672	0.128	0.828	0.153
	MMD	1.229	0.188	1.143	0.241	0.676	0.092	0.836	0.121
$θ_{2} = 1.0$	ES	1.362	0.185	1.290	0.237	0.716	0.118	0.904	0.131
	KL	1.235	0.152	1.153	0.170	0.588	0.070	0.745	0.097
	WA	1.292	0.196	1.240	0.241	0.657	0.114	0.817	0.139
	MMD	1.268	0.173	1.170	0.171	0.669	0.103	0.841	0.131
$θ_{3} = 1.0$	ES	1.170	0.132	1.183	0.157	0.459	0.045	0.552	0.049
	KL	1.083	0.100	1.077	0.088	0.394	0.034	0.496	0.045
	WA	1.229	0.118	1.216	0.132	0.426	0.054	0.521	0.059
	MMD	1.181	0.116	1.182	0.143	0.456	0.051	0.548	0.061
$θ_{4} = 1.0$	ES	1.128	0.112	1.113	0.138	0.435	0.032	0.534	0.045
	KL	1.133	0.111	1.086	0.135	0.390	0.038	0.498	0.051
	WA	1.218	0.110	1.196	0.108	0.409	0.049	0.514	0.066
	MMD	1.150	0.098	1.133	0.130	0.423	0.041	0.518	0.049
$θ_{5} = 1.0$	ES	1.343	0.096	1.360	0.104	0.428	0.052	0.514	0.059
	KL	1.300	0.087	1.250	0.065	0.384	0.040	0.491	0.061
	WA	1.300	0.101	1.298	0.105	0.370	0.058	0.446	0.066
	MMD	1.258	0.115	1.232	0.120	0.375	0.055	0.454	0.063

Table 4: Computational complexities. See the discussion in Section 6.

	Complexity	References
Univariate (all methods)	$\mathcal{O}((n + m) \log (n + m))$	Jiang et al. (2018), Bernton et al. (2019), Huo and Székely (2016), Chaudhuri and Hu (2019)
KL	$\mathcal{O}((n + m) \log (n + m))$	Jiang et al. (2018)
Multivariate ES/MMD, WA (approx.)	$\mathcal{O}({(n + m)}^{2})$	Jiang et al. (2018), Bernton et al. (2019)
Multivariate WA	$\mathcal{O}({(n + m)}^{5 / 2} \log (n + m))$	Bernton et al. (2019)

Table 5: Estimation performance for the $g$-and-$k$ distribution (Section 5.4). The best result in each column is highlighted in boldface.

		${\hat{θ}}_{mean}$	$σ ({\hat{θ}}_{mean})$	${\hat{θ}}_{med}$	$σ ({\hat{θ}}_{med})$	MAE	$σ (MAE)$	RMSE	$σ (RMSE)$
$A = 3.0$	ES	3.024	0.044	3.009	0.047	0.133	0.016	0.170	0.018
	KL	2.955	0.030	2.948	0.033	0.105	0.013	0.128	0.013
	WA	3.043	0.045	3.052	0.067	0.232	0.020	0.277	0.020
	MMD	3.081	0.061	3.062	0.065	0.177	0.029	0.221	0.036
$B = 1.0$	ES	1.046	0.062	1.027	0.079	0.268	0.024	0.322	0.029
	KL	0.918	0.071	0.885	0.068	0.313	0.026	0.375	0.029
	WA	0.894	0.127	0.869	0.136	0.277	0.044	0.334	0.045
	MMD	0.899	0.069	0.855	0.079	0.374	0.029	0.440	0.030
$g = 2.0$	ES	2.289	0.101	2.264	0.210	0.872	0.098	1.026	0.091
	KL	2.993	0.080	3.046	0.121	1.043	0.070	1.193	0.066
	WA	2.581	0.101	2.599	0.147	0.858	0.078	1.025	0.075
	MMD	2.184	0.128	2.227	0.190	0.904	0.103	1.052	0.100
$k = 0.5$	ES	0.476	0.046	0.444	0.067	0.225	0.014	0.270	0.015
	KL	0.550	0.059	0.498	0.064	0.252	0.029	0.317	0.045
	WA	0.544	0.095	0.526	0.094	0.189	0.035	0.238	0.046
	MMD	0.691	0.056	0.621	0.072	0.380	0.041	0.502	0.070
$ρ = - 0.3$	ES	-0.163	0.047	-0.178	0.069	0.197	0.032	0.246	0.034
	KL	-0.291	0.034	-0.324	0.037	0.117	0.014	0.144	0.020
	WA	-0.288	0.026	-0.314	0.035	0.125	0.016	0.152	0.020
	MMD	-0.194	0.047	-0.210	0.063	0.174	0.030	0.218	0.035

Equations (119)

\pi\left(\bm{\theta}|\bm{x}\right)=\frac{f\left(\bm{x}|\bm{\theta}\right)\pi\left(\bm{\theta}\right)}{c\left(\bm{x}\right)},

c\left(\bm{x}\right)=\int_{\mathbb{T}}f\left(\bm{x}|\bm{\theta}\right)\pi\left(\bm{\theta}\right){\rm d}\bm{\theta}.

f\left(\mathbf{x}_{n}|\bm{\theta}\right)=\prod_{i=1}^{n}f\left(\bm{x}_{i}|\bm{\theta}\right),

\pi\left(\bm{\theta}|\mathbf{x}_{n}\right)=\frac{f\left(\mathbf{x}_{n}|\bm{\theta}\right)\pi\left(\bm{\theta}\right)}{c\left(\mathbf{x}_{n}\right)},

f\left(\mathbf{y}_{m}|\bm{\theta}\right)=\prod_{i=1}^{m}f\left(\bm{y}_{i}|\bm{\theta}\right).

\pi_{m,\epsilon}\left(\bm{\theta}|\mathbf{x}_{n}\right)=\frac{\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right)}{c_{m,\epsilon}\left(\mathbf{x}_{n}\right)}

L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right)=\int_{\mathbb{X}^{m}}w\left(\mathcal{D}\left(\mathbf{x}_{n},\mathbf{y}_{m}\right),\epsilon\right)f\left(\mathbf{y}_{m}|\bm{\theta}\right){\rm d}\mathbf{y}_{m}

c_{m,\epsilon}\left(\mathbf{x}_{n}\right)=\int_{\mathbb{T}}\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right){\rm d}\bm{\theta}

\mathbb{E}\left[g\left(\bm{\theta}\right)|\mathbf{x}_{n}\right]\approx\frac{\int_{\mathbb{T}}g\left(\bm{\theta}\right)\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right){\rm d}\bm{\theta}}{c_{m,\epsilon}\left(\mathbf{x}_{n}\right)},

\mathbb{M}\left[g\left(\bm{\theta}\right)|\mathbf{x}_{n}\right]=\frac{\sum_{k=1}^{N}g\left(\bm{\theta}_{k}\right)w\left(\mathcal{D}\left(\mathbf{x}_{n},\mathbf{Y}_{m,k}\right),\epsilon\right)}{\sum_{k=1}^{N}w\left(\mathcal{D}\left(\mathbf{x}_{n},\mathbf{Y}_{m,k}\right),\epsilon\right)}.

\mathcal{E}_{\delta}\left(\bm{X},\bm{Y}\right)=2\mathbb{E}\left[\delta\left(\bm{X},\bm{Y}\right)\right]-\mathbb{E}\left[\delta\left(\bm{X},\bm{X}^{\prime}\right)\right]-\mathbb{E}\left[\delta\left(\bm{Y},\bm{Y}^{\prime}\right)\right],

\mathcal{E}\left(\bm{X},\bm{Y}\right)=\frac{\Gamma\left(\frac{d+1}{2}\right)}{\pi^{\left(d+1\right)/2}}\int_{\mathbb{R}^{d}}\frac{\left|\varphi_{X}\left(\bm{t}\right)-\varphi_{Y}\left(\bm{t}\right)\right|^{2}}{\left\|\bm{t}\right\|_{2}^{d+1}}{\rm d}\bm{t},

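\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)=\frac{2}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m}\delta\left(\bm{X}_{i},\bm{Y}_{j}\right)-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta\left(\bm{X}_{i},\bm{X}_{j}\right)-\frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}\delta\left(\bm{Y}_{i},\bm{Y}_{j}\right)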

\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)=\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right),

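\text{MMD}_{\chi}^{2}\left(\bm{X},\bm{Y}\right)=\mathbb{E}\left[\chi\left(\bm{X},\bm{X}^{\prime}\right)\right]+\mathbb{E}\left[\chi\left(\bm{Y},\bm{Y}^{\prime}\right)\right]-2\mathbb{E}\left[\chi\left(\bm{X},\bm{Y}\right)\right],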

w\left(d,\epsilon\right)=\left\llbracket d<\epsilon\right\rrbracket\text{,}

w\left(d,\epsilon\right)=\exp\left(-d^{q}/\epsilon\right),

C\left(w\right)=\left\{d:w\text{ is continuous at }d\right\}.

\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)\in C\left(w\left(\cdot,\epsilon\right)\right),

\pi_{m,\epsilon}\left(\bm{\theta}|\mathbf{X}_{n}\right)\rightarrow\frac{\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)}{\int\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right){\rm d}\bm{\theta}},

\mathbb{E}\left[w\left(\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right),\epsilon\right)|\mathcal{F}_{n}\right]\rightarrow\mathbb{E}\left[w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)|\mathcal{F}_{\infty}\right]\text{,}

\mathbb{E}\left[w\left(\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right),\epsilon\right)|\mathcal{F}_{n}\right]=L_{m,\epsilon}\left(\mathbf{X}_{n}|\bm{\theta}\right)

\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right),

\int_{\mathbb{T}}\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right){\rm d}\bm{\theta}.

\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)=\sum_{i_{1}=1}^{n}\sum_{i_{2}=1}^{n}\sum_{j_{1}=1}^{m}\sum_{j_{2}=1}^{m}\frac{\kappa_{\delta}\left(\bm{X}_{i_{1}},\bm{X}_{i_{2}};\bm{Y}_{j_{1}},\bm{Y}_{j_{2}}\right)}{m^{2}n^{2}},

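\kappa_{\delta}\left(\bm{x}_{i_{1}},\bm{x}_{i_{2}};\bm{y}_{j_{1}},\bm{y}_{j_{2}}\right)=\delta\left(\bm{x}_{i_{1}},\bm{y}_{j_{1}}\right)+\delta\left(\bm{x}_{i_{2}},\bm{y}_{j_{2}}\right)-\delta\left(\bm{x}_{i_{1}},\bm{x}_{i_{2}}\right)-\delta\left(\bm{y}_{j_{1}},\bm{y}_{j_{2}}\right)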

\mathbb{E}\left(\left|\kappa_{\delta}\left(\bm{X}_{1},\bm{X}_{2};\bm{Y}_{1},\bm{Y}_{2}\right)\right|\log^{+}\left|\kappa_{\delta}\left(\bm{X}_{1},\bm{X}_{2};\bm{Y}_{1},\bm{Y}_{2}\right)\right|\right)<\infty,

\mathbb{E}\left(\left\|\bm{X}\right\|_{2}^{2}\right)+\mathbb{E}\left(\left\|\bm{Y}\right\|_{2}^{2}\right)<\infty,

Code & Models

Repositories

eth-cscs/abcpy
pytorch

Full text

Approximate Bayesian computation via the energy statistic

HIEN D. NGUYEN1, JULYAN ARBEL2, HONGLIANG LÜ2, FLORENCE FORBES2

1Department of Mathematics and Statistics, La Trobe University, Bundoora Melbourne 3066, Victoria Australia. (e-mail: [email protected])

2Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

Abstract

Approximate Bayesian computation (ABC) has become an essential part of the Bayesian toolbox for addressing problems in which the likelihood is prohibitively expensive or entirely unknown, making it intractable. ABC defines a pseudo-posterior by comparing observed data with simulated data, traditionally based on some summary statistics, the elicitation of which is regarded as a key difficulty. Recently, using data discrepancy measures has been proposed in order to bypass the construction of summary statistics. Here we propose to use the importance-sampling ABC (IS-ABC) algorithm relying on the so-called two-sample energy statistic. We establish a new asymptotic result for the case where both the observed sample size and the simulated data sample size increase to infinity, which highlights to what extent the data discrepancy measure impacts the asymptotic pseudo-posterior. The result holds in the broad setting of IS-ABC methodologies, thus generalizing previous results that have been established only for rejection ABC algorithms. Furthermore, we propose a consistent V-statistic estimator of the energy statistic, under which we show that the large sample result holds, and prove that the rejection ABC algorithm, based on the energy statistic, generates pseudo-posterior distributions that converge to the correct limits, when implemented with rejection thresholds that converge to zero, in the finite sample setting. Our proposed energy statistic-based ABC algorithm is demonstrated on a variety of models, including a Gaussian mixture, a moving-average model of order two, a bivariate beta and a multivariate $g$-and-$k$ distribution. We find that our proposed method compares well with alternative discrepancy measures.

1 Introduction

In recent years, Bayesian inference has become a popular paradigm for machine learning and statistical analysis. Good introductions and references to the primary methods and philosophies of Bayesian inference can be found in texts such as Press (2003), Ghosh et al. (2006), Koch (2007), Koop et al. (2007), Robert (2007), Barber (2012), and Murphy (2012). When conducting parametric Bayesian inference, we observe some realization $\bm{x}$ of the data $\bm{X}\in\mathbb{X}$ that are generated from some data generating process (DGP), which can be characterized by a parametric likelihood, given by a probability density function (PDF) $f\left(\bm{x}|\bm{\theta}\right)$, determined entirely via the parameter vector $\bm{\theta}\in\mathbb{T}$. Using expert knowledge, or based on computational considerations such as conjugacy, we endow the parameter $\bm{\theta}$ with some prior PDF $\pi\left(\bm{\theta}\right)$. The goal of Bayesian inference is then to characterize the posterior distribution

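$$\pi\left(\bm{\theta}|\bm{x}\right)=\frac{f\left(\bm{x}|\bm{\theta}\right)\pi\left(\bm{\theta}\right)}{c\left(\bm{x}\right)},\tag{1}$$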

where the prior predictive distribution $c\left(\bm{x}\right)$ is defined by

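$$c\left(\bm{x}\right)=\int_{\mathbb{T}}f\left(\bm{x}|\bm{\theta}\right)\pi\left(\bm{\theta}\right){\rm d}\bm{\theta}.$$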

In very simple cases, such as when the prior PDF is conjugate to the likelihood (cf. Sec. 3.3 of Robert (2007)), the posterior PDF (1) can be expressed explicitly. In the case of more complex but still tractable pairs of likelihood and prior PDFs, one can sample from (1) via a variety of Monte Carlo methods, such as those reported in Ch. 6 of Press (2003).

In cases where the likelihood function is known but not tractable, or when the likelihood function has entirely unknown form, one cannot exactly sample from (1) in an inexpensive manner, or at all. In such situations, a sample from an approximation of (1) may suffice in order to conduct the user’s desired inference. Such a sample can be drawn via the method of approximate Bayesian computation (ABC).

It is generally agreed that the ABC paradigm originated from the works of Rubin (1984) and Pritchard et al. (1999); see Tavaré (2019) for details. Stemming from these initial works, there are now numerous variants of ABC methods. Some good reviews of the current ABC literature can be found in the expositions of Marin et al. (2012), Voss (2014), Lintusaari et al. (2017), and Karabatsos and Leisen (2018). The volume of Sisson et al. (2019) provides a comprehensive treatment of ABC methodologies.

The core philosophy of ABC is to define a pseudo-posterior by comparing data with plausibly simulated replicates. The comparison is traditionally based on some summary statistics, the choice of which is regarded as a key challenge of the approach.

In recent years, data discrepancy measures bypassing the construction of summary statistics have been proposed by viewing data sets as empirical measures. Recent examples of such an approach include the use of the maximum mean discrepancy (MMD) (Park et al., 2016), the Kullback–Leibler divergence (Jiang et al., 2018), and the Wasserstein distance (Bernton et al., 2019). Furthermore, Jiang et al. (2018) also considered the use of the classification accuracy method of Gutmann et al. (2018), and the indirect inference method of Drovandi et al. (2015), in the data discrepancy context.

In this article, we build upon the discrepancy measurement approach of Jiang et al. (2018), via the importance sampling ABC (IS-ABC) approach, which makes use of a weight function; see e.g. Karabatsos and Leisen (2018). In particular, we report on a class of ABC algorithms that utilize the two-sample energy statistic (ES) of Szekely and Rizzo (2004) (see also Baringhaus and Franz (2004), Szekely and Rizzo (2013, 2017), and Mak and Joseph (2018)). Our approach is related to the MMD ABC algorithms that were implemented in Park et al. (2016), Jiang et al. (2018), and Bernton et al. (2019). The MMD is a discrepancy measurement that is closely related to the ES; cf. Sejdinovic et al. (2013).

We establish new asymptotic results that have not been proved in these previous papers. In the IS-ABC setting and in the regime where both the observation sample size and the simulated data sample size increase to infinity, our theoretical result highlights how the data discrepancy measure impacts the asymptotic pseudo-posterior. More specifically, we make the assumption that the data discrepancy measure converges to some asymptotic value $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)$, where $\bm{\theta}_{0}$ stands for the ‘true’ parameter value associated with the DGP that generates observations $\bm{X}$. We then show that the pseudo-posterior distribution converges almost surely to a distribution depending on the prior $\pi$ and on the limiting value $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)$. In addition to our asymptotic results regarding large sample scenarios, we also provide corollaries regarding the performance of our ES-based ABC method, due to the general finite sample theoretical results of Bernton et al. (2019). Our asymptotic results provide useful approximations and guarantees for the practical application of our method.

The last decade has seen active development in research on the asymptotic properties of ABC. Early works revolved around the impact of the acceptance threshold on the ABC bias and the Monte Carlo error (Blum, 2010a; Barber et al., 2015; Biau et al., 2015), and on the choice of summary statistics (Blum, 2010a; Fearnhead and Prangle, 2012; Prangle et al., 2014). Further works focused on large sample size properties such as consistency for model choice (Marin et al., 2014), asymptotic efficiency (Li and Fearnhead, 2018), posterior consistency, and contraction rates (Frazier et al., 2018). It is among these results that our article fits. Although devised in settings where likelihoods are assumed intractable, ABC can also be cast in the setting of robustness with respect to misspecification (Frazier et al., 2020). In particular, the ABC posterior distribution can be viewed as a special case of a coarsened posterior distribution (Miller and Dunson, 2018).

The remainder of the article proceeds as follows. In Section 2, we introduce the general IS-ABC framework. In Section 3, we introduce the two-sample ES and demonstrate how it can be incorporated into the IS-ABC framework. Theoretical results regarding the IS-ABC framework and the two-sample ES are presented in Section 4. Illustrations of the IS-ABC framework are presented in Section 5. Conclusions are drawn in Section 6.

2 Importance sampling ABC

Assume that we observe $n$ independent and identically distributed (IID) replicates of $\bm{X}$ from some DGP, which we put into $\mathbf{X}_{n}=\left\{\bm{X}_{i}\right\}_{i=1}^{n}$ . We suppose that the DGP that generates $\bm{X}$ is dependent on some parameter vector $\bm{\theta}$ from space $\mathbb{T}$ , which is random and has prior PDF $\pi\left(\bm{\theta}\right)$ .

Denote $f\left(\bm{x}|\bm{\theta}\right)$ to be the PDF of $\bm{X}$ , given $\bm{\theta}$ , and write

$$f\left(\mathbf{x}_{n}|\bm{\theta}\right)=\prod_{i=1}^{n}f\left(\bm{x}_{i}|\bm{\theta}\right),$$

where $\mathbf{x}_{n}$ is a realization of $\mathbf{X}_{n}$ , and each $\bm{x}_{i}$ is a realization of $\bm{X}_{i}$ ( $i\in\left[n\right]=\left\{1,\dots,n\right\}$ ).

If $f\left(\mathbf{x}_{n}|\bm{\theta}\right)$ were known, then we could use (1) to write the posterior PDF

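$$\pi\left(\bm{\theta}|\mathbf{x}_{n}\right)=\frac{f\left(\mathbf{x}_{n}|\bm{\theta}\right)\pi\left(\bm{\theta}\right)}{c\left(\mathbf{x}_{n}\right)},\tag{2}$$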

where $c\left(\mathbf{x}_{n}\right)=\int_{\mathbb{T}}f\left(\mathbf{x}_{n}|\bm{\theta}\right)\pi\left(\bm{\theta}\right){\rm d}\bm{\theta}$ is a constant that makes $\int_{\mathbb{T}}\pi\left(\bm{\theta}|\mathbf{x}_{n}\right){\rm d}\bm{\theta}=1$ . When evaluating $f\left(\bm{x}|\bm{\theta}\right)$ is prohibitive and ABC is required, then operating with $f\left(\mathbf{x}_{n}|\bm{\theta}\right)$ is similarly difficult. We suppose that given any $\bm{\theta}\in\mathbb{T}$ , we at least have the capability of sampling from the DGP with PDF $f\left(\bm{x}|\bm{\theta}\right)$ . That is, we have a simulation method that allows us to feasibly sample the IID vector $\mathbf{Y}_{m}=\left\{\bm{Y}_{i}\right\}_{i=1}^{m}$ , for any $m\in\mathbb{N}$ , for a DGP with PDF

$$f\left(\mathbf{y}_{m}|\bm{\theta}\right)=\prod_{i=1}^{m}f\left(\bm{y}_{i}|\bm{\theta}\right).$$

Typically, one should choose $m=n$ , as it fulfils the hypotheses of all of our proved theoretical results. This choice is made throughout all of our numerical demonstrations. However, we anticipate that there may be practical or computational scenarios, where it may be advantageous to be able to choose $m\neq n$ , which is permissible in our methodological framework.

Using the simulation mechanism that generates samples $\mathbf{Y}_{m}$ and the prior distribution that generates parameters $\bm{\theta}$ , we can simulate a set of $N\in\mathbb{N}$ simulations $\mathbf{Z}_{N}=\left\{\bm{Z}_{m,k}\right\}_{k=1}^{N}$ , where $\bm{Z}_{m,k}^{\top}=\left(\mathbf{Y}_{m,k}^{\top},\bm{\theta}_{k}^{\top}\right)$ and $\left(\cdot\right)^{\top}$ is the transposition operator. Here, for each $k\in\left[N\right]$ , $\bm{Z}_{m,k}$ is an observation from the DGP with joint PDF $f\left(\mathbf{y}_{m}|\bm{\theta}\right)\pi\left(\bm{\theta}\right)$ , hence each $\bm{Z}_{m,k}$ is composed of a parameter value and a datum conditional on the parameter value. We now consider how $\mathbf{X}_{n}$ and $\mathbf{Z}_{N}$ can be combined in order to construct an approximation of (2).

Following the approach of Jiang et al. (2018), we define $\mathcal{D}\left(\mathbf{x}_{n},\mathbf{y}_{m}\right)$ to be some non-negative real-valued function that outputs a small value if $\mathbf{x}_{n}$ and $\mathbf{y}_{m}$ are similar, and outputs a large value if $\mathbf{x}_{n}$ and $\mathbf{y}_{m}$ are different, in some sense. We call $\mathcal{D}\left(\mathbf{x}_{n},\mathbf{y}_{m}\right)$ the data discrepancy measurement between $\mathbf{x}_{n}$ and $\mathbf{y}_{m}$, and we say that $\mathcal{D}\left(\cdot,\cdot\right)$ is the data discrepancy function.

Next, we let $w\left(d,\epsilon\right)$ be a non-negative, decreasing (in $d$), and bounded (importance sampling) weight function (cf. Section 3 of Karabatsos and Leisen (2018)), which takes as inputs a data discrepancy measurement $d=\mathcal{D}\left(\mathbf{x}_{n},\mathbf{y}_{m}\right)\geq 0$ and a calibration parameter $\epsilon>0$. Using the weight and discrepancy functions, we can propose the following approximation for (2).

In the language of Jiang et al. (2018), we call

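$$\pi_{m,\epsilon}\left(\bm{\theta}|\mathbf{x}_{n}\right)=\frac{\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right)}{c_{m,\epsilon}\left(\mathbf{x}_{n}\right)}\tag{3}$$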

the pseudo-posterior PDF, where

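$$L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right)=\int_{\mathbb{X}^{m}}w\left(\mathcal{D}\left(\mathbf{x}_{n},\mathbf{y}_{m}\right),\epsilon\right)f\left(\mathbf{y}_{m}|\bm{\theta}\right){\rm d}\mathbf{y}_{m}$$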

is the approximate likelihood function, and

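$$c_{m,\epsilon}\left(\mathbf{x}_{n}\right)=\int_{\mathbb{T}}\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right){\rm d}\bm{\theta}$$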

is a normalization constant. We can use (3) to approximate (2) in the following way. For any functional of the parameter vector $\bm{\theta}$ of interest, $g\left(\bm{\theta}\right)$ say, we may approximate the posterior mean Bayesian estimator of $g\left(\bm{\theta}\right)$ via the expression

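$$\mathbb{E}\left[g\left(\bm{\theta}\right)|\mathbf{x}_{n}\right]\approx\frac{\int_{\mathbb{T}}g\left(\bm{\theta}\right)\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right){\rm d}\bm{\theta}}{c_{m,\epsilon}\left(\mathbf{x}_{n}\right)},\tag{4}$$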

where the right-hand side of (4) can be unbiasedly estimated using $\mathbf{Z}_{N}$ via

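$$\mathbb{M}\left[g\left(\bm{\theta}\right)|\mathbf{x}_{n}\right]=\frac{\sum_{k=1}^{N}g\left(\bm{\theta}_{k}\right)w\left(\mathcal{D}\left(\mathbf{x}_{n},\mathbf{Y}_{m,k}\right),\epsilon\right)}{\sum_{k=1}^{N}w\left(\mathcal{D}\left(\mathbf{x}_{n},\mathbf{Y}_{m,k}\right),\epsilon\right)}.\tag{5}$$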

We call the process of constructing (5), to approximate (4), the IS-ABC procedure. The general form of the IS-ABC procedure is provided in Algorithm 1.

Algorithm 1. IS-ABC procedure for approximating $\mathbb{E}\left[g\left(\bm{\theta}\right)|\mathbf{x}_{n}\right]$.

Input: a data discrepancy function $\mathcal{D}$, a weight function $w$, and a calibration parameter $\epsilon>0$.

For $k\in\left[N\right]$:

  sample $\bm{\theta}_{k}$ from PDF $\pi\left(\bm{\theta}\right)$;

  generate $\mathbf{Y}_{m,k}$ from the DGP with PDF $f\left(\mathbf{y}_{m}|\bm{\theta}_{k}\right)$;

  put $\bm{Z}_{m,k}=\left(\mathbf{Y}_{m,k},\bm{\theta}_{k}\right)$ into $\mathbf{Z}_{N}$.

Output: $\mathbf{Z}_{N}$, from which the estimator $\mathbb{M}\left[g\left(\bm{\theta}\right)|\mathbf{x}_{n}\right]$ is constructed.
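The loop in Algorithm 1 and the self-normalised weighted average it feeds can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `is_abc_estimate`, the toy Gaussian location model, the uniform prior, and the placeholder discrepancy `mean_gap` are all hypothetical choices made only so that the sketch runs end to end.

```python
import math
import random


def is_abc_estimate(x_obs, prior_sampler, simulator, discrepancy, weight, eps, g, N, rng):
    """Self-normalised weighted average of g(theta_k), k = 1..N (hypothetical helper)."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(N):
        theta = prior_sampler(rng)               # theta_k drawn from the prior
        y = simulator(theta, len(x_obs), rng)    # Y_{m,k} given theta_k, with m = n
        w = weight(discrepancy(x_obs, y), eps)   # w(D(x_n, Y_{m,k}), eps)
        numerator += g(theta) * w
        denominator += w
    if denominator == 0.0:
        raise ValueError("all weights are zero; try a larger eps or N")
    return numerator / denominator


# Hypothetical toy model: X ~ Normal(theta, 1) with a Uniform(-5, 5) prior.
def prior_sampler(rng):
    return rng.uniform(-5.0, 5.0)


def simulator(theta, m, rng):
    return [rng.gauss(theta, 1.0) for _ in range(m)]


def mean_gap(x, y):
    # Placeholder discrepancy: absolute difference of the sample means.
    return abs(sum(x) / len(x) - sum(y) / len(y))


def exp_weight(d, eps):
    # One bounded, non-negative, decreasing weight function: exp(-d / eps).
    return math.exp(-d / eps)


rng = random.Random(0)
x_obs = simulator(2.0, 50, rng)
estimate = is_abc_estimate(x_obs, prior_sampler, simulator, mean_gap,
                           exp_weight, 0.1, lambda t: t, 2000, rng)
print(round(estimate, 3))
```

Passing the discrepancy and the weight as arguments mirrors the way the algorithm treats $\mathcal{D}$ and $w$ as interchangeable inputs; any other non-negative discrepancy can be substituted for `mean_gap` without changing the loop.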

3 The energy statistic (ES)

Let $\bm{X}\in\mathbb{X}\subseteq\mathbb{R}^{d}$ and $\bm{Y}\in\mathbb{X}$ be two random variables that are in a space endowed with a semi-metric $\delta$, where $d\in\mathbb{N}$ (cf. Sejdinovic et al. (2013)). Furthermore, let $\bm{X}^{\prime}$ and $\bm{Y}^{\prime}$ be two random variables that have the same distributions as $\bm{X}$ and $\bm{Y}$, respectively. Here, $\bm{X}$, $\bm{X}^{\prime}$, $\bm{Y}$, and $\bm{Y}^{\prime}$ are all independent of one another.

Upon writing

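$$\mathcal{E}_{\delta}\left(\bm{X},\bm{Y}\right)=2\mathbb{E}\left[\delta\left(\bm{X},\bm{Y}\right)\right]-\mathbb{E}\left[\delta\left(\bm{X},\bm{X}^{\prime}\right)\right]-\mathbb{E}\left[\delta\left(\bm{Y},\bm{Y}^{\prime}\right)\right],$$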

we can define the original ES of Baringhaus and Franz (2004) and Szekely and Rizzo (2004), as a function of $\bm{X}$ and $\bm{Y}$, via the expression $\mathcal{E}_{\delta_{1}}\left(\bm{X},\bm{Y}\right)$, where $\delta_{\beta}\left(\bm{x},\bm{y}\right)=\left\|\bm{x}-\bm{y}\right\|_{2}^{\beta}$ is the $\beta$th power of the metric corresponding to the $L_{2}$-norm ($\beta\in\left(0,2\right]$; cf. Szekely and Rizzo, 2013, Prop. 2). Thus, the original ES, which we shall also denote as $\mathcal{E}\left(\bm{X},\bm{Y}\right)$, is defined using the Euclidean metric $\delta_{1}$.

The original ES has numerous useful mathematical properties. For instance, under the assumption that $\mathbb{E}\left\|\bm{X}\right\|_{2}+\mathbb{E}\left\|\bm{Y}\right\|_{2}<\infty$ , it was shown that

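$$\mathcal{E}\left(\bm{X},\bm{Y}\right)=\frac{\Gamma\left(\frac{d+1}{2}\right)}{\pi^{\left(d+1\right)/2}}\int_{\mathbb{R}^{d}}\frac{\left|\varphi_{X}\left(\bm{t}\right)-\varphi_{Y}\left(\bm{t}\right)\right|^{2}}{\left\|\bm{t}\right\|_{2}^{d+1}}{\rm d}\bm{t},$$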

in Proposition 1 of Szekely and Rizzo (2013), where $\Gamma\left(\cdot\right)$ is the gamma function and $\varphi_{X}$ (respectively, $\varphi_{Y}$) is the characteristic function of $\bm{X}$ (respectively, $\bm{Y}$). Thus, we have the fact that $\mathcal{E}\left(\bm{X},\bm{Y}\right)\geq 0$ for any $\bm{X},\bm{Y}\in\mathbb{X}$, and $\mathcal{E}\left(\bm{X},\bm{Y}\right)=0$ if and only if $\bm{X}$ and $\bm{Y}$ are identically distributed.

The result above is generalized in Proposition 3 of Szekely and Rizzo (2013), where we have the following statement. If $\delta\left(\bm{x},\bm{y}\right)=\delta\left(\bm{x}-\bm{y}\right)$ is a continuous function and $\bm{X},\bm{Y}\in\mathbb{R}^{d}$ are independent random variables, then it is necessary and sufficient that $\delta\left(\cdot\right)$ is strictly negative definite (see Szekely and Rizzo (2013) for the precise definition) for the following conclusion to hold: $\mathcal{E}_{\delta}\left(\bm{X},\bm{Y}\right)\geq 0$ for any $\bm{X},\bm{Y}\in\mathbb{X}$, and $\mathcal{E}_{\delta}\left(\bm{X},\bm{Y}\right)=0$ if and only if $\bm{X}$ and $\bm{Y}$ are identically distributed.

We observe that there is thus an infinite variety of functions $\delta$ from which we can construct energy statistics. We shall concentrate on the use of the original ES, based on $\delta_{1}$ , since it is the most well known and popular of the varieties.

3.1 The V-statistic estimator

Suppose that we observe $\mathbf{X}_{n}=\left\{\bm{X}_{i}\right\}_{i=1}^{n}$ and $\mathbf{Y}_{m}=\left\{\bm{Y}_{i}\right\}_{i=1}^{m}$, where the former is a sample containing $n$ IID replicates of $\bm{X}$, and the latter is a sample containing $m$ IID replicates of $\bm{Y}$, respectively, with $\mathbf{X}_{n}$ and $\mathbf{Y}_{m}$ being independent. In Gretton et al. (2012), it was shown that for any $\delta$, upon assuming that $\delta\left(\bm{x},\bm{y}\right)<\infty$, the so-called V-statistic estimator (cf. Serfling (1980, Ch. 5) and Koroljuk and Borovskich (1994))

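$$\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)=\frac{2}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m}\delta\left(\bm{X}_{i},\bm{Y}_{j}\right)-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta\left(\bm{X}_{i},\bm{X}_{j}\right)-\frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}\delta\left(\bm{Y}_{i},\bm{Y}_{j}\right)$$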

can be proved to converge in probability to $\mathcal{E}_{\delta}\left(\bm{X},\bm{Y}\right)$, as $n\rightarrow\infty$ and $m\rightarrow\infty$, under the condition that $m/n\rightarrow\alpha<\infty$, for some constant $\alpha$ (see also Gretton et al. (2007)). Here, the proof was provided in the context of MMDs (see the definition in Section 3.3) but is easily portable to the ES setting.

We note that the assumption of this result is rather restrictive, since it either requires the bounding of the space $\mathbb{X}$ or the function $\delta$ . In the sequel, we will present a result for the almost sure convergence of the V-statistic that depends on the satisfaction of a more realistic hypothesis.

It is noteworthy that if the ES is non-negative, then the V-statistic retains the non-negativity property of its corresponding ES (cf. Gretton et al. (2012)). That is, for any continuous and negative definite function $\delta\left(\bm{x},\bm{y}\right)=\delta\left(\bm{x}-\bm{y}\right)$, we have $\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)\geq 0$.
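As an illustration of the V-statistic estimator with the Euclidean metric $\delta_{1}$, the following sketch computes twice the average cross-sample distance minus the two average within-sample distances, where the within-sample averages include the zero diagonal terms. The function name `energy_v_statistic` is hypothetical, and the quadratic-time double loops are chosen for readability rather than speed.

```python
import math


def energy_v_statistic(xs, ys):
    """V-statistic estimate of the energy statistic under Euclidean distance.

    xs and ys are sequences of equal-length tuples of floats (one tuple per observation).
    """
    n, m = len(xs), len(ys)
    # Average distance between the two samples.
    cross = sum(math.dist(x, y) for x in xs for y in ys) / (n * m)
    # Average distances within each sample, diagonal (zero) terms included.
    within_x = sum(math.dist(a, b) for a in xs for b in xs) / (n * n)
    within_y = sum(math.dist(a, b) for a in ys for b in ys) / (m * m)
    return 2.0 * cross - within_x - within_y


same = energy_v_statistic([(0.0,), (1.0,)], [(0.0,), (1.0,)])
shifted = energy_v_statistic([(0.0,), (1.0,)], [(2.0,), (3.0,)])
print(same, shifted)  # 0.0 3.0
```

In the shifted example the cross term averages to 2 and each within-sample term to 0.5, giving 2(2) - 0.5 - 0.5 = 3, while identical samples give exactly zero, in line with the non-negativity property noted above.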

3.2 The ES-based IS-ABC algorithm

From Algorithm 1, we observe that an IS-ABC algorithm requires three components: a data discrepancy measurement $d=\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)\geq 0$, a weighting function $w\left(d,\epsilon\right)\geq 0$, and a tuning parameter $\epsilon>0$. We propose the use of the ES in the place of the data discrepancy measurement $d$, in combination with various weight functions that have been used in the literature. That is, we set

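$$\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)=\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right),$$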

in Algorithm 1.

In particular, we consider the original ES, where $\delta=\delta_{1}$. We name our framework the ES-ABC algorithm. In Section 4, we shall demonstrate that the proposed algorithm possesses desirable large sample qualities that guarantee its performance in practice, as illustrated in Section 5.
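To make the ES-ABC idea concrete, the sketch below plugs a V-statistic energy discrepancy into a rejection-style procedure. Several choices here are assumptions made for illustration and are not taken from the text: the helper names, the toy Gaussian location model with a uniform prior, and in particular the rule of setting $\epsilon$ to an empirical order statistic of the simulated discrepancies so that a fixed number of draws is retained.

```python
import math
import random


def energy_v_statistic(xs, ys):
    # V-statistic energy discrepancy under Euclidean distance (quadratic time).
    n, m = len(xs), len(ys)
    cross = sum(math.dist(x, y) for x in xs for y in ys) / (n * m)
    within_x = sum(math.dist(a, b) for a in xs for b in xs) / (n * n)
    within_y = sum(math.dist(a, b) for a in ys for b in ys) / (m * m)
    return 2.0 * cross - within_x - within_y


def es_abc_rejection(x_obs, prior_sampler, simulator, N, keep, rng):
    """Keep the draws whose discrepancy is strictly below the chosen threshold."""
    draws = []
    for _ in range(N):
        theta = prior_sampler(rng)
        y = simulator(theta, len(x_obs), rng)   # m = n simulated observations
        draws.append((energy_v_statistic(x_obs, y), theta))
    draws.sort(key=lambda pair: pair[0])
    eps = draws[keep][0]                        # assumption: eps from an order statistic
    accepted = [theta for d, theta in draws if d < eps]
    return accepted, eps


# Hypothetical toy model: X ~ Normal(theta, 1), prior Uniform(-5, 5), true theta = 1.
def prior_sampler(rng):
    return rng.uniform(-5.0, 5.0)


def simulator(theta, m, rng):
    return [(rng.gauss(theta, 1.0),) for _ in range(m)]


rng = random.Random(1)
x_obs = simulator(1.0, 30, rng)
accepted, eps = es_abc_rejection(x_obs, prior_sampler, simulator, 300, 15, rng)
posterior_mean = sum(accepted) / len(accepted)
print(len(accepted), round(posterior_mean, 3))
```

The accepted parameter values form a sample from the rejection pseudo-posterior at the chosen threshold; replacing the hard threshold with a smooth weight function would turn the same loop into a weighted IS-ABC estimate.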

3.3 Related methods

The ES-ABC algorithm that we have presented here is closely related to ABC algorithms based on the maximum mean discrepancy (MMD) that were implemented in Park et al. (2016), Jiang et al. (2018), and Bernton et al. (2019). For each Mercer kernel function $\chi\left(\bm{x},\bm{y}\right)$ ($\bm{x},\bm{y}\in\mathbb{X}$), the corresponding MMD is defined via the equation

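$$\text{MMD}_{\chi}^{2}\left(\bm{X},\bm{Y}\right)=\mathbb{E}\left[\chi\left(\bm{X},\bm{X}^{\prime}\right)\right]+\mathbb{E}\left[\chi\left(\bm{Y},\bm{Y}^{\prime}\right)\right]-2\mathbb{E}\left[\chi\left(\bm{X},\bm{Y}\right)\right],$$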

where $\bm{X},\bm{X}^{\prime},\bm{Y},\bm{Y}^{\prime}$ are random variables such that $\bm{X}$ and $\bm{Y}$ are identically distributed to $\bm{X}^{\prime}$ and $\bm{Y}^{\prime}$, respectively.

The MMD as a statistic for testing goodness-of-fit was studied prominently in articles such as Gretton et al., (2007), Gretton et al., (2009), and Gretton et al., (2012). More details regarding the relationship between the two classes of statistics can be found in Sejdinovic et al., (2013).

We note two shortcomings with respect to the applications of the MMD as a basis for an ABC algorithm in the previous literature. Firstly, no theoretical results regarding the consistency of the MMD-based methods have been proved. And secondly, in the application by Park et al., (2016) and Jiang et al., (2018), the MMD was implemented using the unbiased U-statistic estimator, rather than the biased V-statistic estimator. Although both estimators are consistent, in the sense that they can be proved to be convergent to the desired limiting MMD value, the U-statistic estimator has the property of not being bounded from below by zero (cf. Gretton et al., (2012)). As such, it does not meet the strict definition of a data discrepancy measurement.

For a sufficiently large sample size, the U-statistic will have low probability of having a value less than zero, and thus the difference between the U-statistic and V-statistic becomes immaterial for large $n$ . One may also consider a truncation of the U-statistic, which causes no issues, asymptotically, as the U-statistic and V-statistic have the same limit, which is guaranteed to be non-negative.
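The difference between the two estimators can be seen on a very small example. The sketch below assumes the usual squared-MMD estimators with the scalar Gaussian kernel $\chi(x,y)=\exp[-(x-y)^{2}]$ ; the function names are illustrative. When both samples are identical, the V-statistic is exactly zero, whereas the U-statistic, which drops the $i=j$ terms from the within-sample sums, is negative.

```python
import math

def gauss_kernel(x, y):
    """Gaussian kernel chi(x, y) = exp(-(x - y)^2) for scalar inputs."""
    return math.exp(-(x - y) ** 2)

def cross_mean(xs, ys):
    """Average kernel value over all pairs (x_i, y_j)."""
    return sum(gauss_kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))

def within_mean_offdiag(xs):
    """Average kernel value over pairs (x_i, x_j) with i != j."""
    n = len(xs)
    total = sum(gauss_kernel(xs[i], xs[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def mmd_v(xs, ys):
    """Biased V-statistic: every pair is used, including i == j."""
    return cross_mean(xs, xs) + cross_mean(ys, ys) - 2.0 * cross_mean(xs, ys)

def mmd_u(xs, ys):
    """Unbiased U-statistic: within-sample sums skip the i == j terms."""
    return (within_mean_offdiag(xs) + within_mean_offdiag(ys)
            - 2.0 * cross_mean(xs, ys))

x = [0.0, 1.0]
print(mmd_v(x, x))  # zero for identical samples
print(mmd_u(x, x))  # exp(-1) - 1, which is below zero
```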

4 Theoretical results

4.1 Behavior as ${n}\to\infty$ and ${m}\to\infty$

4.1.1 Analysis with a generic discrepancy

We now establish a consistency result for the pseudo-posterior density (3), when $n$ and $m$ approach infinity. Our result generalizes the main result of Jiang et al., (2018) (i.e., Theorem 1), which is the specific case when the weight function is restricted to the form

$$w\left(d,\epsilon\right)=\left\llbracket d<\epsilon\right\rrbracket, \tag{8}$$

where $\left\llbracket\cdot\right\rrbracket$ is the Iverson bracket notation, which equals 1 when the internal statement is true, and 0, otherwise (cf. Graham et al., (1994)).

The weighting function of form (8), when implemented within the IS-ABC framework, produces the common rejection ABC algorithms that were suggested by Tavaré et al., (1997), and Pritchard et al., (1999). We extend the result of Jiang et al., (2018) so that we may provide theoretical guarantees for more exotic ABC procedures, such as the kernel-smoothed ABC procedure of Park et al., (2016), which implements weights of the form

$$w\left(d,\epsilon\right)=\exp\left(-d^{q}/\epsilon\right), \tag{9}$$

for $q>0$ . See Karabatsos and Leisen, (2018) for further discussion and examples.
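The two weight functions can be sketched as follows, together with a self-normalized importance-sampling estimate of a pseudo-posterior mean. The forms $\left\llbracket d<\epsilon\right\rrbracket$ and $\exp\left(-d^{q}/\epsilon\right)$ are those discussed in the proof of Proposition 1; the helper `self_normalized_mean` and the toy numbers are hypothetical and serve only as an illustration.

```python
import math

def rejection_weight(d, eps):
    """Form (8): Iverson bracket [d < eps], giving rejection ABC."""
    return 1.0 if d < eps else 0.0

def kernel_weight(d, eps, q=2.0):
    """Form (9): exp(-d^q / eps), giving kernel-smoothed ABC."""
    return math.exp(-(d ** q) / eps)

def self_normalized_mean(thetas, discrepancies, weight, eps):
    """Weighted average of parameter draws, with weights w(d, eps)."""
    ws = [weight(d, eps) for d in discrepancies]
    total = sum(ws)
    if total == 0.0:
        raise ValueError("all weights are zero; increase eps")
    return sum(w * t for w, t in zip(ws, thetas)) / total

thetas = [0.0, 1.0, 2.0]          # toy parameter draws
discrepancies = [0.1, 0.5, 2.0]   # toy discrepancies d for each draw
print(self_normalized_mean(thetas, discrepancies, rejection_weight, 1.0))
print(self_normalized_mean(thetas, discrepancies, kernel_weight, 1.0))
```

With rejection weights, the third draw is discarded outright; with kernel weights, it is retained but heavily down-weighted.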

In order to prove our asymptotic result, we require Hunt’s lemma, which is reported in Dellacherie and Meyer, (1980), as Theorem 45 of Section V.5. For the convenience of the reader, we present the result below.

Theorem 1.

Let $\left(\Omega,\mathcal{F},\mathbb{P}\right)$ be a probability space with increasing $\sigma\text{-fields}$ $\left\{\mathcal{F}_{n}\right\}$ and let $\mathcal{F}_{\infty}=\cup_{n}\mathcal{F}_{n}$ . Suppose that $\left\{U_{n}\right\}$ is a sequence of random variables that is bounded from above in absolute value by some integrable random variable $V$ , and further suppose that $U_{n}$ converges almost surely to the random variable $U$ . Then, $\lim_{n\rightarrow\infty}\mathbb{E}\left(U_{n}|\mathcal{F}_{n}\right)=\mathbb{E}\left(U|\mathcal{F}_{\infty}\right)$ almost surely, and in $\mathcal{L}_{1}$ mean, as $n\rightarrow\infty$ .

Define the continuity set of a function $d\mapsto w\left(d\right)$ as

[TABLE]

Using Theorem 1, we can now prove the following result regarding the asymptotic behavior of the pseudo-posterior density function (3).

Theorem 2.

Let $\mathbf{X}_{n}$ and $\mathbf{Y}_{m}$ be IID samples from DGPs that can be characterized by PDFs $f\left(\mathbf{x}_{n}|\bm{\theta}_{0}\right)=\prod_{i=1}^{n}f\left(\bm{x}_{i}|\bm{\theta}_{0}\right)$ and $f\left(\mathbf{y}_{m}|\bm{\theta}\right)=\prod_{i=1}^{m}f\left(\bm{y}_{i}|\bm{\theta}\right)$ , respectively, with corresponding parameter vectors $\bm{\theta}_{0}$ and $\bm{\theta}$ . Suppose that the data discrepancy $\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ converges to some $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)$ , which is a function of $\bm{\theta}_{0}$ and $\bm{\theta}$ , almost surely as $n\rightarrow\infty$ , for some $m=m\left(n\right)\rightarrow\infty$ . If $w\left(d,\epsilon\right)$ is piecewise continuous and decreasing in $d$ and $w\left(d,\epsilon\right)\leq a<\infty$ for all $d\geq 0$ and any $\epsilon>0$ , and if

[TABLE]

then we have

[TABLE]

almost surely, as $n\rightarrow\infty$ .

Proof.

Using the notation of Theorem 1, we set $U_{n}=w\left(\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right),\epsilon\right)$ . Since $0\leq w\left(d,\epsilon\right)\leq a<\infty$ , for any $d$ , we have $\left|U_{n}\right|\leq V$ for the integrable random variable $V=a$ . Since $\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ converges almost surely to $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)$ , and $w\left(\cdot,\epsilon\right)$ is continuous at $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)$ , we have $U_{n}\rightarrow U=w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)$ with probability one by the extended continuous mapping theorem (cf. (DasGupta,, 2011, Thm. 7.10)).

Now, let $\mathcal{F}_{n}$ be the $\sigma\text{-field}$ generated by the sequence $\left\{\bm{X}_{1},\dots,\bm{X}_{n}\right\}$ . Thus, $\mathcal{F}_{n}$ is an increasing $\sigma\text{-field}$ , which approaches $\mathcal{F}_{\infty}=\cup_{n}\mathcal{F}_{n}$ . We are in a position to directly apply Theorem 1. This yields

[TABLE]

almost surely, as $n\rightarrow\infty$ , where the right-hand side equals $w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)$ .

Notice that the left-hand side has the form

[TABLE]

and therefore $L_{m,\epsilon}\left(\mathbf{X}_{n}|\bm{\theta}\right)\rightarrow w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)$ , almost surely, as $n\rightarrow\infty$ . Thus, the numerator of (3) converges to

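$$\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right), \tag{11}$$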

almost surely.

To complete the proof, it suffices to show that the denominator of (3) converges almost surely to

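$$\int_{\mathbb{T}}\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right){\rm d}\bm{\theta}. \tag{12}$$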

Since $L_{m,\epsilon}\left(\mathbf{X}_{n}|\bm{\theta}\right)\rightarrow w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)$ and $c_{m,\epsilon}\left(\mathbf{x}_{n}\right)=\int_{\mathbb{T}}\pi\left(\bm{\theta}\right)L_{m,\epsilon}\left(\mathbf{x}_{n}|\bm{\theta}\right){\rm d}\bm{\theta}$ , we obtain our desired convergence via the dominated convergence theorem, because $w\left(d,\epsilon\right)\leq a<\infty$ . An application of a continuous mapping theorem (cf. (DasGupta,, 2011, Thm. 7.8)) yields the almost sure convergence of the ratio between (11) and (12) to the right-hand side of (10), as $n\rightarrow\infty$ . ∎

The following result and its proof guarantee the applicability of Theorem 2 to rejection ABC procedures, and to kernel-smoothed ABC procedures, as used in Jiang et al., (2018) and Park et al., (2016), respectively.

Proposition 1.

The result of Theorem 2 applies to rejection ABC and importance sampling ABC, with weight functions of respective forms (8) and (9).

Proof.

For weights of form (8), we note that $w\left(d,\epsilon\right)=\left\llbracket d<\epsilon\right\rrbracket$ is continuous in $d$ at all points, other than when $d=\epsilon$ . Furthermore, $w\left(d,\epsilon\right)\in\left\{0,1\right\}$ and is hence non-negative and bounded. Thus, under the condition that $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)\neq\epsilon$ , we have the desired conclusion of Theorem 2.

For weights of form (9), we note that for fixed $\epsilon$ , $w\left(d,\epsilon\right)$ is continuous and positive in $d$ , and is uniformly bounded by 1. Differentiating with respect to $d$ , we obtain ${\rm d}w/{\rm d}d=-\left(q/\epsilon\right)d^{q-1}\exp\left(-d^{q}/\epsilon\right)$ , which is negative for any $d>0$ and $q>0$ , so that $w$ is decreasing in $d$ . Thus, (9) constitutes a weight function and satisfies the conditions of Theorem 2. ∎

4.1.2 Analysis with the energy statistic

We write $\log^{+}x=\log\left(\max\left\{1,x\right\}\right)$ . From Szekely and Rizzo, (2004) we have the fact that for arbitrary $\delta$ ,

[TABLE]

where

[TABLE]

is the kernel of the V-statistic that is based on the function $\delta$ . The following result is a direct consequence of Theorem 1 of Sen, (1977), when applied to V-statistics constructed from functionals $\delta$ that satisfy the hypothesis of (Szekely and Rizzo,, 2013, Prop. 3).

Lemma 1.

Make the same assumptions regarding $\mathbf{X}_{n}$ and $\mathbf{Y}_{m}$ as in Theorem 2. Let $\delta\left(\bm{x},\bm{y}\right)=\delta\left(\bm{x}-\bm{y}\right)$ be a continuous and strictly negative definite function. If

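$$\mathbb{E}\left(\left|\kappa_{\delta}\right|\log^{+}\left|\kappa_{\delta}\right|\right)<\infty, \tag{13}$$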

then $\mathcal{V}_{\delta}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ converges almost surely to $\mathcal{E}_{\delta}\left(\bm{X}_{1},\bm{Y}_{1}\right)\geq 0$ , as $\min\left\{n,m\right\}\rightarrow\infty$ , where $\bm{X}_{1},\bm{X}_{2}\in\mathbb{X}$ and $\bm{Y}_{1},\bm{Y}_{2}\in\mathbb{X}$ are arbitrary elements of $\mathbf{X}_{n}$ and $\mathbf{Y}_{m}$ , respectively.

We may apply the result of Lemma 1 directly to the case of $\delta=\delta_{1}$ in order to provide an almost sure convergence result regarding the V-statistic $\mathcal{V}_{\delta_{1}}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ .

Corollary 1.

Make the same assumptions regarding $\mathbf{X}_{n}$ and $\mathbf{Y}_{m}$ as in Theorem 2. If $\bm{X}\in\mathbb{X}$ and $\bm{Y}\in\mathbb{X}$ are arbitrary elements of $\mathbf{X}_{n}$ and $\mathbf{Y}_{m}$ , respectively, and

$$\mathbb{E}\left\|\bm{X}\right\|_{2}^{2}+\mathbb{E}\left\|\bm{Y}\right\|_{2}^{2}<\infty, \tag{14}$$

and if $\min\left\{n,m\right\}\rightarrow\infty$ , then $\mathcal{V}_{\delta_{1}}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ converges almost surely to

[TABLE]

where $\varphi\left(\bm{t};\bm{\theta}\right)$ is the characteristic function corresponding to the PDF $f\left(\bm{y};\bm{\theta}\right)$ .

Proof.

By the law of total expectation, we apply Lemma 1 by considering the two cases of (13): when $\left|\kappa_{\delta_{1}}\right|\leq 1$ and when $\left|\kappa_{\delta_{1}}\right|>1$ , separately, to write

[TABLE]

where $p_{0}=\mathbb{P}\left(\left|\kappa_{\delta_{1}}\right|\leq 1\right)$ and $p_{1}=\mathbb{P}\left(\left|\kappa_{\delta_{1}}\right|>1\right)$ . The first term on the right-hand side of (16) is equal to zero, since $\log^{+}\left|\kappa_{\delta_{1}}\right|=\log\left(1\right)=0$ , whenever $\left|\kappa_{\delta_{1}}\right|\leq 1$ . Thus, we need only be concerned with bounding the second term.

For $\left|\kappa_{\delta_{1}}\right|>1$ , $\left|\kappa_{\delta_{1}}\right|\log\left|\kappa_{\delta_{1}}\right|\leq\left|\kappa_{\delta_{1}}\right|^{2}$ , thus

[TABLE]

The condition that $\mathbb{E}\left(\left|\kappa_{\delta_{1}}\right|\log^{+}\left|\kappa_{\delta_{1}}\right|\right)<\infty$ is thus fulfilled if $\mathbb{E}\left(\left|\kappa_{\delta_{1}}\right|^{2}|\left|\kappa_{\delta_{1}}\right|>1\right)<\infty$ , which is equivalent to

[TABLE]

by virtue of the integrability of $\left\{\left|\kappa_{\delta_{1}}\right|^{2}|\left|\kappa_{\delta_{1}}\right|\leq 1\right\}$ implying the existence of

[TABLE]

since it is defined on a bounded support.

Next, by the triangle inequality,

[TABLE]

and hence

[TABLE]

Since $\bm{X}_{1},\bm{X}_{2},\bm{Y}_{1},\bm{Y}_{2}$ are all pairwise independent, and $\bm{X}_{1}$ and $\bm{Y}_{1}$ are identically distributed to $\bm{X}_{2}$ and $\bm{Y}_{2}$ , respectively, we have

[TABLE]

which concludes the proof since $\mathbb{E}\left\|\bm{X}_{1}\right\|^{2}_{2}+\mathbb{E}\left\|\bm{Y}_{1}\right\|^{2}_{2}<\infty$ is satisfied by the hypothesis and implies $\mathbb{E}\left\|\bm{X}_{1}\right\|_{2}+\mathbb{E}\left\|\bm{Y}_{1}\right\|_{2}<\infty$ . ∎

We note that condition (14) is stronger than a direct application of condition (13), which may be preferable in some situations. However, condition (14) is somewhat more intuitive and verifiable since it is concerned with the polynomial moments of norms and does not involve the piecewise function $\log^{+}x$ . It is also suggested in Zygmund, (1951) that one may replace $\log^{+}x$ by $\log\left(2+x\right)$ if it is more convenient to do so. We further note that (14) is required for establishing almost sure convergence, and is stronger than what is needed to ensure convergence in probability, as is established in Szekely and Rizzo, (2004) and Gretton et al., (2012).

Combining the result of Theorem 2 with Corollary 1 and the conclusion from Proposition 1 of Szekely and Rizzo, (2013) provided in Equation (15) yields the key result below. This result justifies the use of the V-statistic estimator $\mathcal{V}_{\delta_{1}}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ for the energy distance $\mathcal{E}\left(\bm{X},\bm{Y}\right)$ within the IS-ABC framework, and is comparable to Corollaries 1–3 of Jiang et al., (2018) regarding the large sample asymptotics of other discrepancy measurements.

Corollary 2.

Under the assumptions of Corollary 1, if $\mathcal{D}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)=\mathcal{V}_{\delta_{1}}\left(\mathbf{X}_{n},\mathbf{Y}_{m}\right)$ , then the conclusion of Theorem 2 follows with

[TABLE]

almost surely, as $n\rightarrow\infty$ , where $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)\geq 0$ and $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)=0$ , if and only if $\bm{\theta}_{0}=\bm{\theta}$ .
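The almost sure convergence of the V-statistic can be checked numerically in a case where the energy distance is available in closed form. The sketch below is an illustration under stated assumptions: it uses two univariate Gaussian samples with common standard deviation $\sigma$ , the identity $\mathcal{E}\left(X,Y\right)=2\mathbb{E}\left|X-Y\right|-\mathbb{E}\left|X-X^{\prime}\right|-\mathbb{E}\left|Y-Y^{\prime}\right|$ , and the folded-normal mean formula for $\mathbb{E}\left|Z\right|$ when $Z$ is Gaussian. All function names are illustrative.

```python
import math
import random

def mean_abs(xs, ys):
    """Average of |x - y| over all pairs (delta_1 in one dimension)."""
    return sum(abs(a - b) for a in xs for b in ys) / (len(xs) * len(ys))

def energy_v(xs, ys):
    """V-statistic estimate of the energy distance."""
    return 2.0 * mean_abs(xs, ys) - mean_abs(xs, xs) - mean_abs(ys, ys)

def folded_normal_mean(mu, s):
    """E|Z| for Z ~ N(mu, s^2)."""
    return (s * math.sqrt(2.0 / math.pi) * math.exp(-mu ** 2 / (2.0 * s ** 2))
            + mu * math.erf(mu / (s * math.sqrt(2.0))))

def energy_distance_gauss(theta0, theta, sigma):
    """Energy distance between N(theta0, sigma^2) and N(theta, sigma^2)."""
    s = sigma * math.sqrt(2.0)  # sd of a difference of two independent draws
    return 2.0 * folded_normal_mean(theta0 - theta, s) - 2.0 * folded_normal_mean(0.0, s)

rng = random.Random(0)
theta0, theta, sigma = 0.0, 1.0, 1.0
for n in (50, 200, 800):
    xs = [rng.gauss(theta0, sigma) for _ in range(n)]
    ys = [rng.gauss(theta, sigma) for _ in range(n)]
    print(n, energy_v(xs, ys))
print("limit", energy_distance_gauss(theta0, theta, sigma))
```

The limit is zero when $\bm{\theta}_{0}=\bm{\theta}$ and positive otherwise, in line with the statement of Corollary 2.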

4.2 Behavior as $\mathbf{\epsilon}\to\mathbf{0}$

Let $\mathbb{F}$ be the set of probability distributions on $\mathbb{X}$ . From (Sejdinovic et al.,, 2013, Thm. 22), we have the fact that $\mathcal{E}^{1/2}\left(\bm{X},\bm{Y}\right)=\mathcal{E}^{1/2}\left(F_{X},F_{Y}\right)$ is a metric on $\mathbb{F}$ , where $\bm{X}$ and $\bm{Y}$ have data generating processes that are characterized by $F_{X}$ and $F_{Y}$ , respectively. As such, when we take $\bm{X}$ and $\bm{Y}$ arising from two empirical distributions with an equal number of masses (defined on $\mathbf{x}_{n}$ and $\mathbf{y}_{n}$ , for instance), then we obtain the fact that $\mathcal{E}^{1/2}\left(\bm{X},\bm{Y}\right)=0$ if and only if the two empirical distributions are the same. In other words, $\mathbf{x}_{n}$ and $\mathbf{y}_{n}$ are equal, in the sense that the elements of $\mathbf{x}_{n}$ and $\mathbf{y}_{n}$ are equal up to a permutation. Proposition 2 of Bernton et al., (2019) then provides the following result in the case when $n$ is fixed.

Proposition 2.

Assume that $w\left(d,\epsilon\right)$ has form (8), where $d=\mathcal{V}_{\delta_{1}}$ , and that $f\left(\mathbf{x}_{n}|\bm{\theta}\right)$ is a continuous and exchangeable PDF. Furthermore, assume that

[TABLE]

and suppose that there exists some $\bar{\epsilon}>0$ , where

[TABLE]

Then, for fixed $\mathbf{x}_{n}$ , the pseudo-posterior PDF (3) converges strongly to the posterior PDF (2), as $\epsilon\rightarrow 0$ .

Let us suppose that the empirical distribution of $\bm{X}_{n}$ is denoted $\hat{F}_{n}$ and that each observation of $\bm{X}_{n}$ is generated from a process that can be characterized by the distribution $F_{0}$ (we write the joint distribution of $\bm{X}_{n}$ as $F_{n}$ ). We shall also write $F_{n}^{\bm{\theta}}$ as the probability distribution corresponding to the PDF $f\left(\mathbf{x}_{n}|\bm{\theta}\right)$ , and $\hat{F}_{n}^{\bm{\theta}}$ as the empirical distribution obtained from a sample $\bm{Y}_{n}$ with data generating process that is characterized by $F_{n}^{\bm{\theta}}$ .

Next, we denote the probability distributions corresponding to the prior and pseudo-posterior PDFs of the ES-based ABC process with rejection weights (i.e. $\pi(\bm{\theta})$ and (3)) by $\Pi$ and $\Pi_{n}^{\epsilon}$ , respectively. Finally, let us denote the probability of the set $\mathbb{A}$ with respect to the probability distribution $F$ as $F\left(\mathbb{A}\right)$ . In order to state our next result, we require the following assumptions.

A1

The data generating process of $\bm{X}_{n}$ is such that, for every $\varepsilon>0$ ,

[TABLE]

A2

For every $\epsilon>0$ ,

[TABLE]

where $s_{n}\left(\epsilon\right)$ is a sequence of functions that is strictly decreasing in $\epsilon$ for all $n$ , and $s_{n}\left(\epsilon\right)\rightarrow 0$ as $n\rightarrow\infty$ , for fixed $\epsilon$ . Here: $c\left(\bm{\theta}\right)$ is a positive function that is integrable with respect to $\Pi$ and satisfies $c\left(\bm{\theta}\right)\leq c_{0}$ for some $c_{0}$ , for all $\bm{\theta}$ such that, for some $\delta_{0}>0$ , $\mathcal{E}^{1/2}\left(F_{0},F_{1}^{\bm{\theta}}\right)\leq\delta_{0}+\epsilon_{0}$ , where $\epsilon_{0}=\min_{\bm{\theta}\in\mathbb{T}}\mathcal{E}^{1/2}\left(F_{0},F_{1}^{\bm{\theta}}\right)$ .

A3

There exists an $L>0$ and a $c_{\pi}>0$ such that, for each sufficiently small $\epsilon>0$ ,

[TABLE]

Upon making Assumptions A1–A3, we may apply the proof process of (Bernton et al.,, 2019, Prop. 3) directly, replacing the Wasserstein metric with the energy metric $\mathcal{E}^{1/2}$ , where appropriate. Such a process yields the following result.

Proposition 3.

Along with A1–A3, assume that there exists a sequence $\left\{\epsilon_{n}\right\}_{n=1}^{\infty}$ , such that $\epsilon_{n}\rightarrow 0$ , $s_{n}\left(\epsilon_{n}\right)\rightarrow 0$ and $F_{n}\left(\mathcal{E}^{1/2}\left(\hat{F}_{n},F_{0}\right)\leq\epsilon_{n}\right)\rightarrow 1$ , as $n\rightarrow\infty$ . Then, the ES-based ABC algorithm with $m=n$ , discrepancy $d=\mathcal{V}_{\delta_{1}}^{1/2}$ , and rejection weights using $\epsilon=\epsilon_{n}+\epsilon_{0}$ satisfies the inequality

[TABLE]

for some $C,R\in\left(0,\infty\right)$ , with probability going to 1 as $n\rightarrow\infty$ (with respect to $F_{0}$ ).

The hypotheses of Proposition 2 are straightforward and the conclusion implies that the pseudo-posterior PDF of the ES-based ABC procedure can approximate the posterior PDF, based on the likelihood of the data generating process of $\mathbf{x}_{n}$ , to an arbitrary level of accuracy, when $\epsilon$ is made sufficiently small. This, however, does not mean that one should make $\epsilon$ too small in practice, since fewer simulated data sets are then accepted and the procedure becomes more computationally intensive. Note that the value of $\epsilon$ is often chosen in a pragmatic way as a quantile (of a small order, usually less than 5%) of all the distances that are obtained in the ABC sample, thus deciding how many samples are kept as a fraction of the entire ABC replications. This procedure was used in the ABC algorithms of Beaumont et al., (2002), Blum, (2010b), and Jabot et al., (2013).

Next, we note that the assumptions (other than A1, which is generally true for stationary and ergodic data; cf. (Szekely and Rizzo,, 2013, Secs. 7 and 8)) and the conclusion of Proposition 3 are more complex. Due to the lack of ease by which A2 and A3 may be validated, the proposition is more useful as an existence result regarding what can be expected in theory, with respect to how quickly the ES-based ABC algorithm converges in $n$ , rather than providing any practical guidance. A suggestion by Bernton et al., (2019) is that one may potentially apply the theory of Fournier and Guillin, (2015) and Weed et al., (2019) in order to validate assumption A2.

Under further assumptions, the concentration with respect to the discrepancy in distributions can be transferred to a concentration result, with respect to parameter vector in the space $\mathbb{T}$ (cf. (Bernton et al.,, 2019, Cor. 1)).

4.3 Illustration on a simple example

We use $\bm{X}\sim\mathcal{L}$ to denote that the random variable $\bm{X}$ has probability law $\mathcal{L}$ . Furthermore, we denote the normal law by $\mathcal{N}\left(\bm{\mu},\bm{\Sigma}\right)$ , where $\bm{X}\sim\mathcal{N}\left(\bm{\mu},\bm{\Sigma}\right)$ states that the DGP of $\bm{X}$ is a multivariate normal distribution with mean vector $\bm{\mu}$ and covariance matrix $\bm{\Sigma}$ .

For illustrating the theoretical results, we investigate the pseudo-posterior limit on a simple univariate Gaussian location model $\mathcal{N}(\bm{\theta},\sigma^{2})$ (with known variance $\sigma^{2}$ ) with conjugate Gaussian prior $\bm{\theta}\sim\mathcal{N}(0,\tau^{2})$ (with variance $\tau^{2}$ fixed). We have IID observations $\bm{X}_{1},\dots,\bm{X}_{n}\mid\bm{\theta}_{0}\sim\mathcal{N}(\bm{\theta}_{0},\sigma^{2})$ , and IID replicates $\bm{Y}_{1},\dots,\bm{Y}_{m}\mid\bm{\theta}\sim\mathcal{N}(\bm{\theta},\sigma^{2})$ . The posterior is $\bm{\theta}\mid\bm{X}_{1},\dots,\bm{X}_{n}\sim\mathcal{N}(\hat{\bm{\theta}},\hat{\sigma}^{2})$ , where

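$$\hat{\bm{\theta}}=\frac{\tau^{2}}{\tau^{2}+\sigma^{2}/n}\cdot\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\quad\text{and}\quad\hat{\sigma}^{2}=\left(\frac{n}{\sigma^{2}}+\frac{1}{\tau^{2}}\right)^{-1}.$$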

In this simple model, the limiting data discrepancy takes the form (up to a proportionality constant) of $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)=(\bm{\theta}_{0}-\bm{\theta})^{2}$ for the energy distance and Kullback–Leibler divergence, and $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)=|\bm{\theta}_{0}-\bm{\theta}|$ for the MMD and the (second order) Wasserstein distance.

Theorem 2 establishes that the large $n$ and $m$ limit of the pseudo-posterior $\pi_{m,\epsilon}$ is the distribution that we denote here by $\pi_{\infty,\epsilon}\left(\bm{\theta}\right)\propto\pi\left(\bm{\theta}\right)w\left(\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right),\epsilon\right)$ . For illustrative purposes, let us focus on the case when $\mathcal{D}_{\infty}\left(\bm{\theta}_{0},\bm{\theta}\right)=\left|\bm{\theta}_{0}-\bm{\theta}\right|$ , and consider rejection ABC with $w\left(d,\epsilon\right)=\left\llbracket d<\epsilon\right\rrbracket$ and IS-ABC with $w\left(d,\epsilon\right)=\exp\left(-d^{2}/2\epsilon^{2}\right)$ . The limiting pseudo-posterior can then be obtained in closed-form as

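$$\pi_{\infty,\epsilon}\left(\bm{\theta}\right)\propto\exp\left(-\frac{\bm{\theta}^{2}}{2\tau^{2}}\right)\left\llbracket\left|\bm{\theta}_{0}-\bm{\theta}\right|<\epsilon\right\rrbracket,$$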

a truncated Gaussian for rejection ABC and

$$\pi_{\infty,\epsilon}=\mathcal{N}\left(\bar{\bm{\theta}}(\epsilon),\bar{\sigma}^{2}(\epsilon)\right)$$

for IS-ABC, where $\bar{\bm{\theta}}(\epsilon)=\frac{\bm{\theta}_{0}}{1+\epsilon^{2}/\tau^{2}}$ and $\bar{\sigma}^{-2}(\epsilon)=\tau^{-2}+\epsilon^{-2}$ . See Figure 1 for an illustration, for various values of $\epsilon$ .
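The expressions for $\bar{\bm{\theta}}(\epsilon)$ and $\bar{\sigma}^{2}(\epsilon)$ follow from multiplying the $\mathcal{N}(0,\tau^{2})$ prior density by the weight $\exp\left(-(\bm{\theta}_{0}-\bm{\theta})^{2}/2\epsilon^{2}\right)$ and completing the square. As a hedged numerical check, the sketch below compares these closed forms with moments obtained by normalizing the product on a grid; the grid limits and the parameter values are arbitrary choices made for the illustration.

```python
import math

def is_abc_limit(theta0, tau, eps):
    """Closed-form mean and variance of the limiting IS-ABC pseudo-posterior."""
    mean = theta0 / (1.0 + eps ** 2 / tau ** 2)
    var = 1.0 / (tau ** -2 + eps ** -2)
    return mean, var

def grid_moments(theta0, tau, eps, lo=-10.0, hi=10.0, steps=20001):
    """The same moments, from normalizing prior times weight on a grid."""
    h = (hi - lo) / (steps - 1)
    grid = [lo + i * h for i in range(steps)]
    dens = [math.exp(-t ** 2 / (2.0 * tau ** 2))
            * math.exp(-(theta0 - t) ** 2 / (2.0 * eps ** 2)) for t in grid]
    z = sum(dens)
    mean = sum(t * p for t, p in zip(grid, dens)) / z
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, dens)) / z
    return mean, var

print(is_abc_limit(1.0, 2.0, 0.5))
print(grid_moments(1.0, 2.0, 0.5))
```

As $\epsilon\rightarrow 0$ , the mean tends to $\bm{\theta}_{0}$ and the variance tends to zero, so the limiting pseudo-posterior concentrates at the true parameter value.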

5 Illustrations

We illustrate the use of the ES on some standard models. The standard rejection ABC algorithm is employed (that is, we use Algorithm 1 with weight function $w$ of form (8)) for constructing estimators (5). The proposed ES is compared to the Kullback–Leibler divergence (KL), the Wasserstein distance (WA), and the maximum mean discrepancy (MMD). Here, the ES is applied using the Euclidean metric $\delta_{1}$ , the Wasserstein distance using the exponent $p=2$ and the approximation by the swapping distance (Bernton et al.,, 2019) and the MMD using a Gaussian kernel $\chi(\bm{x},\bm{y})=\exp{[-(\bm{x}-\bm{y})^{2}]}$ . The Gaussian kernel is commonly used in the MMD literature, and was also considered for ABC in Park et al., (2016) and Jiang et al., (2018). Details regarding the use of the Kullback–Leibler divergence as a discrepancy function for ABC algorithms can be found in Sec. 2 of Jiang et al., (2018). With respect to the theoretical results of Section 4, the chosen examples can be shown to be sufficiently regular as to validate the hypotheses of Corollary 2 and Proposition 2. However, we believe that it would be difficult to validate Assumptions A2 and A3 of Proposition 3, without further theoretical development.

We consider examples explored in (Jiang et al.,, 2018, Sec. 4.1). For each illustration below, we sample synthetic data of the same size $m$ as the observed data size, $n$ , whose value is specified for each model below. The ABC procedure is sensitive to the choice of the prior; we follow the benchmark examples of Jiang et al., (2018) by employing the same uniform priors, as specified in each example. The number of ABC iterations in Algorithm 1 is set to $N=10^{5}$ . The tuning parameter $\epsilon$ is set so that only the $0.05\%$ smallest discrepancies are kept to form the ABC posterior sample. We postpone the discussion of the results of our simulation experiments to Section 5.5.
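The quantile-based rejection scheme described above can be sketched as follows. This is a toy illustration, not the authors' R implementation: it uses a univariate Gaussian location model, a hypothetical $\text{Unif}(-3,3)$ prior, and much smaller values of $N$ and $n$ than in the experiments (here $N=400$ , $n=m=30$ , and the smallest $5\%$ of discrepancies are kept) so that it runs quickly without external libraries.

```python
import random

def mean_abs(xs, ys):
    """Average of |x - y| over all pairs (delta_1 in one dimension)."""
    return sum(abs(a - b) for a in xs for b in ys) / (len(xs) * len(ys))

def energy_v(xs, ys):
    """V-statistic estimate of the energy distance."""
    return 2.0 * mean_abs(xs, ys) - mean_abs(xs, xs) - mean_abs(ys, ys)

def rejection_abc(x_obs, sample_prior, simulate, n_iter, keep_frac, rng):
    """Keep the parameter draws whose discrepancies are in the lowest quantile."""
    scored = []
    for _ in range(n_iter):
        theta = sample_prior(rng)
        y_sim = simulate(theta, len(x_obs), rng)  # simulated size m equals n
        scored.append((energy_v(x_obs, y_sim), theta))
    scored.sort(key=lambda pair: pair[0])
    n_keep = max(1, round(keep_frac * n_iter))
    eps = scored[n_keep - 1][0]  # largest retained discrepancy
    return [theta for _, theta in scored[:n_keep]], eps

rng = random.Random(1)
x_obs = [rng.gauss(1.0, 1.0) for _ in range(30)]  # toy data, true mean 1
kept, eps = rejection_abc(
    x_obs,
    sample_prior=lambda r: r.uniform(-3.0, 3.0),
    simulate=lambda theta, m, r: [r.gauss(theta, 1.0) for _ in range(m)],
    n_iter=400,
    keep_frac=0.05,
    rng=rng,
)
print(len(kept), eps, sum(kept) / len(kept))
```

Sorting all $N$ discrepancies and keeping a fixed fraction means that $\epsilon$ is determined by the simulations rather than fixed in advance, which is the pragmatic choice described in the text.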

The experiments were implemented in R, using in particular the winference package (Bernton et al.,, 2019) and the FNN package (Beygelzimer et al.,, 2013). The Kullback–Leibler divergence between two PDFs is computed within the $1$ -nearest neighbor framework (Boltz et al.,, 2009). Moreover, a $k$ -d tree is adopted for implementing the nearest neighbor search, which is the same as the method of Jiang et al., (2018). For estimating the $2$ -Wasserstein distance between two multivariate empirical measures, we propose to employ the swapping algorithm (Puccetti,, 2017), which is simple to implement, and is more accurate and less computationally expensive than other algorithms commonly used in the literature (Bernton et al.,, 2019). Regarding the MMD, the same unbiased U-statistic estimator is adopted as given in Jiang et al., (2018) and Park et al., (2016). For reproduction of the experimental results, the original source code can be accessed at https://github.com/hiendn/Energy_Statistics_ABC.

5.1 Bivariate Gaussian mixture model

Let $\mathbf{X}_{n}$ be a sequence of IID random variables, such that each $\bm{X}_{i}$ has a mixture of bivariate Gaussian probability law

[TABLE]

with known covariance matrices

[TABLE]

We aim to estimate the generative parameters $\bm{\theta}^{\top}=(p,\bm{\mu}_{0}^{\top},\bm{\mu}_{1}^{\top})$ consisting of the mixing probability $p$ and the population means $\bm{\mu}_{0}$ and $\bm{\mu}_{1}$ . We denote the uniform law, in the interval $(a,b)$ , for $a<b$ , by $\text{Unif}(a,b)$ . The priors on the model parameters are uniform; that is, $\bm{\mu}_{0}\sim\text{Unif}(-1,1)^{2}$ , $\bm{\mu}_{1}\sim\text{Unif}(-1,1)^{2}$ and $p\sim\text{Unif}(0,1)$ . We perform ABC using $n=500$ observations, sampled from model (19) with $p=0.3$ , $\bm{\mu}_{0}^{\top}=(0.7,0.7)$ and $\bm{\mu}_{1}^{\top}=(-0.7,-0.7)$ . A kernel density estimate (KDE) of the ABC posterior distribution (bivariate marginals of $\bm{\mu}_{0}$ and $\bm{\mu}_{1}$ ) is presented in Figure 2.

5.2 Moving-average model of order 2

The moving-average model of order $q$ , MA( $q$ ), is a stochastic process $\{Y_{t}\}_{t\in\mathbb{N}^{\ast}}$ defined as

[TABLE]

with $\{Z_{t}\}_{t\in\mathbb{Z}}$ being a sequence of unobserved noise error terms. Jiang et al., (2018) used a MA $(2)$ model for their benchmarking; namely $Y_{t}=Z_{t}+\theta_{1}Z_{t-1}+\theta_{2}Z_{t-2},\ t\in[D]$ . Each observation $\bm{Y}$ corresponds to a time series of length $D$ . Here, we use the same model as that proposed in Jiang et al., (2018), where $Z_{t}$ follows the Student- $t$ distribution with $5$ degrees of freedom, and $D=10$ . The priors on the model parameters $\theta_{1}$ and $\theta_{2}$ are taken to be uniform, that is, $\theta_{1}\sim\text{Unif}(-2,2)$ and $\theta_{2}\sim\text{Unif}(-1,1)$ . We performed ABC using $n=200$ samples generated from a model with the true parameter values $(\theta_{1},\theta_{2})=(0.6,0.2)$ . A KDE of the ABC joint posterior distribution of $(\theta_{1},\theta_{2})$ is displayed in Figure 3.
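A simulator for this MA $(2)$ model might look like the sketch below. Two choices are assumptions rather than details taken from the paper: the two pre-sample noise terms $Z_{-1}$ and $Z_{0}$ are drawn from the same Student- $t$ law as the others, and a Student- $t$ draw is generated as a standard normal divided by the square root of an independent scaled chi-squared variable. The function names are illustrative.

```python
import math
import random

def student_t(df, rng):
    """Student-t draw as N(0, 1) / sqrt(chi2_df / df)."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared with df degrees of freedom
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)

def ma2_series(theta1, theta2, length, noise):
    """One series Y_t = Z_t + theta1 * Z_{t-1} + theta2 * Z_{t-2}."""
    z = [noise() for _ in range(length + 2)]  # two extra pre-sample terms
    return [z[t] + theta1 * z[t - 1] + theta2 * z[t - 2]
            for t in range(2, length + 2)]

def simulate_ma2(theta1, theta2, n, length=10, df=5, seed=0):
    """n independent MA(2) series of the given length with Student-t noise."""
    rng = random.Random(seed)
    return [ma2_series(theta1, theta2, length, lambda: student_t(df, rng))
            for _ in range(n)]

data = simulate_ma2(0.6, 0.2, n=200)
print(len(data), len(data[0]))
```

Each of the $n$ observations is a vector of length $D$ , so that the ES is computed between two samples of $D$ -dimensional points.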

5.3 Bivariate beta model

The bivariate beta model proposed by Crackel and Flegal, (2017) is defined with five positive parameters $\theta_{1},\ldots,\theta_{5}$ by letting

$$V_{1}=\frac{U_{1}+U_{3}}{U_{5}+U_{4}}\quad\text{and}\quad V_{2}=\frac{U_{2}+U_{4}}{U_{5}+U_{3}},$$

where $U_{i}\sim\text{Gamma}(\theta_{i},1)$ , for $i\in[5]$ , and setting $Z_{1}=V_{1}/(1+V_{1})$ and $Z_{2}=V_{2}/(1+V_{2})$ . The bivariate random variable $\bm{Z}^{\top}=(Z_{1},Z_{2})$ has marginal laws $Z_{1}\sim\mathrm{Beta}(\theta_{1}+\theta_{3},\theta_{5}+\theta_{4})$ and $Z_{2}\sim\mathrm{Beta}(\theta_{2}+\theta_{4},\theta_{5}+\theta_{3})$ . We performed ABC using samples of size $n=500$ , which are generated from a DGP with true parameter values $(\theta_{1},\theta_{2},\theta_{3},\theta_{4},\theta_{5})=(1,1,1,1,1)$ . The prior on each of the model parameters is taken to be independent $\mathrm{Unif}(0,5)$ . KDEs of the marginal ABC posterior distributions of parameters $\theta_{1},\theta_{2},\theta_{3},\theta_{4}$ and $\theta_{5}$ are displayed in Figure 4.
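The construction can be sketched as a simulator. The ratios used for $V_{1}$ and $V_{2}$ below, namely $(U_{1}+U_{3})/(U_{5}+U_{4})$ and $(U_{2}+U_{4})/(U_{5}+U_{3})$ , are an assumption inferred from the stated marginal laws, since these are the ratios of gamma sums that produce $\mathrm{Beta}(\theta_{1}+\theta_{3},\theta_{5}+\theta_{4})$ and $\mathrm{Beta}(\theta_{2}+\theta_{4},\theta_{5}+\theta_{3})$ margins. The function name is illustrative.

```python
import random

def bivariate_beta(theta, n, seed=0):
    """n draws of (Z1, Z2) built from five independent Gamma(theta_i, 1) variables."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u1, u2, u3, u4, u5 = (rng.gammavariate(t, 1.0) for t in theta)
        v1 = (u1 + u3) / (u5 + u4)  # assumed form, see the lead-in
        v2 = (u2 + u4) / (u5 + u3)  # assumed form, see the lead-in
        out.append((v1 / (1.0 + v1), v2 / (1.0 + v2)))
    return out

sample = bivariate_beta((1.0, 1.0, 1.0, 1.0, 1.0), n=2000)
mean_z1 = sum(z1 for z1, _ in sample) / len(sample)
mean_z2 = sum(z2 for _, z2 in sample) / len(sample)
print(mean_z1, mean_z2)  # both margins are Beta(2, 2) here, with mean 1/2
```

The shared variables $U_{3}$ , $U_{4}$ and $U_{5}$ are what induce dependence between $Z_{1}$ and $Z_{2}$ .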

5.4 Multivariate g-and-k distribution

A univariate $g$ -and- $k$ distribution can be defined via its quantile function (Drovandi and Pettitt,, 2011):

[TABLE]

where parameters $(A,B,g,k)$ respectively relate to location, scale, skewness, and kurtosis. Here, $z_{x}$ is the $x$ th quantile of the standard normal distribution. Given a set of parameters $(A,B,g,k)$ , it is easy to simulate $D$ observations of a DGP with quantile function (21), by generating an IID sample $\{Z_{i}\}_{i=1}^{D}$ , where $Z_{i}\sim\mathcal{N}(0,1)$ , for $i\in[D]$ , and applying (21) to each $Z_{i}$ .

A so-called $D$ -dimensional $g$ -and- $k$ DGP can instead be defined by applying the quantile function (21) to each of the $D$ elements of a multivariate normal vector $\bm{Z}^{\top}=(Z_{1},...,Z_{D})\sim\mathcal{N}(\bm{0},\bm{\Sigma})$ , where $\bm{\Sigma}$ is a covariance matrix. In our experiment, we use a 5-dimensional $g$ -and- $k$ model with the same covariance matrix and parameter values for $(A,B,g,k)$ as that considered by Jiang et al., (2018). That is, we generate samples of size $n=200$ from a $g$ -and- $k$ DGP with the true parameter values $(A,B,g,k)=(3,1,2,0.5)$ and the covariance matrix

[TABLE]

where $\rho=-0.3$ . The prior on the model parameters $A,B,g,k$ is taken to be independent $\mathrm{Unif}(0,4)$ , while $\rho$ is independently assigned a $\mathrm{Unif}(-0.5,0.5)$ prior. KDEs of the marginal ABC posterior distributions of parameters $A,B,g,k$ and $\rho$ are displayed in Figure 5.
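A sketch of a simulator for the multivariate $g$ -and- $k$ DGP is given below. It rests on two assumptions that should be checked against the original sources: the quantile function is taken in its conventional form $A+B\left[1+c\,\frac{1-\exp(-gz)}{1+\exp(-gz)}\right](1+z^{2})^{k}z$ with $c=0.8$ , and $\bm{\Sigma}$ is taken to have ones on the diagonal, $\rho$ on the first off-diagonals, and zeros elsewhere. The function names are illustrative.

```python
import math
import random

def gk_quantile(z, a, b, g, k, c=0.8):
    """g-and-k quantile function evaluated at a standard normal quantile z."""
    # (1 - exp(-g z)) / (1 + exp(-g z)) equals tanh(g z / 2).
    skew = 1.0 + c * math.tanh(g * z / 2.0)
    return a + b * skew * (1.0 + z * z) ** k * z

def cholesky(mat):
    """Lower-triangular factor L with L L^T = mat (mat positive definite)."""
    d = len(mat)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][p] * low[j][p] for p in range(j))
            if i == j:
                low[i][j] = math.sqrt(mat[i][i] - s)
            else:
                low[i][j] = (mat[i][j] - s) / low[j][j]
    return low

def simulate_gk(n, a, b, g, k, rho, dim=5, seed=0):
    """n draws of a dim-dimensional g-and-k vector."""
    rng = random.Random(seed)
    # Assumed covariance: ones on the diagonal, rho on the first off-diagonals.
    sigma = [[1.0 if i == j else (rho if abs(i - j) == 1 else 0.0)
              for j in range(dim)] for i in range(dim)]
    low = cholesky(sigma)
    draws = []
    for _ in range(n):
        e = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z = [sum(low[i][p] * e[p] for p in range(i + 1)) for i in range(dim)]
        draws.append([gk_quantile(zi, a, b, g, k) for zi in z])
    return draws

sample = simulate_gk(200, a=3.0, b=1.0, g=2.0, k=0.5, rho=-0.3)
print(len(sample), len(sample[0]))
```

With $g=0$ and $k=0$ the transformation reduces to $A+Bz$ , which provides a simple sanity check of the implementation.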

5.5 Discussion of the results and performance

For each of the four experiments and each parameter, we computed the posterior mean $\hat{\theta}_{\text{mean}}$ , posterior median $\hat{\theta}_{\text{med}}$ , mean absolute error and mean squared error defined by

[TABLE]

where $\left\{\theta_{k}\right\}_{k=1}^{M}$ denotes the pseudo-posterior sample and $\theta_{0}$ denotes the true parameter. Here $M=50$ since $N=10^{5}$ and $\epsilon$ is chosen as to retain $0.05\%$ of the samples. Each experiment was replicated ten times by keeping the same fixed (true) values for the parameters and by sampling new observed data each of the ten times. The estimated quantities $\hat{\theta}_{\text{mean}}$ , $\hat{\theta}_{\text{med}}$ , and errors MAE and $\text{RMSE}=\text{MSE}^{1/2}$ were then averaged over the ten replications, and are reported along with standard deviations $\sigma(\cdot)$ in columns associated with each estimator and true values $\theta_{0}$ for each parameter in Tables 1, 2, 3 and 5.
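The summaries reported in the tables can be computed from a pseudo-posterior sample as in the sketch below. It assumes the natural definitions $\text{MAE}=M^{-1}\sum_{k=1}^{M}\left|\theta_{k}-\theta_{0}\right|$ and $\text{MSE}=M^{-1}\sum_{k=1}^{M}\left(\theta_{k}-\theta_{0}\right)^{2}$ ; the function name and the toy sample are illustrative.

```python
import math
import statistics

def summarize(sample, theta0):
    """Posterior mean, median, MAE, MSE and RMSE for one scalar parameter."""
    mse = statistics.mean((t - theta0) ** 2 for t in sample)
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mae": statistics.mean(abs(t - theta0) for t in sample),
        "mse": mse,
        "rmse": math.sqrt(mse),
    }

toy_sample = [0.5, 1.0, 1.5]  # stands in for the M retained draws
summary = summarize(toy_sample, theta0=1.0)
print(summary)
```

In the experiments, these quantities would be computed for each parameter and each replication, and then averaged over the ten replications.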

Upon inspection, Tables 1, 2, 3 and 5 show some advantage in performance from WA on the bivariate Gaussian mixtures, some advantage from the MMD on the bivariate beta model, and some advantage from the ES on the $g$ -and- $k$ model, while multiple methods are required to make the best inference in the case of the MA $(2)$ experiment. When we further take into account the standard deviations of the estimators, we observe that all four data discrepancy measures perform comparably well across the four experimental models. Thus, we may conclude that there is no universally best performing discrepancy measure. Some considerations are therefore necessary when choosing between discrepancies. The first point of consideration is whether the data $\mathbf{X}_{n}$ are random variables arising from continuous or discrete measures. In the case that the data $\mathbf{X}_{n}$ arise from a discrete measure, the KL discrepancy measure is not applicable, since it is not defined on a set of measure greater than zero. Another consideration regarding the choice of discrepancy measures is the computational complexity of each discrepancy measure, as is summarized in Table 4.

From Table 4, we firstly note that in the case of univariate data, all methods have the same computational complexity, as all of the discrepancy measures amount to comparisons between the order statistics of the observed and simulated data. Computational complexity becomes a greater separating criterion when considering the multivariate setting. In the multivariate case, the KL divergence is clearly faster than the other methods, but as mentioned before, is not applicable for discrete data. The ES and MMD methods share the same order of complexity, $\mathcal{O}((n+m)^{2})$ , due to their theoretical equivalence (cf. Sejdinovic et al., (2013)). It is notable that, in general, the computational complexity of the WA discrepancy is of order $\mathcal{O}((n+m)^{5/2}\log(n+m))$ , which is greater than that of the ES and MMD discrepancies, and is thus a significantly slower method when $n$ and $m$ get large. However, in our numerical results, we have used the $\mathcal{O}((n+m)^{2})$ swapping distance approximation of the WA method, as was considered in Bernton et al., (2019). Although this approximation is faster than the exact WA discrepancy, it does not converge to the same value, in general, and thus theoretical results regarding the WA discrepancy cannot be directly applied to the approximation (although some theoretical statements are still available). Thus, there is a trade-off regarding theoretical outcomes when using the swap distance approximation.

We note that in the case when the MMD discrepancy measure is estimated by the V-statistic estimator, many of our theoretical results from Section 4 are applicable with minor modifications, due to the results of Sejdinovic et al. (2013). Thus, the choice between the ES and the MMD method comes down to a preference for the use of kernels or metrics. A consideration regarding the choice of the MMD discrepancy versus the ES discrepancy is that, to the best of our knowledge, a comparable result to (6) does not exist for any common kernel choice.
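The kernel versus metric equivalence can be checked numerically. The sketch below assumes the distance-induced kernel $k(x,y)=\tfrac{1}{2}\{\|x-z_{0}\|+\|y-z_{0}\|-\|x-y\|\}$ for an arbitrary centre $z_{0}$, under which the V-statistic estimate of the ES equals twice the V-statistic estimate of the squared MMD. The function names, the choice of $z_{0}$ and the normalization of the MMD are assumptions made for illustration and may differ from the conventions used elsewhere in the paper.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between the rows of a (n, d) and b (m, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def energy_v(x, y):
    """V-statistic estimate of the energy statistic (Euclidean norm)."""
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())


def distance_kernel(a, b, z0):
    """Distance-induced kernel centred at z0 (assumed form, see lead-in)."""
    da = np.linalg.norm(a - z0, axis=1)
    db = np.linalg.norm(b - z0, axis=1)
    return 0.5 * (da[:, None] + db[None, :] - pairwise_dist(a, b))


def mmd2_v(x, y, z0):
    """V-statistic estimate of the squared MMD under the kernel above."""
    return (distance_kernel(x, x, z0).mean()
            + distance_kernel(y, y, z0).mean()
            - 2.0 * distance_kernel(x, y, z0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x = rng.normal(size=(30, 3))
    y = rng.normal(loc=1.0, size=(40, 3))
    z0 = np.zeros(3)  # arbitrary centre; the identity holds for any z0
    print(energy_v(x, y), 2.0 * mmd2_v(x, y, z0))
```

The terms involving $z_{0}$ cancel in the MMD, which is why the two printed values agree up to floating-point error regardless of the centre chosen.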

As an alternative to choosing one of the assessed discrepancy measures, one may also consider some kind of averaging over the results of the different discrepancy measures. We have not committed to an investigation of such methodologies and leave it as a future research direction.

Running times (on a 3.1 GHz MacBook Pro) for the ES, KL, MMD and WA distance computations for $10^{5}$ ABC replications, in the four models considered in the simulations, for varying sample sizes $n$, and with $m=n$, are reported in Figure 6. ES is uniformly much faster than the other approaches for small sample sizes, up to $n=m=50$, where it performs as fast as KL. For sample sizes larger than $n=m=50$, KL is fastest. Overall, MMD and WA are slower than ES and KL.

6 Conclusion

We have introduced a novel importance-sampling ABC algorithm that is based on the so-called two-sample energy statistic. Along with other data discrepancy measures that view data sets as empirical measures, such as the Kullback–Leibler divergence, the Wasserstein distance and maximum mean discrepancies, our proposed approach bypasses the cumbersome use of summary statistics.

We have shown that the V-statistic estimator of the ES is consistent under mild moment conditions. Furthermore, we have established a new asymptotic result for cases when the observed and simulated sample sizes both increase to infinity, which shows a kind of consistency of the pseudo-posterior in the infinite data scenario. This is in concordance with previous results in such cases (see for instance Jiang et al., 2018; Bernton et al., 2019) and extends existing theory to the general IS-ABC framework. That is, we largely extend the main result of Jiang et al. (2018), regarding the large sample properties of the pseudo-posterior PDF, to the IS-ABC cases that are considered in Karabatsos and Leisen (2018) and Park et al. (2016). Thus, we provide further theoretical justification for the usage of such algorithms.

Illustrations of the proposed ES-ABC algorithm on four experimental models have shown that it performs comparably to alternative discrepancy measures.

Considering computing costs, the ES, KL, MMD, and WA estimators in univariate settings are all equal in terms of order of complexity, with a linearithmic computational time of $\mathcal{O}((n+m)\log(n+m))$ (see Huo and Székely, 2016, and Chaudhuri and Hu, 2019, regarding the complexity of the ES and MMD estimators). In multivariate settings, KL complexity is unchanged; ES and MMD have quadratic time $\mathcal{O}((n+m)^{2})$, while the Wasserstein distance has complexity $\mathcal{O}((n+m)^{5/2}\log(n+m))$. The latter can be reduced to quadratic complexity if one is targeting the swapping distance, an approximation of the actual Wasserstein distance (Bernton et al., 2019). We note that linear time estimators are also available for the MMD and the ES, if one is willing to forgo precision in the estimates (see Gretton et al., 2012). See Table 4 for a summary.
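As an illustration of why sorting suffices in the univariate case, the sketch below computes the V-statistic estimate of the ES in $\mathcal{O}((n+m)\log(n+m))$ time. It relies on the identity that, for univariate data, the energy statistic equals $2\int\{F_{n}(t)-G_{m}(t)\}^{2}\,\mathrm{d}t$, where $F_{n}$ and $G_{m}$ are the empirical CDFs of the two samples. This is one possible route, assumed here for illustration, and is not necessarily the algorithm of the works cited above; the function name `energy_statistic_1d` is hypothetical.

```python
import numpy as np


def energy_statistic_1d(x, y):
    """Univariate V-statistic energy statistic via sorted pooled data.

    Uses ES = 2 * integral of (F_n - G_m)^2, evaluated exactly because both
    empirical CDFs are piecewise constant between pooled order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    z = np.sort(np.concatenate([x, y]))  # pooled order statistics
    # Empirical CDF values on each interval [z_k, z_{k+1}).
    f = np.searchsorted(x, z[:-1], side="right") / x.size
    g = np.searchsorted(y, z[:-1], side="right") / y.size
    return 2.0 * np.sum((f - g) ** 2 * np.diff(z))


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    observed = rng.normal(size=1000)
    simulated = rng.normal(loc=0.2, size=1000)
    print(energy_statistic_1d(observed, simulated))
```

The dominant cost is the sorting step; the binary searches and the final sum are no more expensive, so no pairwise distance matrix is ever formed.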

Bibliography


Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press, Cambridge.
Barber, S., Voss, J., and Webster, M. (2015). The rate of convergence for approximate Bayesian computation. Electronic Journal of Statistics, 9(1):80–105.
Baringhaus, L. and Franz, C. (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis, 88:190–206.
Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 162:2025–2035.
Bernton, E., Jacob, P. E., Gerber, M., and Robert, C. P. (2019). Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81:235–269.
Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. (2013). FNN: Fast Nearest Neighbor Search Algorithms and Applications. R package version 1.1.3.
Biau, G., Cérou, F., and Guyader, A. (2015). New insights into approximate Bayesian computation. Annales de l’IHP Probabilités et statistiques, 51(1):376–403.
Blum, M. G. (2010a). Approximate Bayesian computation: a nonparametric perspective. Journal of the American Statistical Association, 105(491):1178–1187.
