Finding our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods

Evgeny Levi; Radu V. Craiu

arXiv:1905.06680·stat.CO·May 17, 2019

TL;DR

This paper introduces perturbed MCMC algorithms that recycle past samples to accelerate approximate Bayesian methods like ABC and BSL, making complex Bayesian analyses more computationally feasible.

Contribution

It presents a novel MCMC approach that enhances efficiency of ABC and BSL by leveraging sample recycling, supported by theoretical analysis and empirical validation.

Findings

1. Significant reduction in the number of simulations needed.
2. Maintains accuracy while improving computational speed.
3. Effective in complex Bayesian models.

Abstract

With larger data at their disposal, scientists are emboldened to tackle complex questions that require sophisticated statistical models. It is not unusual for the latter to have likelihood functions that elude analytical formulations. Even under such adversity, when one can simulate from the sampling distribution, Bayesian analysis can be conducted using approximate methods such as Approximate Bayesian Computation (ABC) or Bayesian Synthetic Likelihood (BSL). A significant drawback of these methods is that the number of required simulations can be prohibitively large, thus severely limiting their scope. In this paper we design perturbed MCMC samplers that can be used within the ABC and BSL paradigms to significantly accelerate computation while maintaining control on computational efficiency. The proposed strategy relies on recycling samples from the chain's past. The algorithmic design…

Tables

Table 1: Simulation results (MA model): average difference in mean (DIM), difference in covariance (DIC), total variation (TV), square roots of squared bias, variance and MSE, effective sample size (ESS) and effective sample size per CPU time (ESS/CPU), for every sampling algorithm.

	Diff with exact			Diff with true parameter			Efficiency
Sampler	DIM	DIC	TV	$\sqrt{{Bias}^{2}}$	$\sqrt{VAR}$	$\sqrt{MSE}$	ESS	ESS/CPU
SMC	0.082	0.0045	0.418	0.014	0.115	0.116	471	0.505
ABC-RW	0.088	0.0063	0.466	0.016	0.123	0.124	23	0.231
ABC-IS	0.084	0.0067	0.455	0.016	0.115	0.116	44	0.389
AABC-U	0.083	0.0071	0.444	0.018	0.116	0.117	3446	6.215
AABC-L	0.080	0.0067	0.438	0.017	0.112	0.113	2820	5.107
BSL-RW	0.082	0.0070	0.438	0.015	0.114	0.115	252	0.282
BSL-IS	0.081	0.0070	0.436	0.015	0.114	0.115	841	0.923
ABSL-U	0.081	0.0095	0.443	0.017	0.114	0.115	3950	5.584
ABSL-L	0.082	0.0078	0.441	0.015	0.114	0.115	4165	6.030

Table 2: Simulation results (Ricker’s model): average difference in mean (DIM), difference in covariance (DIC), total variation (TV), square roots of squared bias, variance and MSE, effective sample size (ESS) and effective sample size per CPU time (ESS/CPU), for every sampling algorithm.

	Diff with exact			Diff with true parameter			Efficiency
Sampler	DIM	DIC	TV	$\sqrt{{Bias}^{2}}$	$\sqrt{VAR}$	$\sqrt{MSE}$	ESS	ESS/CPU
SMC	0.152	0.0177	0.378	0.086	0.201	0.219	472	0.521
ABC-RW	0.135	0.0201	0.389	0.059	0.180	0.189	87	0.199
ABC-IS	0.139	0.0215	0.485	0.063	0.195	0.205	47	0.099
AABC-U	0.147	0.0279	0.402	0.076	0.190	0.204	3563	4.390
AABC-L	0.141	0.0258	0.392	0.070	0.189	0.201	4206	5.193
BSL-RW	0.129	0.0080	0.382	0.038	0.206	0.209	131	0.030
BSL-IS	0.122	0.0082	0.455	0.022	0.197	0.198	33	0.007
ABSL-U	0.103	0.0054	0.377	0.023	0.170	0.171	284	0.180
ABSL-L	0.106	0.0051	0.382	0.012	0.173	0.173	207	0.135

Table 3: Simulation results (SV model): average difference in mean (DIM), difference in covariance (DIC), total variation (TV), square roots of squared bias, variance and MSE, effective sample size (ESS) and effective sample size per CPU time (ESS/CPU), for every sampling algorithm.

	Diff with exact			Diff with true parameter			Efficiency
Sampler	DIM	DIC	TV	$\sqrt{{Bias}^{2}}$	$\sqrt{VAR}$	$\sqrt{MSE}$	ESS	ESS/CPU
SMC	0.232	0.0428	0.417	0.187	0.255	0.316	471	0.336
ABC-RW	0.210	0.0396	0.459	0.228	0.255	0.342	31	0.097
ABC-IS	0.179	0.0439	0.460	0.196	0.219	0.294	30	0.090
AABC-U	0.194	0.0447	0.424	0.212	0.217	0.304	1793	2.445
AABC-L	0.189	0.0441	0.420	0.211	0.235	0.316	1659	2.253
BSL-RW	0.200	0.0360	0.411	0.175	0.227	0.287	131	0.043
BSL-IS	0.195	0.0362	0.404	0.175	0.225	0.285	346	0.113
ABSL-U	0.229	0.0422	0.551	0.184	0.241	0.303	871	0.822
ABSL-L	0.231	0.0410	0.548	0.197	0.240	0.311	843	0.817

Table 4: Simulation results (SV $\alpha$-Stable model): average difference in mean (DIM), difference in covariance (DIC), total variation (TV), square roots of squared bias, variance and MSE, effective sample size (ESS) and effective sample size per CPU time (ESS/CPU), for every sampling algorithm. In DIM, DIC and TV, samplers are compared to SMC.

	Diff with SMC			Diff with true parameter			Efficiency
Sampler	DIM	DIC	TV	$\sqrt{{Bias}^{2}}$	$\sqrt{VAR}$	$\sqrt{MSE}$	ESS	ESS/CPU
SMC	0.000	0.0000	0.000	0.221	0.201	0.299	468	0.267
ABC-RW	0.078	0.0126	0.205	0.248	0.198	0.317	24	0.069
ABC-IS	0.082	0.0151	0.306	0.232	0.221	0.320	26	0.071
AABC-U	0.069	0.0124	0.170	0.250	0.183	0.310	1303	1.617
AABC-L	0.069	0.0132	0.161	0.246	0.181	0.305	1256	1.546
BSL-RW	0.044	0.0116	0.122	0.225	0.181	0.289	123	0.037
BSL-IS	0.045	0.0103	0.125	0.226	0.177	0.287	285	0.084
ABSL-U	0.063	0.0133	0.228	0.225	0.181	0.289	832	0.735
ABSL-L	0.061	0.0140	0.230	0.236	0.183	0.299	757	0.671

Table 5: Dow Jones log return stochastic volatility: 95% credible intervals and posterior averages for 4 parameters for two proposed samplers (AABC-U and ABSL-U).

	AABC-U			ABSL-U
Parameter	2.5% Quantile	Average	97.5% Quantile	2.5% Quantile	Average	97.5% Quantile
$θ_{1}$	0.787	0.899	0.990	0.775	0.856	0.959
$θ_{2}$	-0.411	-0.147	0.112	-0.369	-0.092	0.222
$θ_{3}$	-1.405	-0.790	-0.304	-1.858	-0.841	-0.206
$θ_{4}$	1.758	1.916	1.997	1.721	1.909	1.996

Equations

$$\pi({\theta}\mid{\mathbf{y}}_{0})=\frac{p({\theta})f({\mathbf{y}}_{0}\mid{\theta})}{\int_{{\mathbf{R}}^{q}}p({\theta})f({\mathbf{y}}_{0}\mid{\theta})\,d{\theta}}\propto p({\theta})f({\mathbf{y}}_{0}\mid{\theta}),$$

$X_{0} \sim$

$X_{i} \mid x_{i-1} \sim$

$Y_{i} \mid x_{i} \sim$

$$\lim_{{\epsilon}\downarrow 0}\pi_{\epsilon}({\theta}\mid S({\mathbf{y}}_{0}))=\pi({\theta}\mid S({\mathbf{y}}_{0})).$$

$$\pi_{\epsilon}({\theta},{\mathbf{y}}\mid{\mathbf{y}}_{0})\propto p({\theta})f({\mathbf{y}}\mid{\theta})1_{\{\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}\}},$$

$$\pi_{\epsilon}({\theta}\mid{\mathbf{y}}_{0})=\int\pi_{\epsilon}({\theta},{\mathbf{y}}\mid{\mathbf{y}}_{0})\,d{\mathbf{y}}\propto\int p({\theta})f({\mathbf{y}}\mid{\theta})1_{\{\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}\}}\,d{\mathbf{y}}=p({\theta})\mbox{Pr}(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}\mid{\theta}).$$

$$\pi_{\epsilon}({\theta}\mid{\mathbf{y}}_{0})\propto p({\theta})P(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}\mid{\theta}),$$

$$\hat{h}(\zeta^{*})=\frac{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})1_{\{\delta_{n}<{\epsilon}\}}}{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})},$$

$$\hat{h}(\zeta^{*})=\frac{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})1_{\{\tilde{\delta}_{n}<{\epsilon}\}}}{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})},$$

$$\tilde{h}(\zeta^{*})=\frac{1}{K}\sum_{j=1}^{K}1_{\{\tilde{\delta}_{j}<{\epsilon}\}},$$

$$\hat{h}(\zeta^{*})=\frac{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})1_{\{\tilde{\delta}_{n}<{\epsilon}\}}}{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})},$$

$$\hat{\mu}_{\theta}=\frac{\sum_{i=1}^{m}s_{i}}{m},\qquad\hat{\Sigma}_{\theta}=\frac{\sum_{i=1}^{m}(s_{i}-\hat{\mu}_{\theta})(s_{i}-\hat{\mu}_{\theta})^{T}}{m-1},$$

$$SL({\theta}\mid{\mathbf{y}}_{0})={\mathcal{N}}(S({\mathbf{y}}_{0});\hat{\mu}_{\theta},\hat{\Sigma}_{\theta}).$$

$$\pi({\theta}\mid s_{0})\propto p({\theta}){\mathcal{N}}(s_{0};\mu_{\theta},\Sigma_{\theta}).$$

$$\hat{\mu}_{\zeta}=\frac{\sum_{n=1}^{N}\left[W_{Nn}(\zeta)\sum_{j=1}^{m}\tilde{s}_{n}^{(j)}\right]}{m\sum_{n=1}^{N}W_{Nn}(\zeta)},\qquad\hat{\Sigma}_{\zeta}=\frac{\sum_{n=1}^{N}\left[W_{Nn}(\zeta)\sum_{j=1}^{m}(\tilde{s}_{n}^{(j)}-\hat{\mu}_{\zeta})(\tilde{s}_{n}^{(j)}-\hat{\mu}_{\zeta})^{T}\right]}{m\sum_{n=1}^{N}W_{Nn}(\zeta)}.$$

$\hat{\mu}_{\zeta^{*}}$

$\hat{\Sigma}_{\zeta^{*}}$

$\hat{\mu}_{\theta^{(t)}}$

$\hat{\Sigma}_{\theta^{(t)}}$

$$\begin{aligned}\mbox{Diff in mean (DIM)}&=\mathrm{Mean}_{r,s}\left(\left|\mathrm{Mean}_{t}(\theta_{rs}^{(t)})-\mathrm{Mean}_{t}(\tilde{\theta}_{rs}^{(t)})\right|\right),\\ \mbox{Diff in covariance (DIC)}&=\mathrm{Mean}_{r,s}\left(\left|\mathrm{Cov}_{t}(\theta_{rs}^{(t)})-\mathrm{Cov}_{t}(\tilde{\theta}_{rs}^{(t)})\right|\right),\\ \mbox{Total Variation (TV)}&=\mathrm{Mean}_{r,s}\left(0.5\int\left|D_{rs}(x)-\tilde{D}_{rs}(x)\right|dx\right),\\ \mbox{Bias}^{2}&=\mathrm{Mean}_{s}\left(\left(\mathrm{Mean}_{tr}(\theta_{rs}^{(t)})-\theta_{s}^{true}\right)^{2}\right),\\ \mbox{VAR}&=\mathrm{Mean}_{s}\left(\mathrm{Var}_{r}\left(\mathrm{Mean}_{t}(\theta_{rs}^{(t)})\right)\right),\\ \mbox{MSE}&=\mbox{Bias}^{2}+\mbox{VAR},\end{aligned}$$

$$\mbox{ACT}_{rs}=1+2\sum_{a=1}^{\infty}\rho_{a}(\theta_{rs}^{(t)}),$$

$$\mbox{ESS}=\mathrm{Mean}_{rs}\left((M-B)/\mbox{ACT}_{rs}\right),\qquad\mbox{ESS/CPU}=\mathrm{Mean}_{rs}\left((M-B)/\mbox{ACT}_{rs}/CPU_{r}\right).$$

$$\begin{aligned}&z_{i}\overset{iid}{\sim}{\mathcal{N}}(0,1);\quad i=\{-1,0,1,\dots,n\},\\ &y_{i}=z_{i}+\theta_{1}z_{i-1}+\theta_{2}z_{i-2};\quad i=\{1,\dots,n\}.\end{aligned}$$

$$\theta_{1}+\theta_{2}>-1,\quad\theta_{1}-\theta_{2}<1,\quad-2<\theta_{1}<2,\quad-1<\theta_{2}<2.$$

$$\begin{aligned}&x_{-49}=1;\quad z_{i}\overset{iid}{\sim}{\mathcal{N}}(0,\exp(\theta_{2})^{2});\quad i=\{-48,\dots,n\},\\ &x_{i}=\exp(\exp(\theta_{1}))\,x_{i-1}\exp(-x_{i-1}+z_{i});\quad i=\{-48,\dots,n\},\\ &y_{i}\sim\mathrm{Pois}(\exp(\theta_{3})x_{i});\quad i=\{-48,\dots,n\},\end{aligned}$$

$$\theta_{1}\sim{\mathcal{N}}(0,1),\quad\theta_{2}\sim\mathrm{Unif}(-2.3,0),\quad\theta_{3}\sim{\mathcal{N}}(0,4).$$

$$\begin{aligned}&x_{1}\sim{\mathcal{N}}(0,1/(1-\theta_{1}^{2}));\quad v_{i}\overset{iid}{\sim}{\mathcal{N}}(0,1);\quad w_{i}\overset{iid}{\sim}{\mathcal{N}}(0,1);\quad i=\{1,\dots,n\},\\ &x_{i}=\theta_{1}x_{i-1}+v_{i};\quad i=\{2,\dots,n\},\\ &y_{i}=\exp(\theta_{2}+\exp(\theta_{3})x_{i})\,w_{i};\quad i=\{1,\dots,n\}.\end{aligned}$$

$$\theta_{1}\sim\mathrm{Unif}(0,1),\quad\theta_{2}\sim{\mathcal{N}}(0,1),\quad\theta_{3}\sim{\mathcal{N}}(0,1).$$

$$\begin{aligned}&x_{1}\sim{\mathcal{N}}(0,1/(1-\theta_{1}^{2}));\quad v_{i}\overset{iid}{\sim}{\mathcal{N}}(0,1);\quad w_{i}\overset{iid}{\sim}\mathrm{Stab}(\theta_{4},-1);\quad i=\{1,\dots,n\},\\ &x_{i}=\theta_{1}x_{i-1}+v_{i};\quad i=\{2,\dots,n\},\\ &y_{i}=\exp(\theta_{2}+\exp(\theta_{3})x_{i})\,w_{i};\quad i=\{1,\dots,n\}.\end{aligned}$$

$$\theta_{1}\sim\mathrm{Unif}(0,1),\quad\theta_{2}\sim{\mathcal{N}}(0,1),\quad\theta_{3}\sim{\mathcal{N}}(0,1),\quad\theta_{4}\sim\mathrm{Unif}(1.5,2).$$

$$r_{i}=\log(P_{i})-\log(P_{i-1}),\quad i=2,\dots,n.$$

Full text

Finding our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods

Evgeny Levi

Radu V. Craiu Email: [email protected]

(Department of Statistical Sciences, University of Toronto)

Abstract

With larger amounts of data at their disposal, scientists are emboldened to tackle complex questions that require sophisticated statistical models. It is not unusual for the latter to have likelihood functions that elude analytical formulations. Even under such adversity, when one can simulate from the sampling distribution, Bayesian analysis can be conducted using approximate methods such as Approximate Bayesian Computation (ABC) or Bayesian Synthetic Likelihood (BSL). A significant drawback of these methods is that the number of required simulations can be prohibitively large, thus severely limiting their scope. In this paper we design perturbed MCMC samplers that can be used within the ABC and BSL paradigms to significantly accelerate computation while maintaining control on computational efficiency. The proposed strategy relies on recycling samples from the chain’s past. The algorithmic design is supported by a theoretical analysis while practical performance is examined via a series of simulation examples and data analyses.

Keywords: Approximate Bayesian Computation, Synthetic Likelihood, Perturbed MCMC, k-Nearest-Neighbor.

1 Introduction

Since the early 1990s Bayesian statisticians have been able to operate largely due to the rapid development of Markov chain Monte Carlo (MCMC) sampling methods (see, for example, Craiu and Rosenthal, 2014, for a recent review). Given observed data ${\mathbf{y}}_{0}\in{\mathcal{X}}^{n}$ with sampling density $f({\mathbf{y}}_{0}|{\theta})$ indexed by parameter $\theta\in{\mathbf{R}}^{q}$, Bayesian inference for functions of ${\theta}$ relies on the characteristics of the posterior distribution

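$$\pi({\theta}|{\mathbf{y}}_{0})=\frac{p({\theta})f({\mathbf{y}}_{0}|{\theta})}{\int_{{\mathbf{R}}^{q}}p({\theta})f({\mathbf{y}}_{0}|{\theta})\,d{\theta}}\propto p({\theta})f({\mathbf{y}}_{0}|{\theta}),\tag{1}$$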

where $p({\theta})$ denotes the prior distribution. When the posterior distribution in (1) cannot be studied analytically, we rely on MCMC algorithms to generate samples from $\pi$ . While traditional MCMC samplers such as Metropolis-Hastings or Hamiltonian MCMC (see Brooks et al., 2011, and references therein) can sample distributions with unknown normalizing constants, they rely on the closed form of the unnormalized posterior, $p({\theta})f({\mathbf{y}}_{0}|{\theta})$ .

The advent of large data has altered the framework we just described in multiple ways. First, larger data tend to yield likelihood functions that are much more expensive to compute, thus exposing the liability inherent in the iterative nature of MCMC samplers. In response to this challenge, new computational methods based on divide and conquer (Scott et al., 2016; Wang and Dunson, 2013; Entezari et al., 2018), subsampling (Bardenet et al., 2014; Quiroz et al., 2015), or sequential (Balakrishnan et al., 2006; Maclaurin and Adams, 2015) strategies have emerged. Second, it is understood that larger data should yield answers to more complex problems. This implies the use of increasingly complex models, to the point where the sampling distribution is no longer available in closed form.

In the absence of a tractable likelihood function, statisticians have developed approximate methods to perform Bayesian inference when, for any parameter value ${\theta}\in{\mathbf{R}}^{q}$, data ${\mathbf{y}}\sim f({\mathbf{y}}|{\theta})$ can be sampled from the model. Here we consider two alternative approaches that have been proposed and gained considerable momentum in recent years: Approximate Bayesian Computation (ABC) (Marin et al., 2012; Baragatti and Pudlo, 2014; Sisson et al., 2018a; Drovandi, 2018) and Bayesian Synthetic Likelihood (BSL) (Wood, 2010; Drovandi et al., 2018; Price et al., 2018). Both algorithms are effective when they are combined with Markov chain Monte Carlo sampling schemes to produce samples from an approximation of $\pi$, and both share the need for generating many pseudo-data sets ${\mathbf{y}}\sim f({\mathbf{y}}|{\theta})$. This comes with serious challenges when the data are large and generating a pseudo-data set is computationally expensive. In this paper we tackle the reduction of computational burden by recycling draws from the chain’s history. While this drastically reduces the computation time, it alters the transition kernel of the original MCMC chain. We demonstrate that we can control the approximating error introduced when perturbing the original kernel using some of the error analysis for perturbed Markov chains developed recently by Mitrophanov (2005), Johndrow et al. (2015b) and Johndrow and Mattingly (2017).

The paper is structured as follows. Section 2 briefly reviews the ABC method and Section 3 introduces the proposed MCMC algorithms for ABC. Section 4 reviews BSL sampling and extends the proposed methods to this class of approximations. The practical impact of these algorithms is evaluated via simulations in Section 5 and data analyses in Section 6. The theoretical analysis showing control of perturbation errors in total variation norm is in Section 7. The paper closes with conclusions and ideas for future work.

2 Approximate Bayesian Computation

In order to illustrate the ABC sampler, let us consider the following Hidden Markov Model (HMM)

[TABLE]

Unless Gaussian distributions are used to specify the transition and emission laws given in (2) and (3), respectively, the marginal distribution $P(y_{1},\cdots,y_{n}|{\theta})$ cannot be calculated in closed form. It is possible to treat the hidden random variables $X_{i}$ as auxiliary and sample them using Particle MCMC (PMCMC) (Andrieu et al., 2010) or ensemble MCMC (Shestopaloff and Neal, 2013). However, computations become increasingly difficult as $n$ increases. Moreover, for some financial time series models, such as Stochastic Volatility for log returns, the ${\alpha}$-Stable distribution may be useful to model transition and/or emission probabilities (Nolan, 2003). The challenge is that stable distributions do not have closed form densities, thus rendering particle and ensemble MCMC impossible to use. Other widely used examples where the likelihood functions cannot be expressed analytically include various network models (e.g., Kolaczyk and Csárdi, 2014) and Markov random fields (Rue and Held, 2005). For such models with intractable or computationally expensive likelihood evaluations, simulation based algorithms such as ABC are frequently used for inference. In its simplest form, ABC is an accept/reject sampler. Given a user-defined summary statistic $S({\mathbf{y}})\in{\mathbf{R}}^{p}$, the Accept/Reject ABC is described in Algorithm 1.

We emphasize that a closed form equation for the likelihood is not needed, only the ability to generate from $f({\mathbf{y}}|{\theta})$ for any ${\theta}$. If $S({\mathbf{y}})$ is a sufficient statistic and $\mbox{Pr}(S({\mathbf{y}})=S({\mathbf{y}}_{0}))>0$ then the algorithm yields samples from the true posterior $\pi({\theta}|{\mathbf{y}}_{0})$. Alas, the level of complexity of the models where ABC is needed makes it unlikely for these two conditions to hold. In order to implement ABC under more realistic assumptions, a (small) constant ${\epsilon}$ is chosen and ${\zeta}^{*}$ is accepted whenever $d(S({\mathbf{y}}),S({\mathbf{y}}_{0}))<{\epsilon}$, where $d(S({\mathbf{y}}),S({\mathbf{y}}_{0}))$ is a user-defined distance function. The introduction of ${\epsilon}>0$ and the use of non-sufficient statistics remove layers of exactness from the target distribution. The approximating distribution is denoted $\pi_{\epsilon}({\theta}|S({\mathbf{y}}_{0}))$ and we have

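$$\lim_{{\epsilon}\downarrow 0}\pi_{\epsilon}({\theta}|S({\mathbf{y}}_{0}))=\pi({\theta}|S({\mathbf{y}}_{0})).\tag{4}$$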

In light of (4) one would like to have $S({\mathbf{y}})={\mathbf{y}}$, but if the sample size of ${\mathbf{y}}_{0}$ is large, then the curse of dimensionality leads to $\mbox{Pr}(d({\mathbf{y}},{\mathbf{y}}_{0})<\epsilon)\approx 0$. Thus, obtaining even a moderate number of samples using ABC can be an unattainable goal in this case. In almost all cases of interest, $S$ is not a sufficient statistic, implying that some information about ${\theta}$ is lost. Not surprisingly, much attention has been focused on finding appropriate low-dimensional summary statistics for inference (see, for example, Robert et al., 2011; Fearnhead and Prangle, 2012; Marin et al., 2014; Prangle, 2015). In this paper we assume that the summary statistic $S({\mathbf{y}})$ is given.

In the absence of information about the model parameters, the prior and posterior distributions may be misaligned, having non-overlapping regions of mass concentration. Hence, parameter values that are drawn from the prior will rarely be retained, making the algorithm very inefficient.
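As an illustration of the accept/reject scheme with tolerance ${\epsilon}$ described above, the following is a minimal Python sketch, not the authors' implementation. The toy Gaussian model, the sample mean used as summary statistic, the prior, and all function and parameter names (`abc_rejection`, `prior_sample`, `simulate`, `summary`) are assumptions introduced here for illustration only.

```python
import numpy as np

def abc_rejection(y0, prior_sample, simulate, summary, eps, n_draws, rng):
    """Accept/reject ABC: keep prior draws whose simulated summary
    lies within eps of the observed summary."""
    s0 = summary(y0)
    accepted = []
    n_sim = 0  # number of pseudo-data sets generated
    while len(accepted) < n_draws:
        theta = prior_sample(rng)           # draw a candidate from the prior
        y = simulate(theta, rng)            # generate pseudo-data given theta
        n_sim += 1
        dist = np.linalg.norm(np.atleast_1d(summary(y) - s0))
        if dist < eps:                      # retain theta only if close enough
            accepted.append(theta)
    return np.array(accepted), n_sim

# Toy example (assumed): y_i ~ N(theta, 1), prior theta ~ N(0, 3^2), S(y) = mean(y).
rng = np.random.default_rng(0)
y0 = rng.normal(1.0, 1.0, size=50)
samples, n_sim = abc_rejection(
    y0,
    prior_sample=lambda r: r.normal(0.0, 3.0),
    simulate=lambda th, r: r.normal(th, 1.0, size=50),
    summary=np.mean,
    eps=0.1,
    n_draws=200,
    rng=rng,
)
print(f"accepted {len(samples)} draws out of {n_sim} simulations")
```

The count `n_sim` makes the inefficiency discussed above visible: when the prior is diffuse relative to the posterior, most simulated data sets are discarded.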

Algorithm 2 presents the ABC-MCMC algorithm of Marjoram et al. (2003) which avoids sampling from the prior and instead relies on building a chain with a Metropolis-Hastings (MH) transition kernel, with state space $\{({\theta},{\mathbf{y}})\in{\mathbf{R}}^{q}\times{\mathcal{X}}^{n}\}$ , proposal distribution $q({\zeta}|{\theta})\times f({\mathbf{y}}|{\zeta})$ and target distribution

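$$\pi_{\epsilon}({\theta},{\mathbf{y}}|{\mathbf{y}}_{0})\propto p({\theta})f({\mathbf{y}}|{\theta})1_{\{\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}\}},\tag{5}$$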

where $\delta({\mathbf{y}}_{0},{\mathbf{y}})=d(S({\mathbf{y}}),S({\mathbf{y}}_{0}))$ . Note that the goal is the marginal distribution for ${\theta}$ which is:

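$$\pi_{\epsilon}({\theta}|{\mathbf{y}}_{0})=\int\pi_{\epsilon}({\theta},{\mathbf{y}}|{\mathbf{y}}_{0})\,d{\mathbf{y}}\propto\int p({\theta})f({\mathbf{y}}|{\theta})1_{\{\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}\}}\,d{\mathbf{y}}=p({\theta})\mbox{Pr}(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}|{\theta}).\tag{6}$$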

There are a few alternatives to Algorithm 2. For instance, Lee et al. (2012) approximate $P(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}|{\theta})$ via one of its unbiased estimators, $J^{-1}\sum_{j=1}^{J}1_{\{\delta({\mathbf{y}}_{0},{\mathbf{y}}_{j})<{\epsilon}\}}$, where $J\geq 1$ and each ${\mathbf{y}}_{j}$ is simulated from $f({\mathbf{y}}|{\theta})$. The use of unbiased estimators for $P(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}|{\theta})$ when computing the MH acceptance ratio can be validated using the theory of pseudo-marginal MCMC samplers (Andrieu and Roberts, 2009). Clearly, when the probability $P(\delta<{\epsilon}|{\theta})$ is very small, this method would require simulating a large number of distances $\delta$ (or, equivalently, of data sets ${\mathbf{y}}$) in order to move to a new state. Other MCMC designs suitable for ABC can be found in Bornn et al. (2014).
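The ABC-MCMC idea of Marjoram et al. (2003) can be sketched as follows. With a symmetric proposal, the MH acceptance probability reduces to $\min\{1,p({\zeta})/p({\theta})\}\,1_{\{\delta<{\epsilon}\}}$. This is a hedged illustration on an assumed toy Gaussian model with a scalar parameter; the random-walk step size, the prior, the starting value and the names `abc_mcmc`, `log_prior`, `simulate` and `summary` are hypothetical choices made here, not taken from the paper.

```python
import numpy as np

def abc_mcmc(y0, log_prior, simulate, summary, eps, theta0, n_iter, step, rng):
    """ABC-MCMC with a symmetric random-walk proposal (scalar theta)."""
    s0 = summary(y0)
    theta = theta0
    chain = np.empty(n_iter)
    n_acc = 0
    for t in range(n_iter):
        zeta = theta + step * rng.normal()   # propose zeta ~ q(.|theta)
        y = simulate(zeta, rng)              # pseudo-data given the proposal
        delta = abs(summary(y) - s0)         # distance between summaries
        # Symmetric proposal: the MH ratio is the prior ratio times the indicator.
        log_alpha = log_prior(zeta) - log_prior(theta)
        if delta < eps and np.log(rng.uniform()) < log_alpha:
            theta = zeta
            n_acc += 1
        chain[t] = theta
    return chain, n_acc / n_iter

# Toy example (assumed): y_i ~ N(theta, 1), prior theta ~ N(0, 3^2), S(y) = mean(y).
rng = np.random.default_rng(1)
y0 = rng.normal(1.0, 1.0, size=50)
chain, acc_rate = abc_mcmc(
    y0,
    log_prior=lambda th: -0.5 * (th / 3.0) ** 2,
    simulate=lambda th, r: r.normal(th, 1.0, size=50),
    summary=np.mean,
    eps=0.1,
    theta0=float(np.mean(y0)),   # assumed starting value near the data
    n_iter=5000,
    step=0.3,
    rng=rng,
)
print(f"acceptance rate: {acc_rate:.2f}")
```

Every iteration costs one fresh pseudo-data set, and any proposal with $\delta\geq{\epsilon}$ is rejected outright, which is the cost the rest of the paper aims to reduce.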

Sequential Monte Carlo (SMC) samplers have also been successfully used for ABC (henceforth denoted ABC-SMC) (Sisson et al., 2007; Lee, 2012; Filippi et al., 2013). ABC-SMC requires a specified decreasing sequence ${\epsilon}_{0}>\cdots>{\epsilon}_{J}$. Lee’s method (Lee, 2012) uses the Particle MCMC design (Andrieu et al., 2010) in which samples are updated as the target distribution evolves with ${\epsilon}$. More specifically, it starts by sampling ${\theta}_{0}^{(1)},\ldots,{\theta}_{0}^{(M)}$ from $\pi_{{\epsilon}_{0}}({\theta}|{\mathbf{y}}_{0})$ using Accept-Reject ABC. Subsequently, at time $t+1$ all samples are sequentially updated so their distribution is $\pi_{{\epsilon}_{t+1}}({\theta}|{\mathbf{y}}_{0})$ (see Lee, 2012, for a complete description). The advantage of this method is not only that it starts from large ${\epsilon}$, but also that it generates independent draws. A comprehensive coverage of computational techniques for ABC can be found in Sisson et al. (2018b) and references therein. We also note a general lack of guidelines concerning the selection of ${\epsilon}$, which is unfortunate as the performance of ABC sampling depends heavily on its value. To make a fair comparison between different methods, we revise the ABC-MCMC algorithm by introducing a decreasing sequence ${\epsilon}_{0}>\cdots>{\epsilon}_{J}$ ($J$ is the number of "steps"), similar to ABC-SMC, and by "learning" the transition kernel during burn-in, as in Algorithm 3.

Since the choice of proposal distribution $q(\cdot|{\theta})$ can considerably influence the performance of ABC-MCMC, we consider finite adaptation during the burn-in period of length $B$. In addition, during burn-in the ${\epsilon}$ also varies, starting with a higher value (which makes it easier to find the initial ${\theta}^{(0)}$ value) and gradually decreasing in accordance with a pre-determined scheme. In our implementations we use independent MH sampling or random walk Metropolis (RWM). In the former case, the proposal is Gaussian ${\mathcal{N}}(\cdot|\tilde{\mu},\tilde{\Sigma})$ with $c=3$. The RWM proposal is ${\mathcal{N}}(\cdot|{\theta}^{(t-1)},\tilde{\Sigma})$ with $c=2.38^{2}/q$ (Roberts et al., 1997, 2001).

All the algorithms discussed so far rely on numerous generations of pseudo-data. Since the latter can be computationally costly, proposals for reducing the simulation cost are made in Wilkinson (2014) and Järvenpää et al. (2018). The approaches are based on learning the dependence between $\delta$ and ${\theta}$ and, from it, establishing directly whether a proposal ${\theta}$ should be accepted or not. Flexible regression models are used to model these unknown functional relationships. The overall performance depends on the signal to noise ratio and on the model’s performance in capturing patterns that can be highly complex.

To accelerate ABC-MCMC we consider a different approach and propose to store and utilize past simulations (with appropriate weights) in order to speed up the calculation while keeping under control the resulting approximating errors. The objective is to approximate $P(\delta<{\epsilon}|{\zeta}^{*})$ for any ${\zeta}^{*}$ at every MCMC iteration using past simulated $({\zeta},\delta)$ proposals, making the whole procedure computationally faster. The changes proposed perturb the chain’s transition kernel and we rely on the theory developed by Mitrophanov (2005) and Johndrow et al. (2015a) to assess the approximating error for the posterior. The k-Nearest-Neighbor (kNN) method is used to integrate past observations into the transition kernel. The main advantage of kNN is that it is uniformly strongly consistent which guarantees that for a large enough chain history, we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC as shown in Section 7.

3 Approximated ABC-MCMC (AABC-MCMC)

In this section we describe an ABC-MCMC algorithm that utilizes past simulations to significantly improve computational efficiency. As noted previously, the ABC-MCMC with threshold ${\epsilon}$ targets the density

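$$\pi_{\epsilon}({\theta}|{\mathbf{y}}_{0})\propto p({\theta})P(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}|{\theta}),\tag{7}$$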

where $\delta({\mathbf{y}}_{0},{\mathbf{y}})=d(S({\mathbf{y}}),S({\mathbf{y}}_{0}))$ with ${\mathbf{y}}\sim f({\mathbf{y}}|{\theta})$ and ${\theta}\in{\Theta}$. Denote $h({\theta}):=P(\delta({\mathbf{y}}_{0},{\mathbf{y}})<{\epsilon}|{\theta})$ and note that if $h$ were known for every ${\theta}$ then we could run an MH-MCMC chain with invariant target density proportional to $p({\theta})h({\theta})$. Alas, $h$ is almost always unknown and unbiased estimates can be computationally expensive or statistically inefficient. We build an alternative approach based on consistent estimates of $h$ that draw on the chain’s past history, are much cheaper to compute, and require a new theoretical treatment.

To fix ideas, suppose that at time $t$ we generate the proposal $(\zeta_{t+1},{\mathbf{w}}_{t+1})\sim q(\zeta|\theta^{(t)})f({\mathbf{w}}|\zeta)$ and suppose that at iteration $N$, all the proposals $\zeta_{n}$, regardless of whether they were accepted or rejected, along with the corresponding distances $\delta_{n}=\delta({\mathbf{w}}_{n},{\mathbf{y}}_{0})$ are available for $0\leq n\leq N-1$. This past history is stored in the set ${\mathcal{Z}}_{N-1}=\{\zeta_{n},\delta_{n}\}_{n=1}^{N-1}$. Given a new proposal $\zeta^{*}\sim q(\cdot|{\theta}^{(t)})$, we generate ${\mathbf{w}}^{*}\sim f(\cdot|\zeta^{*})$ and compute $\delta^{*}=d(S({\mathbf{w}}^{*}),S({\mathbf{y}}_{0}))$. Set $\zeta_{N}=\zeta^{*}$, ${\mathbf{w}}_{N}={\mathbf{w}}^{*}$, $\delta_{N}=\delta^{*}$, ${\mathcal{Z}}_{N}={\mathcal{Z}}_{N-1}\cup\{(\zeta_{N},\delta_{N})\}$ and estimate $h(\zeta^{*})$ using

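$$\hat{h}(\zeta^{*})=\frac{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})1_{\{\delta_{n}<{\epsilon}\}}}{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})},\tag{8}$$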

where $W_{Nn}(\zeta^{*})=W(\|\zeta_{n}-\zeta^{*}\|)$ are weights and $W:{\mathbf{R}}\rightarrow[0,\infty)$ is a decreasing function. We discuss a couple of choices for the function $W(\cdot)$ below.

Remark 1: Note that if some of the past proposals have been accepted, then the Markovian property of the chain is violated since the acceptance probability does not depend solely on the current state, but also on the past ones. We defer the theoretical considerations for dealing with adaptation in the context of perturbed Markov chains to a future communication. Below, we modify slightly the construction above while respecting the core idea.

In order to separate the samples used as proposals from those used to estimate $h$ in (8), we will generate at each time $t$ two independent samples $\zeta_{t+1}\sim q(\zeta|\theta^{(t)})$ and $(\tilde{\zeta}_{t+1},\tilde{\mathbf{w}}_{t+1})$ from $q(\zeta|\theta^{(t)})f({\mathbf{w}}|\zeta)$ . Then, the history ${\mathcal{Z}}$ collects the $(\tilde{\zeta},\tilde{\delta})$ samples while the proposal used to update the chain is the $\zeta$ sample. With this notation (8) becomes

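$$\hat{h}(\zeta^{*})=\frac{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})1_{\{\tilde{\delta}_{n}<{\epsilon}\}}}{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})},\tag{9}$$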

where $\tilde{\delta}_{n}=\delta(\tilde{\mathbf{w}}_{n},{\mathbf{y}}_{0})$ and $W_{Nn}(\zeta^{*})=W(\|\tilde{\zeta}_{n}-\zeta^{*}\|)$.

Remark 2: Even if $\delta^{*}$ is greater than ${\epsilon}$ (which would automatically trigger rejection in ABC-MCMC), suppose there is a close neighbour of $\zeta^{*}$ whose corresponding $\delta$ is less than ${\epsilon}$. Then the estimated $h({\zeta}^{*})$ will not be zero and there is a chance of moving to a different state. Intuitively, this is expected to reduce the variance of the acceptance probability estimate.

Remark 3: When comparing the unbiased estimator

$$\tilde{h}(\zeta^{*})=\frac{1}{K}\sum_{j=1}^{K}1_{\{\tilde{\delta}_{j}<{\epsilon}\}},\tag{10}$$

with the consistent estimator

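$$\hat{h}(\zeta^{*})=\frac{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})1_{\{\tilde{\delta}_{n}<{\epsilon}\}}}{\sum_{n=1}^{N}W_{Nn}(\zeta^{*})},\tag{11}$$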

we hope to outperform both the small and large $K$ cases in (10). For the small $K$ , we expect to reduce the variability in our acceptance probabilities, while for larger $K$ we expect to reduce the computational costs without sacrificing much in terms of precision.

Since the proposed weighted estimate is no longer an unbiased estimator of $h({\theta})$, a new theoretical evaluation is needed to study the effect of perturbing the transition kernel on the statistical analysis. Central to the algorithm’s utility is the ability to control the total variation distance between the desired distribution of interest given in (7) and the modified chain’s target. As will be shown in Section 7, we rely on three assumptions to ensure that the chain approximately samples from (7): 1) compactness of ${\Theta}$; 2) uniform ergodicity of the chain using the true $h$; and 3) uniform convergence in probability of $\hat{h}({\theta})$ to $h({\theta})$ as $N\to\infty$.

The k-Nearest-Neighbor (kNN) regression approach (Fix and Hodges, 1951; Biau and Devroye, 2015) is uniformly consistent (Cheng, 1984). Define $K=g(N)$ (in our numerical experiments we have used $g(\cdot)=\sqrt{(\cdot)}$). Without loss of generality we relabel the elements of ${\mathcal{Z}}_{N}=\{\tilde{\zeta}_{n},\tilde{\delta}_{n}\}_{n=1}^{N}$ according to the distance $\|\tilde{\zeta}_{n}-\zeta^{*}\|$ so that $(\tilde{\zeta}_{1},\tilde{\delta}_{1})$ and $(\tilde{\zeta}_{N},\tilde{\delta}_{N})$ correspond to the smallest and largest among all distances $\{\|\tilde{\zeta}_{j}-\zeta^{*}\|:\;1\leq j\leq N\}$, respectively. The kNN method sets $W_{Nn}({\zeta}^{*})$ to zero for all $n>K$. For $n\leq K$, we focus on the following two weighting schemes:

(U) The uniform kNN with $W_{Nn}({\zeta}^{*})=1$ for all $n\leq K$ ;

(L) The linear kNN with $W_{Nn}(\zeta^{*})=W(\|\tilde{\zeta}_{n}-\zeta^{*}\|)=1-\|\tilde{\zeta}_{n}-\zeta^{*}\|/\|\tilde{\zeta}_{K}-\zeta^{*}\|$ for $n\leq K$ so that the weight decreases from $1$ to $0$ as $n$ increases from $1$ to $K$ .
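As an illustration, the two weighting schemes can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function names `knn_weights` and `h_hat` are hypothetical, the estimate of $h$ is taken to be the normalised weighted average of the indicators ${\mathbf{1}}_{\{\tilde{\delta}_{n}<{\epsilon}\}}$ , and the fallback used when all linear weights vanish (e.g. $K=1$ ) is an assumption made only to keep the code well defined.

```python
import numpy as np

def knn_weights(history_zeta, zeta_star, scheme="U"):
    """kNN weights W_Nn(zeta_star); history_zeta has shape (N, d)."""
    history_zeta = np.asarray(history_zeta, dtype=float)
    n_hist = history_zeta.shape[0]
    k = max(1, int(np.sqrt(n_hist)))            # K = g(N) with g = sqrt
    dist = np.linalg.norm(history_zeta - zeta_star, axis=1)
    nearest = np.argsort(dist)[:k]              # indices of the K nearest points
    w = np.zeros(n_hist)                        # weights are zero for n > K
    if scheme == "U":
        w[nearest] = 1.0
    elif scheme == "L":
        d_k = dist[nearest[-1]]                 # distance to the K-th neighbour
        if d_k > 0:
            w[nearest] = 1.0 - dist[nearest] / d_k
        else:                                   # assumption: all K points coincide
            w[nearest] = 1.0
    else:
        raise ValueError("scheme must be 'U' or 'L'")
    return w

def h_hat(history_zeta, history_delta, zeta_star, eps, scheme="U"):
    """Weighted average of the indicators 1{delta_n < eps} near zeta_star."""
    w = knn_weights(history_zeta, zeta_star, scheme)
    if w.sum() == 0.0:                          # assumption: fall back to uniform
        w = knn_weights(history_zeta, zeta_star, "U")
    indicators = (np.asarray(history_delta) < eps).astype(float)
    return float(w @ indicators / w.sum())
```

With linear weights the $K$ -th neighbour always receives weight zero, so in effect only $K-1$ points contribute to the estimate.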

The kNN’s theoretical properties that are used to validate our sampler rely on independence between the pairs $\{\tilde{\zeta}_{n},\tilde{\delta}_{n}\}_{n\geq 1}$ . Therefore, throughout the paper, we use an independent proposal in the MH sampler, i.e. $q(\cdot|{\theta}^{(t)})=q(\cdot)$ and $q$ is Gaussian. The entire procedure is outlined in Algorithm 4.

To conclude, at the end of a simulation of size $M$ the MCMC samples are $\{{\theta}^{(1)},\ldots,{\theta}^{(M)}\}$ and the history used for updating the chain is $\{(\tilde{\zeta}_{1},\tilde{\delta}_{1}),\ldots,(\tilde{\zeta}_{M},\tilde{\delta}_{M})\}$ . The two sequences are independent of one another, i.e. for any $N>0$ , the elements in ${\mathcal{Z}}_{N}$ are independent of the chain’s history up to time $N$ .

Note also that $h(\theta^{(t)})$ is required in order to determine the acceptance probability at step $t+1$ . In this case the $h$ -value may be updated if $\|{\theta}^{(t)}-\tilde{\zeta}^{*}\|$ is small enough.

In the next section we extend the approximate MCMC construction to Bayesian Synthetic Likelihood. In Sections 5 and 6 we use numerical experiments to show that the proposed procedure generally improves the mixing of a chain.

4 BSL and Approximated BSL (ABSL)

An alternative approach to bypass the intractability of the sampling distribution is proposed by Wood (2010). His approach is based on the assumption that the conditional distribution for a user-defined statistic $S({\mathbf{y}})$ given ${\theta}$ is Gaussian with mean $\mu_{{\theta}}$ and covariance matrix $\Sigma_{{\theta}}$ . The Synthetic Likelihood (SL) procedure assigns to each ${\theta}$ the likelihood $SL({\theta})={\cal{N}}(s_{0};\mu_{{\theta}},\Sigma_{{\theta}})$ , where $s_{0}=S({\mathbf{y}}_{0})$ and ${\cal{N}}(x;\mu,\Sigma)$ denotes the density of a normal with mean $\mu$ and covariance $\Sigma$ . SL can be used for maximum likelihood estimation as in Wood (2010) or within the Bayesian paradigm as proposed by Drovandi et al. (2018) and Price et al. (2018). The latter work proposes to sample the approximate posterior generated by the Bayesian Synthetic Likelihood (BSL) approach, $\pi({\theta}|s_{0})\propto p({\theta}){\mathcal{N}}(s_{0};\mu_{{\theta}},\Sigma_{{\theta}})$ , using a MH sampler. Direct calculation of the acceptance probability is not possible because the conditional mean and covariance are unknown for any $\theta$ . However, both can be estimated based on $m$ statistics $(s_{1},\cdots,s_{m})$ sampled from their conditional distribution given ${\theta}$ . More precisely, after simulating ${\mathbf{y}}_{i}\sim f({\mathbf{y}}|{\theta})$ and setting $s_{i}=S({\mathbf{y}}_{i})$ , $i=1,\cdots,m$ , one can estimate

[TABLE]

so that the synthetic likelihood is

[TABLE]

The pseudo-code in Algorithm 5 shows the steps involved in the BSL-MCMC sampler. Since each MH step requires calculating the likelihood ratio between two SLs evaluated at different parameter values, one can anticipate the heavy computational load involved in running the chain for thousands of iterations, especially if sampling data ${\mathbf{y}}$ is expensive. Note that even though these estimates for the conditional mean and covariance are unbiased, the estimated value of the Gaussian likelihood is biased and therefore pseudo-marginal MCMC theory is not applicable. Price et al. (2018) presented an unbiased Gaussian likelihood estimator and have shown empirically that using biased and unbiased estimates generally leads to similar performance. They have also remarked that this procedure is very robust to the number of simulations $m$ , and demonstrated empirically that using $m=50$ to $200$ produces similar results.
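The plug-in synthetic likelihood described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `synthetic_loglik` is hypothetical, the covariance uses the usual $1/(m-1)$ normalisation (consistent with the estimates being described as unbiased), and the density is evaluated on the log scale for numerical stability.

```python
import numpy as np

def synthetic_loglik(s_obs, sim_stats):
    """Log of N(s_obs; mu_hat, Sigma_hat) from m simulated statistics.

    sim_stats has shape (m, d): one row per simulated summary statistic.
    """
    sim_stats = np.asarray(sim_stats, dtype=float)
    m, d = sim_stats.shape
    mu_hat = sim_stats.mean(axis=0)                        # sample mean
    sigma_hat = np.atleast_2d(np.cov(sim_stats, rowvar=False))  # 1/(m-1) scaling
    sign, logdet = np.linalg.slogdet(sigma_hat)
    if sign <= 0:                                          # singular estimate
        return -np.inf
    diff = np.asarray(s_obs, dtype=float) - mu_hat
    quad = diff @ np.linalg.solve(sigma_hat, diff)         # Mahalanobis term
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
```

In an MH step, the difference of two such log-likelihoods (at the proposed and at the current parameter value) enters the log acceptance ratio, which is why each iteration costs $m$ fresh simulations in standard BSL.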

The normality assumption for summary statistics is certainly a strong assumption which may not hold in practice. Following up on this, An et al. (2018) relaxed the jointly Gaussian assumption to Gaussian copula with non-parametric marginal distribution estimates (NONPAR-BSL), which includes joint Gaussian as a special case, but is much more flexible. The estimation is based, as in the BSL framework, on $m$ pseudo-data samples simulated for each ${\theta}$ .

Clearly, BSL is computationally costly and requires many pseudo-data simulations to obtain Monte Carlo samples of even moderate sizes. To accelerate BSL-MCMC we propose to store and utilize past simulations of $({\zeta},s)$ to approximate $\mu_{\zeta^{*}},\Sigma_{\zeta^{*}}$ for any $\zeta^{*}\in{\Theta}$ , making the whole procedure computationally faster. As in the previous section, we separate the simulation used to update the chain from the simulation used to enrich the history of the chain. The approach can trivially be extended to NONPAR-BSL but we do not pursue it further here. The k-Nearest-Neighbor (kNN) method is used as a non-parametric estimation tool for the quantities described above. As will be shown in Section 7, with the proposed method we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.

Approximated Bayesian Synthetic Likelihood (ABSL)

Setting $s_{0}=S({\mathbf{y}}_{0})$ and assuming conditional normality for this statistic, the objective is to sample from

[TABLE]

During the MCMC run, the proposal $\zeta^{*}$ is generated from $q(\cdot)$ and the history ${\mathcal{Z}}_{N}$ is enriched using $\tilde{\zeta}^{*}\sim q(\cdot)$ , $\{\tilde{\mathbf{y}}^{*(j)}\}_{j=1}^{m}\stackrel{{\scriptstyle iid}}{{\sim}}f({\mathbf{y}}|\tilde{\zeta}^{*})$ and $\{\tilde{s}^{*(j)}=S(\tilde{\mathbf{y}}^{*(j)})\}_{j=1}^{m}$ . Then for any $\zeta$ , the conditional mean and covariance of the statistics vector are estimated using past samples as weighted averages:

[TABLE]

Again, the weights are functions of the distance between the proposed value and the parameter values from the past, $W_{Nn}(\zeta)=W(\|\zeta-\tilde{\zeta}_{n}\|)$ , where $\|\cdot\|$ is the Euclidean norm. To get appropriate convergence properties we use the kNN approach to calculate the weights $W_{Nn}$ , where only the $K=\sqrt{N}$ closest values to $\zeta$ are used in the calculation of conditional means and covariances. As in the previous section, uniform (U) and linear (L) weights are used. Once again we expect that the use of the chain's cumulated history can significantly speed up the whole procedure since it relieves the pressure to simulate many data sets ${\mathbf{y}}$ at every step. The use of the independent Metropolis kernel ensures that ${\mathcal{Z}}_{N}$ contains independent draws, which is required for the theoretical validation in Section 7. We will also show that under mild assumptions and if ${\Theta}$ is compact, the proposed algorithm exhibits good error control properties. In order to get a rough idea about the proposal, we perform finite adaptation with $J$ adaptation points during the burn-in period. Algorithm 6 outlines the proposed Approximated BSL (ABSL) method. For the simulations we report on in the next section, we have used $c=1.5$ and $J=15$ to be consistent with the AABC-MCMC, ABC-MCMC-M and ABC-SMC procedures.
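To make the weighted-average idea concrete, the following sketch computes a weighted mean and covariance of the stored statistics given a vector of weights $W_{Nn}(\zeta)$ . It is only one plausible form: the function name `absl_moments` is hypothetical, and the normalisation by $m\sum_{n}W_{Nn}(\zeta)$ for both moments is an assumption rather than a transcription of the estimator used in the paper.

```python
import numpy as np

def absl_moments(weights, hist_stats):
    """Weighted mean and covariance of stored summary statistics.

    weights    : shape (N,), kNN weights W_Nn(zeta), zero outside the K neighbours
    hist_stats : shape (N, m, d), the m statistics stored with each history point
    """
    w = np.asarray(weights, dtype=float)
    s = np.asarray(hist_stats, dtype=float)
    n_hist, m, d = s.shape
    w_norm = w / (m * w.sum())                  # assumed normalisation
    mu = np.einsum("n,nmd->d", w_norm, s)       # weighted mean over n and j
    dev = s - mu                                # deviations from the pooled mean
    sigma = np.einsum("n,nmi,nmj->ij", w_norm, dev, dev)
    return mu, sigma
```

Because the moments are pooled over roughly $K=\sqrt{N}$ history points, no new pseudo-data need to be simulated at the proposed value itself.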

5 Simulations

We analyze the following statistical models:

(MA2) Simple Moving Average model of lag 2;

(R) Ricker’s model;

(SVG) Stochastic volatility with Gaussian emission noise;

(SVS) Stochastic volatility with ${\alpha}$ -Stable errors.

For all these models, the simulation of pseudo data for any parameter is simple and computationally fast, but the use of standard estimation methods can be quite challenging, especially for (R), (SVG) and (SVS). For ABC samplers, before running an MCMC chain we estimate the initial and final thresholds ${\epsilon}_{0}$ and ${\epsilon}_{15}$ (15 equal steps in log scale were used for all models) and the matrix $A$ which is used to calculate the discrepancy $\delta=d(S({\mathbf{y}}),S({\mathbf{y}}_{0}))=(S({\mathbf{y}})-S({\mathbf{y}}_{0}))^{T}A(S({\mathbf{y}})-S({\mathbf{y}}_{0}))$ .

To estimate $A$ , we use the following steps:

• Set $A=\mathbf{I}_{d}$ .

• Repeat steps I and II below $J$ times ( $J=3$ in our implementations):

I. Generate 500 pairs $\{{\zeta}_{i},{\mathbf{y}}_{i}\}_{i=1}^{500}$ from $p({\zeta})f({\mathbf{y}}|{\zeta})$ and calculate discrepancies $\{{\zeta}_{i},\delta_{i}\}_{i=1}^{500}$ with $\delta_{i}=d(S({\mathbf{y}}_{i}),S({\mathbf{y}}_{0}))$ .

II. Let ${\zeta}^{*}$ be the ${\zeta}_{i}$ with the smallest discrepancy. Then generate 100 pseudo-data sets $({\mathbf{y}}_{1},\ldots,{\mathbf{y}}_{100})$ from $f({\mathbf{y}}|{\zeta}^{*})$ , compute the corresponding summary statistics $(s_{1},\ldots,s_{100})$ and set $A$ to be the inverse of the covariance matrix of $(s_{1},\ldots,s_{100})$ .
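The steps above can be sketched as follows. The function `estimate_A` and its arguments (`prior_sampler`, `simulate`, `summary`) are hypothetical names introduced for illustration, and the Gaussian location model at the bottom is a toy stand-in, not one of the models analyzed in this paper.

```python
import numpy as np

def estimate_A(prior_sampler, simulate, summary, s_obs, rng,
               n_rounds=3, n_pairs=500, n_rep=100):
    """Iteratively estimate the weighting matrix A of the discrepancy."""
    A = np.eye(len(s_obs))                       # start from the identity
    for _ in range(n_rounds):
        # Step I: draw (zeta_i, y_i) from p(zeta) f(y | zeta), compute delta_i
        zetas = [prior_sampler(rng) for _ in range(n_pairs)]
        deltas = []
        for zeta in zetas:
            diff = summary(simulate(zeta, rng)) - s_obs
            deltas.append(diff @ A @ diff)
        # Step II: re-simulate at the best zeta and invert the covariance
        zeta_best = zetas[int(np.argmin(deltas))]
        stats = np.array([summary(simulate(zeta_best, rng))
                          for _ in range(n_rep)])
        A = np.linalg.inv(np.atleast_2d(np.cov(stats, rowvar=False)))
    return A

# Toy example (assumption): y ~ N(theta, 1) with n = 50, prior theta ~ N(0, 3^2)
rng = np.random.default_rng(0)
prior_sampler = lambda rng: rng.normal(0.0, 3.0)
simulate = lambda theta, rng: rng.normal(theta, 1.0, size=50)
summary = lambda y: np.array([y.mean(), y.var()])
s_obs = summary(simulate(1.0, rng))
A = estimate_A(prior_sampler, simulate, summary, s_obs, rng)
```

Re-estimating $A$ over several rounds matters because the first round uses the identity matrix, which ignores the scale and correlation of the summary statistics.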

We set ${\epsilon}_{0}$ to be the 5% quantile of the observed discrepancies. The final ${\epsilon}_{15}$ is obtained by implementing a Random Walk version of Algorithm 3 and decreasing ${\epsilon}_{0}$ gradually by setting ${\epsilon}_{j}$ as the 1% quantile of the discrepancies $\delta$ corresponding to accepted samples generated between adaptation points $a_{j-1}$ and $a_{j}$ , for $2\leq j\leq 15$ .

The number of simulations was set to 500 and 100 just for computational convenience and is not driven by any theoretical arguments.

We compare the following algorithms:

(SMC) Standard Sequential Monte Carlo for ABC;

(ABC-RW) The modified ABC-MCMC algorithm which updates ${\epsilon}$ and the random walk Metropolis transition kernel during burn-in;

(ABC-IS) The modified ABC-MCMC algorithm which updates ${\epsilon}$ and the Independent Metropolis transition kernel during burn-in;

(BSL-RW) Modified BSL which adapts the random walk Metropolis transition kernel during burn-in;

(BSL-IS) Modified BSL which adapts the independent Metropolis transition kernel during burn-in;

(AABC-U) Approximated ABC-MCMC with independent proposals and uniform (U) weights;

(AABC-L) Approximated ABC-MCMC with independent proposals and linear (L) weights;

(ABSL-U) Approximated BSL-MCMC with independent proposals and uniform (U) weights;

(ABSL-L) Approximated BSL-MCMC with independent proposals and linear (L) weights;

(Exact) Likelihood is computable and posterior samples are generated using an MCMC algorithm that is example-specific.

For SMC, 500 particles were used. The total number of iterations for ABC-RW, ABC-IS, AABC-U, AABC-L, ABSL-U and ABSL-L is 50000, with 10000 for burn-in. Since BSL-RW and BSL-IS are much more computationally expensive, their total number of iterations was fixed at 10000, with 2000 for burn-in and 50 pseudo-data simulations for every proposed parameter value (i.e. $m=50$ ). The Exact chain was run for 5000 iterations, with 2000 for burn-in. It must be pointed out that all approximate samplers are based on the same summary statistics, the same discrepancy function and the same ${\epsilon}$ sequence, so that they all start with the same initial conditions.

For more reliable results we compare these sampling algorithms under data set replications. In this study we set the number of replicates to $R=100$ , so that for each model 100 data sets were generated and each one was analyzed with the sampling methods described above. Various statistics and measures of efficiency were calculated for every model and data set. Let ${\theta}_{rs}^{(t)}$ represent the posterior samples from replicate $r=1,\cdots,R$ , iteration $t=1,\cdots,M$ and parameter component $s=1,\cdots,q$ , and similarly let $\tilde{\theta}_{rs}^{(t)}$ denote the posterior samples from an exact chain (all draws are after the burn-in period). We let ${\theta}^{true}_{s}$ denote the true parameter that generated the data. Moreover, let $D_{rs}(x)$ and $\tilde{D}_{rs}(x)$ be the estimated density functions at replicate $r=1,\cdots,R$ and component $s=1,\cdots,q$ for the approximate and exact chains, respectively. Then the following quantities are defined:

[TABLE]

where $Mean_{t}(a_{st})$ is defined as the average of $\{a_{st}\}$ over the index $t$ and, in a similar manner, $Var_{t}(a_{st})$ and $Cov_{t}(a_{st})$ represent the variance and covariance, respectively. The first three measures are useful in determining how close the posterior draws from different samplers are to the draws generated by the exact chain (when it is available). On the other hand, the last three are standard quantities that measure how close, in mean square, the posterior means are to the true parameters that generated the data. To study the efficiency of the proposed algorithms we need to take into account the CPU time that it takes to run a chain as well as its auto-correlation properties. Define the auto-correlation time (ACT) for every parameter component and replicate of samples ${\theta}_{rs}^{(t)}$ as:

[TABLE]

where $\rho_{a}$ is the auto-correlation coefficient at lag $a$ . In practice we sum all the lags up to the first negative correlation. Letting $M-B$ be the number of chain iterations (after burn-in) and $CPU_{r}$ the total CPU time needed to run the whole chain during replicate $r$ , we define the Effective Sample Size (ESS) and the Effective Sample Size per CPU (ESS/CPU) as:

[TABLE]

Note that these indicators are averaged over parameter components and replicates. Intuitively, ESS can be thought of as the approximate number of "independent" samples out of $M-B$ : the higher the ESS, the more efficient the sampling algorithm. When ESS is combined with CPU time (ESS/CPU), it provides a powerful indicator of the efficiency of an MCMC sampler. Generally, the sampler with the highest ESS/CPU is preferred as it produces a larger number of "independent" draws per unit time.
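For a single chain and a single parameter component, these efficiency measures can be sketched as below. The code assumes the common definition ACT $=1+2\sum_{a\geq 1}\rho_{a}$ , with the sum truncated at the first negative auto-correlation as described above, and ESS $=(M-B)/$ ACT; the function names are hypothetical and the averaging over components and replicates is omitted.

```python
import numpy as np

def autocorr_time(chain):
    """ACT = 1 + 2 * sum of autocorrelations up to the first negative one."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = x.size
    var = x @ x / n
    if var == 0:
        raise ValueError("constant chain: autocorrelation is undefined")
    act = 1.0
    for lag in range(1, n):
        rho = (x[:-lag] @ x[lag:]) / (n * var)   # lag-a autocorrelation
        if rho < 0:                              # stop at first negative value
            break
        act += 2.0 * rho
    return act

def effective_sample_size(chain):
    """ESS for post burn-in draws of one parameter component."""
    return len(chain) / autocorr_time(chain)

def ess_per_cpu(chain, cpu_seconds):
    """ESS divided by the CPU time needed to produce the whole chain."""
    return effective_sample_size(chain) / cpu_seconds
```

For example, a chain with ACT close to 20 and 40000 post burn-in draws yields an ESS of about 2000, regardless of how long the chain took to run; dividing by CPU time is what makes samplers with different per-iteration costs comparable.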

5.1 Moving Average Model

A popular toy example used to check the performance of ABC and BSL techniques is the MA2 model:

[TABLE]

The data are represented by the sequence ${\mathbf{y}}=\{y_{1},\cdots,y_{n}\}$ . It is well known that $Y_{i}$ follow a stationary distribution for any ${\theta}_{1},{\theta}_{2}$ , but there are conditions required for identifiability. Hence, we impose uniform prior on the following set:

[TABLE]

It is very easy to see that the joint distribution of ${\mathbf{y}}$ is multivariate Gaussian with mean 0, diagonal variances $1+{\theta}_{1}^{2}+{\theta}_{2}^{2}$ , covariance at lags 1 and 2, ${\theta}_{1}+{\theta}_{1}{\theta}_{2}$ and ${\theta}_{2}$ respectively and zero at other lags. In this case, (Exact) sampling is feasible. For simulations we set $\{{\theta}_{1}=0.6,{\theta}_{2}=0.6\}$ , $n=200$ and define summary statistics $S({\mathbf{y}})=(\hat{\gamma}_{0}({\mathbf{y}}),\hat{\gamma}_{1}({\mathbf{y}}),\hat{\gamma}_{2}({\mathbf{y}}))$ as sample variance and covariances at lags 1 and 2. First we show results based on one replicate. Figure 1 shows the trace plots, histograms and auto-correlation functions estimated from posterior draws for parameters $\theta_{1}$ and $\theta_{2}$ for the AABC-U sampler. Note that only post burn-in samples are shown.
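A pseudo-data generator and the three summary statistics for this example can be sketched as follows. The sketch assumes the standard MA2 form $y_{i}=z_{i}+{\theta}_{1}z_{i-1}+{\theta}_{2}z_{i-2}$ with i.i.d. standard normal $z_{i}$ , which matches the variance and lagged covariances stated above; the mean-centring and the $1/n$ normalisation of the sample auto-covariances, as well as the function names, are assumptions.

```python
import numpy as np

def simulate_ma2(theta1, theta2, n, rng):
    """Simulate n observations y_i = z_i + theta1*z_{i-1} + theta2*z_{i-2}."""
    z = rng.standard_normal(n + 2)               # two extra innovations for the lags
    return z[2:] + theta1 * z[1:-1] + theta2 * z[:-2]

def ma2_summary(y):
    """Sample variance and auto-covariances at lags 1 and 2."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    n = y.size
    gamma0 = yc @ yc / n
    gamma1 = yc[1:] @ yc[:-1] / n
    gamma2 = yc[2:] @ yc[:-2] / n
    return np.array([gamma0, gamma1, gamma2])

# Setting used in the text: theta = (0.6, 0.6), n = 200
rng = np.random.default_rng(0)
y0 = simulate_ma2(0.6, 0.6, 200, rng)
s0 = ma2_summary(y0)
```

At ${\theta}_{1}={\theta}_{2}=0.6$ the population values of the three statistics are $1.72$ , $0.96$ and $0.6$ , which gives a quick sanity check for a long simulated series.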

Similarly, Figure 2 and Figure 3 display the behaviour of ABSL-U sampler and standard ABC-RW, respectively. From these plots it is apparent that the proposed AABC-U and ABSL-U have much better mixing than ABC-RW. In the interest of keeping the paper length within reasonable limits, we briefly mention that additional simulations suggest that AABC-L is similar to AABC-U and ABSL-L to ABSL-U, while ABC-IS is outperformed by ABC-RW.

In order to summarize and compare the information in the MCMC draws produced by the approximated samplers and the exact chain, we plot the estimated densities in Figure 4. The left and right side plots refer to $\theta_{1}$ and $\theta_{2}$ , respectively. The two upper plots compare the estimated density of the exact MCMC sampler with ABC-based ones (SMC, ABC-RW and AABC-U), while the two lower plots compare the exact sampler with Synthetic Likelihood based methods (BSL-IS and ABSL-U).

The posterior distributions estimated from AABC-U are very similar to those produced by SMC and ABC-RW, but all are distinct from the Exact one. This latter difference may be due to the loss of information incurred when the posterior is conditional on a non-sufficient statistic. Similarly, the distribution produced by the ABSL-U draws is very close to that of BSL-IS. These observations hold for both components, ${\theta}_{1}$ and ${\theta}_{2}$ .

To study accuracy, precision and efficiency of proposed samplers we perform a simulation study where 100 data sets are generated and all samplers are run for every data set. The results are summarized in Table 1.

Examining this table we immediately note that the ESS/CPU measure is much larger for the proposed algorithms than for the standard methods. The improvement is very substantial; for example, ESS/CPU for AABC-U is 12 times larger than for the best standard ABC procedures like SMC. Similar results are shown for Bayesian Synthetic Likelihood. We also examine the DIM, DIC, TV and MSE quantities that provide information about the proximity of the approximate samples to the exact MCMC ones. For all these quantities, the smaller the value the better the sampler. We see that all these measures for AABC-U and AABC-L are very similar to those of SMC, ABC-RW and ABC-IS, and frequently better. The same holds for the BSL approach. Another observation is that the approximated algorithms with uniform and linear weights generally perform very similarly.

5.2 Ricker’s Model

Ricker’s model is analyzed very frequently to test Synthetic Likelihood procedures (Wood, 2010; Price et al., 2018). It is a particular instance of a hidden Markov model:

[TABLE]

where $Pois(\lambda)$ is the Poisson distribution with mean parameter $\lambda$ and $n=100$ . Only the sequence ${\mathbf{y}}=(y_{1},\cdots,y_{n})$ is observed, the first 50 values being ignored. All parameters ${\theta}=({\theta}_{1},{\theta}_{2},{\theta}_{3})$ are unrestricted in the model; the prior, with independent components, is given by:

[TABLE]

We restrict the range of ${\theta}_{2}$ as all algorithms become unstable for ${\theta}_{2}$ outside this interval. Note that the marginal distribution of ${\mathbf{y}}$ is not available in closed form, but the transition distribution of the hidden variables $X_{i}|x_{i-1}$ and the emission probabilities $Y_{i}|x_{i}$ are known, and hence we can run Particle MCMC (PMCMC) (Andrieu et al., 2010) or Ensemble MCMC (Shestopaloff and Neal, 2013) to sample from the posterior distribution $\pi({\theta}|{\mathbf{y}}_{0})$ . Here we use Particle MCMC with 100 particles. As suggested in Wood (2010) we set ${\theta}_{0}=(\log(3.8),0.9,2.3)$ and define the summary statistic $S({\mathbf{y}})$ as the 14-dimensional vector whose components are:

(C1) # $\{i:y_{i}=0\}$ ,

(C2) Average of ${\mathbf{y}}$ , $\bar{y}$ ,

(C3-C7) Sample auto-correlations at lags 1 through 5,

(C8-C11) Coefficients $\beta_{0},\beta_{1},\beta_{2},\beta_{3}$ of the cubic regression $(y_{i}-y_{i-1})=\beta_{0}+\beta_{1}y_{i}+\beta_{2}y_{i}^{2}+\beta_{3}y_{i}^{3}+{\epsilon}_{i}$ , $i=2,\ldots,n$ ,

(C12-C14) Coefficients $\beta_{0},\beta_{1},\beta_{2}$ of the quadratic regression $y_{i}^{0.3}=\beta_{0}+\beta_{1}y_{i-1}^{0.3}+\beta_{2}y_{i-1}^{0.6}+{\epsilon}_{i}$ , $i=2,\ldots,n$ .

Figures 5, 6 and 7 show trace-plots, histograms and ACF function for AABC-U, ABSL-U and ABC-RW samplers for each component (red lines correspond to the true parameter).

We show here ABC-RW instead of ABC-IS because the latter exhibits poorer performance. The main observation is that the mixing of AABC-U is much better than that of ABC-RW, with smaller auto-correlation values. ABSL-U has higher auto-correlations than AABC-U but still performs quite well. To see how close the draws from the simulation-based algorithms are to the draws from the Exact chain, we plot the estimated approximate posterior marginal densities in Figure 8. The upper plots (from left to right, one for each parameter component) compare the estimated density of the exact PMCMC sampler (with 100 particles) with the ABC-based ones (SMC, ABC-RW and AABC-U), while the lower plots compare the Exact sampler with the Synthetic Likelihood based methods (BSL-RW and ABSL-U).

Note that ABC-based samplers (SMC, ABC-RW and AABC-U) have very similar estimated densities. The densities of Synthetic Likelihood methods are also similar. For the second component there is a large difference between exact and approximate posteriors which may be caused by the loss of information induced by the choice of summary statistics.

A more general study, where results are averaged over 100 independent replicates, is shown in Table 2.

Again, the proposed strategies clearly outperform the standard ones in terms of overall efficiency (ESS/CPU). For instance, AABC-U is about 10 times more efficient than standard SMC and ABSL-U is 6 times more efficient than BSL-RW. At the same time, DIM, DIC, TV and MSE are generally smaller for the proposed methods than for the standard ones.

5.3 Stochastic Volatility with Gaussian emissions

When analyzing stationary time series, it is frequently observed that there are periods of high and periods of low volatility. This phenomenon is called volatility clustering; see for example Lux and Marchesi (2000). One way to model such behaviour is through a Stochastic Volatility (SV) model, where the variances of the observed time series depend on hidden states that themselves form a stationary time series. Consider the following model which depends on three parameters $({\theta}_{1},{\theta}_{2},{\theta}_{3})$ :

[TABLE]

Only ${\mathbf{y}}=(y_{1},\cdots,y_{n})$ is observed while $(x_{1},\cdots,x_{n})$ are hidden states. The parameter ${\theta}_{1}\in(-1,1)$ controls the auto-correlation of hidden states, while ${\theta}_{2}$ and ${\theta}_{3}$ are unrestricted and relate to the hidden states influence on the variability of the observed series. Given a hidden state, the distribution of the observed variable is normal which may not be appropriate in some examples. We introduce the following priors, independently for each parameter:

[TABLE]

We set the true parameters to $({\theta}_{1}=0.95,{\theta}_{2}=-2,{\theta}_{3}=-1)$ and length of the time series $n=500$ . We use Particle MCMC (PMCMC) as the Exact sampling scheme. Since pseudo-data sets can be easily generated for every parameter value, the SV is a good example to demonstrate the performances of the generative algorithms considered here. For summary statistics we use a 7-dimensional vector whose components are:

(C1) # $\{i:y_{i}^{2}>\mbox{quantile}({\mathbf{y}}_{0}^{2},0.99)\}$ ,

(C2) Average of ${\mathbf{y}}^{2}$ ,

(C3) Standard deviation of ${\mathbf{y}}^{2}$ ,

(C4) Sum of the first 5 auto-correlations of ${\mathbf{y}}^{2}$ ,

(C5) Sum of the first 5 auto-correlations of $\{{\mathbf{1}}_{\{y_{i}^{2}<\mbox{quantile}({\mathbf{y}}^{2},0.1)\}}\}_{i=1}^{n}$ ,

(C6) Sum of the first 5 auto-correlations of $\{{\mathbf{1}}_{\{y_{i}^{2}<\mbox{quantile}({\mathbf{y}}^{2},0.5)\}}\}_{i=1}^{n}$ ,

(C7) Sum of the first 5 auto-correlations of $\{{\mathbf{1}}_{\{y_{i}^{2}<\mbox{quantile}({\mathbf{y}}^{2},0.9)\}}\}_{i=1}^{n}$ .

Here $\mbox{quantile}({\mathbf{y}},\tau)$ is defined as $\tau$ -quantile of the sequence ${\mathbf{y}}$ . As was shown in Schmitt et al. (2015) and Dette et al. (2015) the auto-correlation of indicators (under different quantiles) can be very useful in characterizing a time series and that is why we have added (C5),(C6) and (C7) to the summary statistic. We focus here on ${\mathbf{y}}^{2}$ and its auto-correlations since model parameters only affect variability of ${\mathbf{y}}$ (auto-correlation of ${\mathbf{y}}$ is zero for any lag). Figures 9, 10 and 11 show trace-plots, histograms and ACF function for AABC-U, ABSL-U and ABC-RW samplers respectively for each component (red lines correspond to the true parameter).
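The 7-dimensional summary statistic (C1)-(C7) can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, sample auto-correlations are mean-centred and normalised by the lag-0 sum of squares, the standard deviation uses the $n-1$ denominator, and quantiles follow NumPy's default linear interpolation.

```python
import numpy as np

def sum_first_acf(x, n_lags=5):
    """Sum of the sample auto-correlations of x at lags 1..n_lags."""
    xc = np.asarray(x, dtype=float)
    xc = xc - xc.mean()
    denom = xc @ xc
    if denom == 0:                               # constant series (assumption: 0)
        return 0.0
    return float(sum((xc[k:] @ xc[:-k]) / denom for k in range(1, n_lags + 1)))

def sv_summary(y, y_obs):
    """Summary statistics (C1)-(C7) for pseudo-data y given observed data y_obs."""
    y2 = np.asarray(y, dtype=float) ** 2
    y2_obs = np.asarray(y_obs, dtype=float) ** 2
    c1 = np.sum(y2 > np.quantile(y2_obs, 0.99))  # exceedances of observed quantile
    c2 = y2.mean()
    c3 = y2.std(ddof=1)
    c4 = sum_first_acf(y2)
    # (C5)-(C7): auto-correlations of quantile-indicator series of y^2
    c567 = [sum_first_acf((y2 < np.quantile(y2, tau)).astype(float))
            for tau in (0.1, 0.5, 0.9)]
    return np.array([c1, c2, c3, c4, *c567], dtype=float)
```

Note that (C1) uses the quantile of the observed series ${\mathbf{y}}_{0}^{2}$ , whereas (C5)-(C7) use quantiles of the series being summarised, exactly as in the list above.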

The major observation is that AABC-U and ABSL-U are less sluggish than ABC-RW, exhibiting smaller auto-correlation values.

In Figure 12 we compare the sample-based kernel smoothing posterior marginal density estimates for Exact, SMC, ABC-RW and AABC-U (top row) as well as Exact, BSL-IS and ABSL-U (bottom row).

We note that all samples obtained from the approximate algorithms are compared to the exact posterior (produced using PMCMC with 100 particles). Generally, all ABC-based samplers perform similarly. On the other hand, ABSL-U performs worse than the generic BSL-IS in this run, as it is shifted away from the exact posterior for ${\theta}_{1}$ and ${\theta}_{3}$ .

To get more general conclusions we show average results in Table 3 over 100 data replicates.

Again we note that the proposed algorithms outperform the benchmark samplers by a factor of 8 in ESS/CPU. Moreover, AABC-U and AABC-L have very similar or smaller values for DIM, TV and MSE, which demonstrates that these samplers are much more efficient than the standard methods and at the same time produce parameter estimates that are as accurate as (or more accurate than) those of the generic algorithms.

ABSL-U and ABSL-L, on the other hand, did not perform well for this model: TV and MSE for these samplers are 10% larger than for the generic ones.

5.4 Stochastic Volatility with ${\alpha}$ -Stable errors

As was pointed out in the previous sub-section, the standard SV model assumes that the conditional distribution of the observed variables is Gaussian. Frequently, in financial time series, a large sudden drop occurs, thus raising serious doubts about the latter assumption. Often, it is suggested to use heavy tailed distributions (instead of the Gaussian) to model financial data. We consider a family of distributions named ${\alpha}$ -Stable, denoted $Stab({\alpha},\beta)$ , with two parameters ${\alpha}\in(0,2]$ (stability parameter) and $\beta\in[-1,1]$ (skew parameter). Two special cases are ${\alpha}=1$ and ${\alpha}=2$ , which correspond to the Cauchy and Gaussian distributions, respectively. Note that for ${\alpha}<2$ the distribution does not have a finite variance. We define the following SV model with ${\alpha}$ -Stable errors with parameter ${\mathbf{{\theta}}}=({\theta}_{1},{\theta}_{2},{\theta}_{3},{\theta}_{4})^{T}\in{\mathbf{R}}^{4}$ :

[TABLE]

This model is very similar to the simple SV model, the only difference being that the emission errors follow an ${\alpha}$ -Stable distribution with unknown stability parameter and fixed skew of $-1$ . We generally prefer a negatively skewed emission distribution to model large negative financial returns. As in the previous simulation example, ${\theta}_{2}$ and ${\theta}_{3}$ are unrestricted. The prior distribution for this model is (independently for each parameter):

[TABLE]

We set the true parameters to ${\theta}_{1}=0.95,{\theta}_{2}=-2,{\theta}_{3}=-1,{\theta}_{4}=1.8$ and the length of the time series to $n=500$ . The major challenge with this model is that there are no closed-form densities for ${\alpha}$ -Stable distributions. Hence, most MCMC samplers, including PMCMC and ensemble MCMC, cannot be used to sample from the posterior. However, sampling from this family of distributions is feasible, which makes the model particularly amenable to simulation-based methods like ABC and BSL. For summary statistics we use a 7-dimensional vector whose components are:

(C1) # $\{i:y_{i}^{2}>\mbox{quantile}({\mathbf{y}}_{0}^{2},0.99)\}$ ,

(C2) Average of ${\mathbf{y}}^{2}$ ,

(C3) Standard deviation of ${\mathbf{y}}^{2}$ ,

(C4) Sum of the first 5 auto-correlations of ${\mathbf{y}}^{2}$ ,

(C5) Sum of the first 5 auto-correlations of $\{{\mathbf{1}}_{\{y_{i}^{2}<\mbox{quantile}({\mathbf{y}}^{2},0.1)\}}\}_{i=1}^{n}$ ,

(C6) Sum of the first 5 auto-correlations of $\{{\mathbf{1}}_{\{y_{i}^{2}<\mbox{quantile}({\mathbf{y}}^{2},0.5)\}}\}_{i=1}^{n}$ ,

(C7) Sum of the first 5 auto-correlations of $\{{\mathbf{1}}_{\{y_{i}^{2}<\mbox{quantile}({\mathbf{y}}^{2},0.9)\}}\}_{i=1}^{n}$ .

Figures 13, 14 and 15 show trace-plots, histograms and ACF function for AABC-U, ABSL-U and ABC-RW samplers respectively for each component (red lines correspond to the true parameters).

As in previous examples, the mixing of AABC-U and ABSL-U is much better than that of ABC-RW. Since exact sampling is not feasible in this example, we compare the samplers to SMC (instead of exact samples). The estimated densities are plotted in Figure 16. Here we have chosen BSL-IS over BSL-RW because it has better general performance in this model.

Generally all simulation-based samplers have similar densities in this example.

For more general conclusions we show average results in Table 4 over 100 data replicates. Here to calculate DIM, DIC and TV, samplers are compared to SMC since exact draws cannot be obtained.

As in previous examples, the ESS/CPU values for AABC-U, AABC-L, ABSL-U and ABSL-L are roughly 8 times larger than those of the benchmark algorithms. For this example, looking at DIM, DIC and TV may be misleading since the approximated samplers are compared to another approximate sampler. Much more informative is the MSE measure, which is very similar across ABC-based and BSL-based algorithms. Therefore we can conclude that the proposed samplers perform very well in this example.

6 Data Analysis

As a real-world example we consider Dow-Jones index daily log returns from January 1, 2010 until December 31, 2018. The data were downloaded from the Yahoo Finance website (https://ca.finance.yahoo.com/). Given a time series of prices $P_{i}$ , $i=1,\cdots,n$ , log returns are calculated in the following way:

[TABLE]

The resulting time series is of length 2262. To make the log returns more suitable for analysis, we center $r_{t}$ by subtracting its mean and then multiply each return by 200, so that the absolute values are not too small. Figure 17 shows the transformed returns.
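The transformation described above amounts to a few lines of code. The sketch below assumes the usual definition of a log return, $r_{i}=\log(P_{i}/P_{i-1})$ ; the function name `transformed_log_returns` is hypothetical, and loading the actual price series is left out.

```python
import numpy as np

def transformed_log_returns(prices, scale=200.0):
    """Centred log returns multiplied by a constant (200 in the text)."""
    p = np.asarray(prices, dtype=float)
    r = np.diff(np.log(p))            # r_i = log(P_i) - log(P_{i-1})
    return scale * (r - r.mean())     # subtract the mean, then rescale
```

A series of $n$ prices yields $n-1$ returns, and the output has sample mean zero by construction.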

This time series ( ${\mathbf{y}}_{0}$ ) has mean zero by construction, and its auto-correlations and partial auto-correlations are insignificant for any lag. However, it is obvious that the variances are correlated and there are alternating periods of low and high variability. This prompts us to use the Stochastic Volatility model with ${\alpha}$ -Stable errors described in the previous section. Since the likelihood is not available in closed form for this class of models, simulation-based methods are probably the only available tools for inference. The evolution of the time series is described by equation (23) and the prior for the parameters is set as in equation (24). The skew parameter of the Stable distribution is fixed at $-1$ . To estimate the posterior distribution we run the AABC-U and ABSL-U samplers. The summary statistic for both methods is the same 7-dimensional vector defined in Section 5.4. Each chain was run for 100 thousand iterations, with the last 80 thousand used for inference. Figures 18 and 19 show trace-plots and histograms for the AABC-U and ABSL-U samplers, respectively, for each parameter.

The conclusions are in agreement with the ones suggested by the simulation study. The mixing of AABC-U is generally better than of ABSL-U. However, posterior draws of ABSL-U for the first 3 components are uni-modal, symmetric and bell-shaped, which is not surprising since the use of Gaussian priors within the BSL method yields Gaussian posteriors due to conjugacy. Table 5 reports posterior mean and 95% credible intervals for every parameter and for both samplers.

AABC-U and ABSL-U produce similar results. We see that the estimated correlation between adjacent variables in the hidden layer is about $0.9$ and the estimate of ${\alpha}$ -Stable emission noise is $1.91$ . This model can produce more extreme values than those predicted by one with standard Gaussian noise.

7 Theoretical Justifications

In this section we show that the novel approximated ABC-MCMC and BSL samplers with independent proposals exhibit ergodic properties in the long run. In other words, we want to show that as the number of MCMC iterations increases, the marginal distribution of $\{{\theta}^{(t)}\}$ converges to the appropriate posterior distribution in total variation and sample averages converge to the true expectations.

We start by reviewing our notation. Let $p({\theta}),q({\theta})$ represent the prior and proposal distributions for ${\theta}\in{\Theta}$ respectively. For AABC we define a function $h({\theta})$ as $P(\delta<{\epsilon}|{\theta})$ where $\delta=\delta({\mathbf{y}},{\mathbf{y}}_{0})$ and ${\mathbf{y}}\sim f({\mathbf{y}}|{\theta})$ . Then given a proposed ${\zeta}^{*}$ the acceptance probability is:

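$$a({\theta},{\zeta}^{*})=\min\left(1,\frac{p({\zeta}^{*})h({\zeta}^{*})q({\theta})}{p({\theta})h({\theta})q({\zeta}^{*})}\right).$$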

This MH procedure defines an exact transition kernel which we call $P(\cdot,\cdot)$ . Since $h({\theta})$ is not available in closed form, we estimate it using a k-nearest-neighbor (kNN) approach.

Let ${\mathcal{Z}_{N}}=\{\tilde{\zeta}_{n},{\mathbf{1}}_{\{\tilde{\delta}_{n}<{\epsilon}\}}\}_{n=1}^{N}$ represent $N$ independent samples from $q({\zeta})P({\mathbf{1}}_{\{\delta<{\epsilon}\}}|{\zeta})$ for AABC. In practice, ${\mathcal{Z}_{N}}$ contains the past generated samples that were saved before the $N$ th iteration. Given ${\theta}$ and ${\zeta}^{*}$ we apply kNN to approximate $h({\theta})$ and $h({\zeta}^{*})$ by calculating locally weighted averages of ${\mathbf{1}}_{\{\tilde{\delta}_{n}<{\epsilon}\}}$ over those $\tilde{\zeta}_{n}$ that are close to ${\theta}$ or ${\zeta}^{*}$ . We denote such an estimate by $\hat{h}({\theta};{\mathcal{Z}_{N}})$ , and the probability of accepting a proposal in this perturbed algorithm (more on perturbed MCMC can be found in Roberts et al. (1998); Pillai and Smith (2014); Johndrow and Mattingly (2017)) is:

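$$\hat{a}({\theta},{\zeta}^{*};{\mathcal{Z}_{N}})=\min\left(1,\frac{p({\zeta}^{*})\hat{h}({\zeta}^{*};{\mathcal{Z}_{N}})q({\theta})}{p({\theta})\hat{h}({\theta};{\mathcal{Z}_{N}})q({\zeta}^{*})}\right).$$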
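The kNN step and the perturbed acceptance probability can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes uniform weights, Euclidean distance, $K=\sqrt{N}$ by default, and an independent proposal; the function names `knn_estimate` and `perturbed_acceptance` are hypothetical, and the handling of a zero denominator is a choice made only for this sketch.

```python
import numpy as np

def knn_estimate(theta, past_zeta, past_indicator, k=None):
    """Uniform-weight kNN estimate of h(theta) = P(delta < eps | theta)."""
    past_zeta = np.asarray(past_zeta, dtype=float).reshape(len(past_zeta), -1)
    past_indicator = np.asarray(past_indicator, dtype=float)
    n = past_zeta.shape[0]
    if k is None:
        k = max(1, int(np.sqrt(n)))  # K(N) = sqrt(N), assumed default
    point = np.atleast_1d(np.asarray(theta, dtype=float))
    dist = np.linalg.norm(past_zeta - point, axis=1)
    nearest = np.argsort(dist)[:k]  # indices of the k closest past draws
    return past_indicator[nearest].mean()

def perturbed_acceptance(theta, zeta_star, prior_pdf, proposal_pdf,
                         past_zeta, past_indicator, k=None):
    """Acceptance probability with h replaced by its kNN estimate."""
    h_cur = knn_estimate(theta, past_zeta, past_indicator, k)
    h_new = knn_estimate(zeta_star, past_zeta, past_indicator, k)
    numer = prior_pdf(zeta_star) * h_new * proposal_pdf(theta)
    denom = prior_pdf(theta) * h_cur * proposal_pdf(zeta_star)
    if denom == 0.0:
        # Sketch-only convention: move whenever the proposal has positive weight.
        return 1.0 if numer > 0.0 else 0.0
    return min(1.0, numer / denom)

# Toy history: four past parameter draws and their accept indicators.
zeta_hist = [0.0, 0.1, 0.2, 1.0]
ind_hist = [1, 1, 0, 0]
flat = lambda x: 1.0  # flat prior and proposal densities, for illustration only
acc = perturbed_acceptance(0.05, 0.15, flat, flat, zeta_hist, ind_hist, k=2)
```

Because the estimate only reads the stored pairs, evaluating the acceptance probability requires no new simulation from the model.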

The approximate transition kernel is $\hat{P}_{N}(\cdot,\cdot)=E_{{\mathcal{Z}_{N}}}\left[\hat{P}_{N}(\cdot,\cdot;{\mathcal{Z}_{N}})\right]$ . The goal is to show that, as $N\to\infty$ , the distance between this kernel and the exact one converges to zero, where the distance is defined as:

$$\|\hat{P}_{N}-P\|=\sup_{{\theta}\in{\Theta}}\|\hat{P}_{N}({\theta},\cdot)-P({\theta},\cdot)\|_{TV},$$

where the last distance is the "total variation" distance between two measures. First we show that, under a strong consistency assumption on $\hat{h}({\theta};{\mathcal{Z}_{N}})$ , the perturbed kernel converges to the exact one.

Theorem 7.1.

Suppose ${\Theta}$ is compact, $\sup_{{\theta}}\|\hat{h}({\theta};{\mathcal{Z}_{N}})-h({\theta})\|\to 0$ with probability 1 and $h({\theta})>0$ for all ${\theta}\in{\Theta}$ . Then for any ${\epsilon}>0$ there exists $C$ such that for all $N>C$ , $\|\hat{P}_{N}-P\|<{\epsilon}$ .

Next let ${\mathcal{P}}_{\epsilon}=\{\hat{P}_{N}:\|\hat{P}_{N}-P\|<{\epsilon}\}$ be the collection of perturbed kernels that are within distance ${\epsilon}$ of the exact kernel. For illustration, consider the case in which the auxiliary set ${\mathcal{Z}_{N}}$ grows with the number of iterations, so that at each iteration a new kernel $\hat{P}_{N}\in{\mathcal{P}}_{\epsilon}$ is used in the chain. We want to show that this procedure results in an ergodic chain with appropriate convergence properties. Most of the results presented below rely on the work of Johndrow et al. (2015b) on convergence properties of perturbed kernels.

To obtain useful convergence results we need an additional assumption, the Doeblin Condition, on the exact kernel $P$ :

Definition 7.1 (Doeblin Condition).

A kernel $P$ satisfies the Doeblin Condition if there exists $0<{\alpha}<1$ such that

[TABLE]

We also choose ${\epsilon}$ so that ${\alpha}^{*}={\alpha}+2{\epsilon}<1$ and ${\epsilon}<{\alpha}/2$ , which by Remark 2.1 in Johndrow et al. (2015b) guarantees that every member of ${\mathcal{P}}_{\epsilon}$ satisfies the Doeblin Condition with constant ${\alpha}^{*}$ in place of ${\alpha}$ and has a unique invariant measure. Thus we define the following 3 assumptions:
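The constant ${\alpha}^{*}={\alpha}+2{\epsilon}$ can be motivated by the triangle inequality. The following is only a sketch: it assumes that the Doeblin Condition is read as a contraction bound $\sup_{{\theta},{\theta}^{\prime}}\|P({\theta},\cdot)-P({\theta}^{\prime},\cdot)\|_{TV}\leq{\alpha}$ , which is an assumption about the form of the condition and not a restatement of Remark 2.1. Under that reading, for any $\hat{P}\in{\mathcal{P}}_{\epsilon}$ and any ${\theta},{\theta}^{\prime}\in{\Theta}$ :

```latex
\begin{aligned}
\|\hat{P}(\theta,\cdot)-\hat{P}(\theta',\cdot)\|_{TV}
&\le \|\hat{P}(\theta,\cdot)-P(\theta,\cdot)\|_{TV}
   + \|P(\theta,\cdot)-P(\theta',\cdot)\|_{TV}
   + \|P(\theta',\cdot)-\hat{P}(\theta',\cdot)\|_{TV} \\
&\le \epsilon + \alpha + \epsilon = \alpha^{*} .
\end{aligned}
```

The requirement ${\alpha}^{*}<1$ then keeps every perturbed kernel a strict contraction in total variation.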

(A1)

The exact transition kernel $P$ satisfies the Doeblin Condition.

(A2)

For any $\hat{P}\in{\mathcal{P}}_{\epsilon}$ , $\|\hat{P}-P\|<{\epsilon}$ .

(A3)

${\epsilon}<\min({\alpha}/2,(1-{\alpha})/2)$ .

Now let $\mu$ be the invariant measure of the exact kernel $P$ , and let the perturbed chain ${\theta}^{(0)},{\theta}^{(1)},\cdots,{\theta}^{(t)}$ be a Markov chain with ${\theta}^{(0)}\sim\nu=\mu_{0}$ . Denote the marginal distribution of ${\theta}^{(t)}$ by $\mu_{t}$ , $t=1,2,\cdots$ , so that $\mu_{t}=\nu\hat{P}_{0}\hat{P}_{1}\cdots\hat{P}_{t}$ with each $\hat{P}_{t}\in{\mathcal{P}}_{\epsilon}$ , $t=1,2,\cdots$ , and $\hat{P}_{0}$ being the identity transition (for convenience). First we need to examine the total variation distance between $\mu$ and the average measure $\sum_{t=0}^{M-1}\mu_{t}/M$ , in other words:

[TABLE]

Then we have the following important convergence result:

Theorem 7.2.

Suppose that (A1), (A2) and (A3) are satisfied. Let $\nu$ be any probability measure on $({\Theta},{\mathcal{F}}_{0})$ , then

[TABLE]

which implies that this difference can be made arbitrarily small for sufficiently large $M$ and small enough ${\epsilon}$ .

Next we focus on the following mean squared error (MSE):

[TABLE]

where $f$ is a bounded function and $\mu f=E_{\mu}[f({\theta})]$ . The main objective here is to find an upper bound for this MSE when perturbed MCMC is used and to determine how it depends on the sample size $M$ . To obtain the main result we introduce the following lemma:

Lemma 7.3.

Suppose: (A2) and (A3) are satisfied; ${\theta}^{(0)}\sim\nu$ , where $\nu$ is a probability distribution; $\mu_{t}=\nu\hat{P}_{1}\cdots\hat{P}_{t}$ is the marginal distribution of ${\theta}^{(t)}$ , $t=1,2,\cdots$ . Let $f({\theta})$ and $g({\theta})$ be bounded functions with $|f|=\sup_{{\theta}}|f({\theta})|$ and $|g|=\sup_{{\theta}}|g({\theta})|$ . Then

[TABLE]

The next important convergence result follows (it is similar to Theorem 2.5 of Johndrow et al. (2015b)):

Theorem 7.4 (Approximation of MSE).

Suppose that (A1), (A2) and (A3) are satisfied. Let $\mu$ denote the invariant measure of $P$ , let $f({\theta})$ be a bounded function and let ${\theta}^{(0)}\sim\nu$ , where $\nu$ is a probability distribution. Then

[TABLE]

In other words, this expectation can be made arbitrarily small for sufficiently large $M$ and small enough ${\epsilon}$ .

Based on these theorems we obtain convergence results for the AABC and ABSL algorithms. To that end, we consider the following assumptions:

(B1)

${\Theta}$ is a compact set.

(B2)

$q({\theta})>0$ is the continuous density of the independent proposal distribution.

(B3)

$p({\theta})>0$ is the continuous density of the prior distribution.

(B4)

$h({\theta})$ is a continuous function of ${\theta}$ .

(B5)

The kNN estimator uses $K(N)=\sqrt{N}$ with uniform or linear weights.

(B6)

$E[s^{j}|{\theta}]$ and $E[s^{j}s^{k}|{\theta}]$ are continuous functions of ${\theta}$ for every $1\leq j,k\leq p$ , with $s^{j}$ representing the $j$ th component of the summary statistic $s$ .

(B7)

$Var[s^{j}|{\theta}]$ and $Var[s^{j}s^{k}|{\theta}]$ are bounded functions.

(B8)

$|\Sigma_{{\theta}}|>a_{0}$ for some constant $a_{0}>0$ and every ${\theta}\in{\Theta}$ , where $\Sigma_{{\theta}}=Var(s|{\theta})$ .

Theorem 7.5 (Ergodicity of AABC).

Consider the proposed AABC sampler with threshold ${\epsilon}$ and let $p({\theta})$ denote the prior measure on ${\Theta}$ and ${\mathcal{Z}_{N}}$ denote the simulated pairs $\{\tilde{\zeta}_{n},{\mathbf{1}}_{\{\tilde{\delta}_{n}<{\epsilon}\}}\}_{n=1}^{N}$ with $\tilde{\zeta}_{n}\sim q({\zeta})$ $\forall n$ . Assume (B1)-(B5) hold. Then for sufficiently large $N$ (number of past simulations) and $M$ (number of chain iterations), assumptions (A1)-(A3) are satisfied and the results established in Theorems 7.2 and 7.4 follow.

Corollary 7.5.1 (Ergodicity of ABSL).

Assume that (B1)-(B8) hold. Let $p({\theta})$ be the prior distribution on ${\Theta}$ , $h({\theta})={\mathcal{N}}\left(s_{0};\mu_{{\theta}},\Sigma_{{\theta}}\right)$ , and ${\mathcal{Z}_{N}}$ the set of simulated pairs $\{\tilde{\zeta}_{n},\{\tilde{s}_{n}^{(j)}\}_{j=1}^{m}\}_{n=1}^{N}$ . Then for sufficiently large $N$ (number of past simulations) and $M$ (number of chain iterations), assumptions (A1)-(A3) are satisfied and the results established in Theorems 7.2 and 7.4 follow.
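To make the role of $h({\theta})={\mathcal{N}}\left(s_{0};\mu_{{\theta}},\Sigma_{{\theta}}\right)$ concrete, the sketch below evaluates a kNN-based synthetic log-likelihood from stored simulations. It is an illustration under stated assumptions, not the authors' ABSL code: the conditional moments $E[s|{\theta}]$ and $E[ss^{T}|{\theta}]$ are estimated with uniform weights by pooling the $m$ replicates of the $K$ nearest past draws, the covariance is formed as the second moment minus the outer product of the mean, and the function name `knn_synthetic_loglik` is hypothetical.

```python
import numpy as np

def knn_synthetic_loglik(theta, s_obs, past_zeta, past_stats, k=None):
    """Gaussian synthetic log-likelihood with kNN-estimated mean and covariance.

    past_zeta : N past parameter draws (length N or shape (N, d))
    past_stats: array of shape (N, m, p) with m replicate summaries per draw
    """
    past_zeta = np.asarray(past_zeta, dtype=float).reshape(len(past_zeta), -1)
    past_stats = np.asarray(past_stats, dtype=float)
    s_obs = np.atleast_1d(np.asarray(s_obs, dtype=float))
    n = past_zeta.shape[0]
    if k is None:
        k = max(1, int(np.sqrt(n)))  # K(N) = sqrt(N), assumed default
    point = np.atleast_1d(np.asarray(theta, dtype=float))
    dist = np.linalg.norm(past_zeta - point, axis=1)
    nearest = np.argsort(dist)[:k]
    # Pool all replicates of the k nearest draws: shape (k * m, p).
    pooled = past_stats[nearest].reshape(-1, past_stats.shape[-1])
    mu = pooled.mean(axis=0)                      # estimate of E[s | theta]
    second = pooled.T @ pooled / pooled.shape[0]  # estimate of E[s s^T | theta]
    sigma = second - np.outer(mu, mu)             # estimate of Var(s | theta)
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        return -np.inf  # sketch-only convention for a degenerate covariance
    resid = s_obs - mu
    quad = resid @ np.linalg.solve(sigma, resid)
    p = s_obs.size
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + quad)

# Toy history: N = 4 draws, m = 1 replicate, p = 1 summary statistic.
zeta_hist = [0.0, 0.1, 5.0, 6.0]
stats_hist = [[[-1.0]], [[1.0]], [[10.0]], [[12.0]]]
loglik = knn_synthetic_loglik(0.05, 0.0, zeta_hist, stats_hist, k=2)
```

Assumption (B8) corresponds to the determinant check: the log-density is only well defined when the estimated covariance is non-degenerate.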

To prove the results above we will use the following two theorems. The first concerns the strong uniform consistency of kNN estimators; the second concerns the uniform ergodicity of the Hastings algorithm with an independent proposal.

Theorem 7.6 (Uniform Consistency of kNN - Cheng (1984)).

Given independent $\{\tilde{\zeta}_{n},\tilde{\delta}_{n}\}_{n=1}^{N}$ , let ${\Theta}$ be the support of the distribution of $\tilde{\zeta}$ , $h(\tilde{\zeta})=E(\tilde{\delta}|\tilde{\zeta})$ and $\hat{h}_{N}(\tilde{\zeta})=\sum_{j=1}^{N}W_{Nj}\tilde{\delta}_{j}$ (the kNN estimator; here $j$ are permuted indices that order the distances between $\tilde{\zeta}_{n}$ and $\tilde{\zeta}$ from smallest to largest). Suppose the weights $W_{Nj}$ satisfy

(i)

$\sum_{j=1}^{N}W_{Nj}=1$ ,

(ii)

$W_{Nj}=0$ for $j>K$ , and $K=K(N)$ with $K\to\infty$ and $K/N\to 0$ ,

(iii)

$\sup_{N}K\max_{j}W_{Nj}<\infty$ .

If

(i)

${\Theta}$ is compact,

(ii)

$h(\tilde{\zeta})$ is a continuous function,

(iii)

$Var(\tilde{\delta}|\tilde{\zeta})$ is a bounded random variable,

(iv)

$K(N)$ satisfies $K/\sqrt{N}\log(N)\to\infty$ ,

then $\sup_{\tilde{\zeta}\in{\Theta}}|\hat{h}_{N}(\tilde{\zeta})-h(\tilde{\zeta})|\to 0$ with probability 1.

Note that the uniform and linear weights satisfy $W_{Nj}$ assumptions above.
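The three weight conditions can be checked numerically for specific weight schemes. The sketch below is illustrative only: the uniform weights are $1/K$ on the $K$ nearest neighbours, while the "linear" weights are assumed here to decrease linearly in the neighbour rank, which may differ from the definition used elsewhere in the paper. Condition (iii) is a supremum over all $N$ , so a finite check can only suggest that it holds.

```python
import numpy as np

def uniform_weights(n, k):
    """Weight 1/k on each of the k nearest neighbours, zero elsewhere."""
    w = np.zeros(n)
    w[:k] = 1.0 / k
    return w

def linear_weights(n, k):
    """Rank-based linear weights (assumed form): w_j proportional to k - j + 1."""
    w = np.zeros(n)
    ranks = np.arange(1, k + 1)
    w[:k] = (k - ranks + 1) / (k * (k + 1) / 2.0)
    return w

def check_weight_conditions(weight_fn, sizes):
    """Check (i) and (ii) for each N and return max over N of K * max_j W_Nj."""
    scaled_max = []
    for n in sizes:
        k = int(np.sqrt(n))                 # K(N) = sqrt(N)
        w = weight_fn(n, k)
        assert np.isclose(w.sum(), 1.0)     # condition (i): weights sum to one
        assert np.all(w[k:] == 0.0)         # condition (ii): zero beyond K
        scaled_max.append(k * w.max())      # quantity bounded in condition (iii)
    return max(scaled_max)

sizes = [100, 10_000, 1_000_000]
uniform_bound = check_weight_conditions(uniform_weights, sizes)
linear_bound = check_weight_conditions(linear_weights, sizes)
```

For the uniform scheme $K\max_{j}W_{Nj}=1$ , and for the rank-based linear scheme it equals $2K/(K+1)<2$ , so both stay bounded as $N$ grows.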

Theorem 7.7 (Independent Metropolis sampler - Mengersen et al. (1996)).

Suppose ${\theta}^{(t)}$ is a MH Markov Chain with invariant distribution $\pi({\theta})$ , independent proposal $q({\theta})$ and acceptance probabilities $a({\theta},{\zeta}^{*})=\min\left(1,\frac{\pi({\zeta}^{*})q({\theta})}{\pi({\theta})q({\zeta}^{*})}\right)$ .

If there exists $\beta>0$ such that $q({\theta})/\pi({\theta})>\beta$ for all ${\theta}\in{\Theta}$ , then the algorithm is uniformly ergodic so that $\|P^{n}({\theta},\cdot)-\pi\|_{TV}<(1-\beta)^{n}$ (here $P^{n}({\theta},\cdot)$ is the conditional distribution of ${\theta}^{(n)}$ given ${\theta}^{(0)}={\theta}$ ).
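The geometric bound in Theorem 7.7 can be illustrated on a finite state space, where the transition matrix of the independence sampler is available explicitly. This is a toy check under assumptions of our own choosing: the four-point target `pi` and the uniform proposal `q` are hypothetical, total variation is taken as half the $L_{1}$ distance, and $\beta$ is set to $\min_{{\theta}}q({\theta})/\pi({\theta})$ , so the comparison uses a non-strict inequality.

```python
import numpy as np

def independence_kernel(pi, q):
    """Transition matrix of the independence MH sampler on a finite space."""
    m = len(pi)
    P = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                ratio = (pi[j] * q[i]) / (pi[i] * q[j])
                P[i, j] = q[j] * min(1.0, ratio)  # propose j, then accept
        P[i, i] = 1.0 - P[i].sum()                # rejection mass stays at i
    return P

def worst_case_tv(P_n, pi):
    """Largest total variation distance to pi over all starting states."""
    return 0.5 * np.abs(P_n - pi).sum(axis=1).max()

def check_bound(pi, q, n_steps):
    """Return (n, worst-case TV, (1 - beta)^n) for n = 1, ..., n_steps."""
    beta = np.min(q / pi)
    P = independence_kernel(pi, q)
    P_n = np.eye(len(pi))
    rows = []
    for n in range(1, n_steps + 1):
        P_n = P_n @ P
        rows.append((n, worst_case_tv(P_n, pi), (1.0 - beta) ** n))
    return rows

pi = np.array([0.5, 0.3, 0.15, 0.05])    # hypothetical target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # hypothetical independent proposal
results = check_bound(pi, q, 10)
for n, tv, bound in results:
    print(f"n={n:2d}  TV={tv:.6f}  bound={bound:.6f}")
```

In the proofs of Theorem 7.5 and Corollary 7.5.1 the same quantity $\beta$ is obtained with $\pi({\theta})=p({\theta})h({\theta})/c$ .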

7.1 Proofs of theorems

Proof of Theorem 7.1.

Note that $\sup_{{\theta}}\|\hat{h}({\theta};{\mathcal{Z}_{N}})-h({\theta})\|\to 0$ w.p.1 implies that for all ${\theta}$ and ${\zeta}^{*}$ in ${\Theta}$ :

[TABLE]

therefore by Slutsky’s theorem we obtain

[TABLE]

for all $({\theta},{\zeta}^{*})$ in ${\Theta}\times{\Theta}$ . Therefore

[TABLE]

Since $\min(1,x)$ is a continuous function, the Continuous Mapping Theorem implies that

[TABLE]

Note that this is not just point-wise convergence but uniform convergence in probability, so that one $C$ works for all $({\theta},{\zeta}^{*})$ . That is, for any $\delta>0$ and ${\epsilon}>0$ there exists $C$ such that for all $N>C$ and all $({\theta},{\zeta}^{*})$ , $P(|\hat{a}({\theta},{\zeta}^{*};{\mathcal{Z}_{N}})-a({\theta},{\zeta}^{*})|>\delta)<{\epsilon}$ .

Another important observation is that (fixing ${\theta}$ , ${\zeta}^{*}$ and letting $a({\theta},{\zeta}^{*})=a$ and $\hat{a}({\theta},{\zeta}^{*};{\mathcal{Z}_{N}})=\hat{a}$ for convenience)

[TABLE]

This holds because $|\hat{a}-a|\leq 1$ , together with the definition of convergence in probability. The above inequality shows that we can make this expected value arbitrarily small by taking $N$ large enough; moreover, this result is uniform, so one $N$ works for all ${\theta}$ and ${\zeta}^{*}$ .

Next we focus on the distance between the two transition kernels. This discussion is similar to the proof of Corollary 2.3 in Alquier et al. (2016). Observe that (using independent proposals):

[TABLE]

where $r({\theta})=1-\int q({\zeta}^{*})a({\theta},{\zeta}^{*})d{\zeta}^{*}$ and $\hat{r}_{N}({\theta})=1-\int\int q({\zeta}^{*})\hat{a}({\theta},{\zeta}^{*};{\mathcal{Z}_{N}})d{\zeta}^{*}dF({\mathcal{Z}_{N}})$ . Fix ${\theta}\in\Theta$ and note that the total variation distance between two probability distributions that have densities is also equal to:

[TABLE]

Therefore

[TABLE]

and it follows that

[TABLE]

for any ${\epsilon}>0$ and $\delta>0$ and large enough $N$ by (31). Since this result is true for any ${\theta}\in{\Theta}$ we finally get the main result:

[TABLE]

∎

Proof of Theorem 7.2.

We generally follow the proof of Theorem 2.4 in Johndrow et al. (2015b). First observe that:

[TABLE]

By assumptions (A2) and (A3), we get:

[TABLE]

and

[TABLE]

Using these results, the triangle inequality and the formula for the sum of a finite geometric series, we establish that:

[TABLE]

Finally we get the main result using the fact that $\mu$ is invariant for $P$ (again using the sum of a finite geometric series)

[TABLE]

∎

Proof of Lemma 7.3.

Without loss of generality we assume that $k>j$ . Next define:

[TABLE]

so that $E[\tilde{f}({\theta}^{(j)})]=E[\tilde{g}({\theta}^{(k)})]=0$ . Then we get the following:

[TABLE]

where $\delta_{{\theta}}$ is the point mass at ${\theta}$ and, in our notation, $\delta_{{\theta}^{(j)}}\hat{P}_{j+1}\cdots\hat{P}_{k}$ corresponds to the conditional distribution of ${\theta}^{(k)}$ given a fixed value of ${\theta}^{(j)}$ .

Using the general observation that for any two measures $\nu_{1}$ and $\nu_{2}$ and any bounded function $f$ the following inequality holds

[TABLE]

we find that:

[TABLE]

Note that this result holds for any ${\theta}^{(j)}\in{\Theta}$ . Returning to (37) we get that:

[TABLE]

Finally, by the triangle inequality, $|\tilde{f}|\leq 2|f|$ for any $j=1,2,\cdots$ , and similarly for $|\tilde{g}|$ . The desired result follows immediately. ∎
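The bound $|\tilde{f}|\leq 2|f|$ used at the end of the proof of Lemma 7.3 can be spelled out in one line. The sketch assumes that $\tilde{f}$ is the centered function $\tilde{f}({\theta})=f({\theta})-\mu_{j}f$ with $\mu_{j}f=E[f({\theta}^{(j)})]$ , which is consistent with $E[\tilde{f}({\theta}^{(j)})]=0$ , and it reads $|f|$ as $\sup_{{\theta}}|f({\theta})|$ :

```latex
\sup_{\theta}|\tilde{f}(\theta)|
  = \sup_{\theta}\left|f(\theta)-\mu_{j}f\right|
  \le \sup_{\theta}|f(\theta)| + \left|E\!\left[f(\theta^{(j)})\right]\right|
  \le 2\sup_{\theta}|f(\theta)| .
```

The same argument with $g$ and $\mu_{k}g$ gives $|\tilde{g}|\leq 2|g|$ .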

Proof of Theorem 7.4.

Using our standard notation $\nu\hat{P}_{0}\cdots\hat{P}_{t}f=E[f({\theta}^{(t)})]$ , Theorem 7.2, Lemma 7.3 and elementary results for the double sum of a geometric series, we get

[TABLE]

This gives the desired result. ∎

Proof of Theorem 7.5.

First, by (B1)-(B4), Theorem 7.7 guarantees uniform ergodicity of the exact chain $P$ with $\beta=\min_{{\theta}\in{\Theta}}\frac{q({\theta})}{p({\theta})h({\theta})/c}$ where $c$ is the normalizing constant of the posterior. Note that $\beta>0$ since ${\Theta}$ is compact and the ratio is continuous and never zero. Therefore $P$ also satisfies the Doeblin Condition. Next, from (B1), (B4) and (B5), Theorem 7.6 implies that $\sup_{{\theta}\in{\Theta}}\|\hat{h}({\theta};{\mathcal{Z}_{N}})-h({\theta})\|\to 0$ with probability 1. Hence, by Theorem 7.1, the perturbed kernel $\hat{P}$ can be made arbitrarily close to the exact kernel $P$ for sufficiently large $N$ , since the total variation distance between $\hat{P}_{N}$ and $P$ decreases to zero as $N$ increases. Finally, the assumptions of Theorems 7.2 and 7.4 follow trivially. ∎

Proof of Corollary 7.5.1.

First, by (B1), (B2), (B3), (B4) and (B8), Theorem 7.7 guarantees uniform ergodicity of the exact chain $P$ with $\beta=\min_{{\theta}\in{\Theta}}\frac{q({\theta})}{p({\theta})h({\theta})/c}$ where $c$ is the normalizing constant of the posterior. Note that $\beta>0$ since ${\Theta}$ is compact and the ratio is continuous and never zero. Therefore $P$ satisfies the Doeblin Condition. Next, from (B1), (B5), (B6) and (B7), Theorem 7.6 implies that $\sup_{{\theta}\in{\Theta}}\|\hat{h}({\theta};{\mathcal{Z}_{N}})-h({\theta})\|\to 0$ with probability 1. Hence, by Theorem 7.1, the perturbed kernel $\hat{P}$ can be made arbitrarily close to the exact kernel $P$ for sufficiently large $N$ , since the total variation distance between $\hat{P}_{N}$ and $P$ decreases to zero as $N$ increases. Finally, the assumptions of Theorems 7.2 and 7.4 follow trivially. ∎

8 Conclusion

In this paper we proposed to speed up generic ABC-MCMC and BSL algorithms by storing and reusing past simulations. This approach significantly accelerates the process and can be very useful for models where simulating a pseudo data set is computationally expensive or when a large number of MCMC iterations is required. We presented theoretical arguments and the assumptions needed for convergence of the perturbed chain. The performance of these strategies was examined via a series of simulations under different models. All simulation summaries show that the proposed methods significantly improve the mixing and efficiency of the chain while producing parameter estimates that are as accurate and precise as those of the generic samplers.

Acknowledgments

We thank Jeffrey Rosenthal and Stanislav Volgushev for constructive comments. RVC is grateful to the organizers of the BIRS workshop “Validating and Expanding Approximate Bayesian Computation Methods” for creating a stimulating environment that generated ideas for this work. Finally, we acknowledge funding support from NSERC of Canada.

Bibliography


Alquier, P., Friel, N., Everitt, R. and Boland, A. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Statistics and Computing 26 29–47.
An, Z., Nott, D. J. and Drovandi, C. (2018). Robust Bayesian synthetic likelihood via a semi-parametric approach. arXiv preprint arXiv:1809.05800.
Andrieu, C., Doucet, A. and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 269–342.
Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37 697–725.
Balakrishnan, S., Madigan, D. et al. (2006). A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis 1 345–361.
Baragatti, M. and Pudlo, P. (2014). An overview on approximate Bayesian computation. In ESAIM: Proceedings, vol. 44. EDP Sciences.
Bardenet, R., Doucet, A. and Holmes, C. (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning (ICML).
Biau, G. and Devroye, L. (2015). Lectures on the nearest neighbor method. Springer.
