Bayesian Estimation of Economic Simulation Models using Neural Networks

Donovan Platt

arXiv:1906.04522 · econ.GN · June 12, 2019

TL;DR

This paper introduces a Bayesian estimation method using neural networks to approximate likelihood functions, improving accuracy in complex economic simulation models like agent-based and financial models.

Contribution

It presents a novel Bayesian estimation protocol leveraging deep neural networks to better estimate complex simulation models with intractable likelihoods.

Findings

01. The proposed method yields more accurate estimates across various models.

02. It effectively detects structural breaks and dynamic changes.

03. Benchmark comparisons show superior performance over existing methods.

Abstract

Recent advances in computing power and the potential to make more realistic assumptions due to increased flexibility have led to the increased prevalence of simulation models in economics. While models of this class, and particularly agent-based models, are able to replicate a number of empirically-observed stylised facts not easily recovered by more traditional alternatives, such models remain notoriously difficult to estimate due to their lack of tractable likelihood functions. While the estimation literature continues to grow, existing attempts have approached the problem primarily from a frequentist perspective, with the Bayesian estimation literature remaining comparatively less developed. For this reason, we introduce a Bayesian estimation protocol that makes use of deep neural networks to construct an approximation to the likelihood, which we then benchmark against a prominent…

Tables

Table 1: Estimation Result Summary for the Brock and Hommes (1998) Model

	$g_{2}$	$b_{2}$	$g_{3}$	$b_{3}$
Param Set $1$
$\bm{\theta}_{true}$	$-0.7$	$-0.4$	$0.5$	$0.3$
MDN
$\bm{\mu}_{posterior}$	$-0.6931$	$-0.4048$	$0.5505$	$0.3160$
$\bm{\sigma}_{posterior}$	$0.1681$	$0.0105$	$0.1864$	$0.0103$
$\bm{\sigma}_{sampling}$	$0.0051$	$0.0002$	$0.0055$	$0.0003$
$LS$	$0.0536$
KDE
$\bm{\mu}_{posterior}$	$-0.5910$	$-0.4004$	$0.4092$	$0.3083$
$\bm{\sigma}_{posterior}$	$0.2787$	$0.0254$	$0.2603$	$0.0197$
$\bm{\sigma}_{sampling}$	$0.0089$	$0.0012$	$0.0130$	$0.0011$
$LS$	$0.1421$
Param Set $2$
$\bm{\theta}_{true}$	$0.6$	$0.2$	$0.7$	$-0.2$
MDN
$\bm{\mu}_{posterior}$	$0.6021$	$0.2401$	$0.7493$	$-0.2304$
$\bm{\sigma}_{posterior}$	$0.1804$	$0.0149$	$0.1662$	$0.0147$
$\bm{\sigma}_{sampling}$	$0.0116$	$0.0004$	$0.0090$	$0.0004$
$LS$	$0.0705$
KDE
$\bm{\mu}_{posterior}$	$0.4658$	$0.2410$	$0.6461$	$-0.2330$
$\bm{\sigma}_{posterior}$	$0.2803$	$0.0677$	$0.2571$	$0.0666$
$\bm{\sigma}_{sampling}$	$0.01693$	$0.0067$	$0.0145$	$0.0067$
$LS$	$0.1539$

Table 2: Estimation Result Summary for the Random Walk Model (Increasing Volatility)

	$\sigma_{1}$	$\sigma_{2}$	$\Delta_{\sigma}$
Param Set $1$
$\bm{\theta}_{true}$	$1$	$2$	$1$
MDN
$\bm{\mu}_{posterior}$	$1.0585$	$1.9957$	$0.9372$
$\bm{\sigma}_{posterior}$	$0.8153$	$0.6517$	$-$
$\bm{\sigma}_{sampling}$	$0.0137$	$0.0629$	$-$
$LS$		$0.0059$
KDE
$\bm{\mu}_{posterior}$	$0.9966$	$1.9084$	$0.9118$
$\bm{\sigma}_{posterior}$	$0.4113$	$0.2719$	$-$
$\bm{\sigma}_{sampling}$	$0.0430$	$0.0197$	$-$
$LS$		$0.0092$
Param Set $2$
$\bm{\theta}_{true}$	$1$	$2$	$1$
MDN
$\bm{\mu}_{posterior}$	$1.0205$	$1.9598$	$0.9393$
$\bm{\sigma}_{posterior}$	$0.5660$	$0.4605$	$-$
$\bm{\sigma}_{sampling}$	$0.0216$	$0.0478$	$-$
$LS$		$0.0045$
KDE
$\bm{\mu}_{posterior}$	$0.9790$	$1.8930$	$0.9144$
$\bm{\sigma}_{posterior}$	$0.0923$	$0.2141$	$-$
$\bm{\sigma}_{sampling}$	$0.0046$	$0.0169$	$-$
$LS$		$0.0109$

Table 3: Estimation Result Summary for the Random Walk Model (Increasing Drift)

	$d_{1}$	$d_{2}$	$\Delta_{d}$
Param Set $3$
$\bm{\theta}_{true}$	$0.4$	$0.5$	$0.1$
MDN
$\bm{\mu}_{posterior}$	$0.4867$	$0.5465$	$0.0598$
$\bm{\sigma}_{posterior}$	$0.0536$	$0.1139$	$-$
$\bm{\sigma}_{sampling}$	$0.0056$	$0.0038$	$-$
$LS$		$0.0984$
KDE
$\bm{\mu}_{posterior}$	$0.5204$	$0.3258$	$-0.1945$
$\bm{\sigma}_{posterior}$	$0.0578$	$0.1463$	$-$
$\bm{\sigma}_{sampling}$	$0.0032$	$0.0050$	$-$
$LS$		$0.2117$
Param Set $4$
$\bm{\theta}_{true}$	$0.4$	$0.7$	$0.3$
MDN
$\bm{\mu}_{posterior}$	$0.5054$	$0.6876$	$0.1823$
$\bm{\sigma}_{posterior}$	$0.0434$	$0.1131$	$-$
$\bm{\sigma}_{sampling}$	$0.0024$	$0.0036$	$-$
$LS$		$0.1061$
KDE
$\bm{\mu}_{posterior}$	$0.5308$	$0.5033$	$-0.0275$
$\bm{\sigma}_{posterior}$	$0.0561$	$0.1457$	$-$
$\bm{\sigma}_{sampling}$	$0.0025$	$0.0041$	$-$
$LS$		$0.2362$

Table 4: Estimation Result Summary for the Random Walk Model (Decreasing Drift)

	$d_{1}$	$d_{2}$	$\Delta_{d}$
Param Set $5$
$\bm{\theta}_{true}$	$0.5$	$0.4$	$-0.1$
MDN
$\bm{\mu}_{posterior}$	$0.5691$	$0.4743$	$-0.0949$
$\bm{\sigma}_{posterior}$	$0.0485$	$0.1348$	$-$
$\bm{\sigma}_{sampling}$	$0.0031$	$0.0039$	$-$
$LS$		$0.1015$
KDE
$\bm{\mu}_{posterior}$	$0.6015$	$0.2611$	$-0.3404$
$\bm{\sigma}_{posterior}$	$0.0573$	$0.1396$	$-$
$\bm{\sigma}_{sampling}$	$0.0039$	$0.0032$	$-$
$LS$		$0.1720$
Param Set $6$
$\bm{\theta}_{true}$	$0.7$	$0.4$	$-0.3$
MDN
$\bm{\mu}_{posterior}$	$0.7585$	$0.4400$	$-0.3185$
$\bm{\sigma}_{posterior}$	$0.0532$	$0.1526$	$-$
$\bm{\sigma}_{sampling}$	$0.0033$	$0.0029$	$-$
$LS$		$0.0709$
KDE
$\bm{\mu}_{posterior}$	$0.7838$	$0.2934$	$-0.4904$
$\bm{\sigma}_{posterior}$	$0.0564$	$0.1469$	$-$
$\bm{\sigma}_{sampling}$	$0.0027$	$0.0030$	$-$
$LS$		$0.1356$

Table 5: Estimation Result Summary for the Franke and Westerhoff (2012) Model

	$\alpha_{0}$	$\alpha_{n}$	$\alpha_{p}$	$\sigma_{c}$
Param Set HPM
$\bm{\theta}_{true}$	$-0.327$	$1.79$	$18.43$	$2.087$
MDN
$\bm{\mu}_{posterior}$	$-0.1749$	$1.8987$	$17.1821$	$2.3113$
$\bm{\sigma}_{posterior}$	$0.1297$	$0.1697$	$2.2932$	$0.3548$
$\bm{\sigma}_{sampling}$	$0.0036$	$0.0232$	$0.0410$	$0.0130$
$LS$	$0.1210$
KDE
$\bm{\mu}_{posterior}$	$-0.1287$	$1.7968$	$16.2177$	$2.3134$
$\bm{\sigma}_{posterior}$	$0.1667$	$0.2880$	$3.1280$	$0.5547$
$\bm{\sigma}_{sampling}$	$0.0139$	$0.0105$	$0.2356$	$0.05105$
$LS$	$0.15534$
	$\alpha_{w}$	$\eta$	$\sigma_{c}$
Param Set WP
$\bm{\theta}_{true}$	$2668$	$0.987$	$1.726$
MDN
$\bm{\mu}_{posterior}$	$1993.1311$	$0.9078$	$1.6991$
$\bm{\sigma}_{posterior}$	$2195.8553$	$0.0799$	$0.4335$
$\bm{\sigma}_{sampling}$	$184.4589$	$0.0043$	$0.0364$
$LS$	$0.0912$
KDE
$\bm{\mu}_{posterior}$	$2437.1697$	$0.6263$	$1.4567$
$\bm{\sigma}_{posterior}$	$2831.5574$	$0.2846$	$0.3403$
$\bm{\sigma}_{sampling}$	$458.0461$	$0.0257$	$0.0296$
$LS$	$0.3650$

Table 6: Estimation Result Summary Across All Models

Outcome	Percentage of Cases
$LS_{mdn} < LS_{kde}$	$100$
$|\mu_{mdn}^{i} - \theta_{true}^{i}| < |\mu_{kde}^{i} - \theta_{true}^{i}|$	$81.48$
$\sigma_{mdn}^{i} < \sigma_{kde}^{i}$	$77.78$

Table 7: Estimation Result Summary Across All Models for $L=4$

Outcome	Percentage of Cases
$LS_{mdn} < LS_{kde}$	$100$
$|\mu_{mdn}^{i} - \theta_{true}^{i}| < |\mu_{kde}^{i} - \theta_{true}^{i}|$	$77.78$
$\sigma_{mdn}^{i} < \sigma_{kde}^{i}$	$74.07$

Equations

\bm{X}^{sim}(\bm{\theta},T,i)=\left[\bm{x}_{1,i}^{sim}(\bm{\theta}),\bm{x}_{2,i}^{sim}(\bm{\theta}),\dots,\bm{x}_{T,i}^{sim}(\bm{\theta})\right],

\bm{X}=\left[\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{T}\right],

p(\bm{\theta}|\bm{X})=\frac{p(\bm{X}|\bm{\theta})\,p(\bm{\theta})}{p(\bm{X})}.

p(\bm{\theta}|\bm{X})\propto p(\bm{X}|\bm{\theta})\,p(\bm{\theta}).

p(\bm{X}|\bm{\theta})=\prod_{t=1}^{T}\tilde{f}(\bm{x}_{t}|\bm{\theta}).

p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{1,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right)=p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right)

\tilde{f}\left(\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim},\bm{x}_{t,i}^{sim},\bm{\phi}\right)\simeq p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right),

\tilde{f}\left(\bm{x},\bm{y},\bm{\phi}\right)=\sum_{k=1}^{K}\alpha_{k}\left(\bm{x}\right)\mathcal{N}\left(\bm{y}\big{|}\bm{\mu}_{k}\left(\bm{x}\right),\bm{\Sigma}_{k}\left(\bm{x}\right)\right),

p(\bm{X}|\bm{\theta})=\prod_{t=1}^{T-L}\tilde{f}\left(\bm{x}_{t},\dots,\bm{x}_{t+L-1},\bm{x}_{t+L},\bm{\phi}\right).

LS(\bm{\theta}^{true},\hat{\bm{\theta}})=\left\|\bm{\theta}^{true}-\hat{\bm{\theta}}\right\|_{2},

\hat{\theta}_{j}^{[0,1]}=\frac{\hat{\theta}_{j}-a}{b-a},

y_{t+1}

n_{h,t+1}

U_{h,t}

p_{t}=y_{t}+p^{*}.

x_{t+1}=x_{t}+d_{t+1}+\epsilon_{t+1},\quad\epsilon_{t}\sim\mathcal{N}(0,\sigma_{t}^{2}),

d_{t},\sigma_{t}=\begin{cases}d_{1},\sigma_{1}&t\leq\tau\\ d_{2},\sigma_{2}&t>\tau\end{cases}.

p_{t}

d_{t}^{f}

d_{t}^{c}

n_{t}^{f}

n_{t}^{c}

a_{t}=\alpha_{n}(n_{t}^{f}-n_{t}^{c})+\alpha_{0}+\alpha_{p}(p_{t}-p^{*})^{2},

g_{t}^{s}

w_{t}^{s}

a_{t}

p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right)\simeq p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L-1,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right),

\begin{split}\bm{X}^{train}_{i}(\bm{\theta})=\Big{\{}&\left\{\bm{x}^{sim}_{1,i}(\bm{\theta}),\dots,\bm{x}^{sim}_{L,i}(\bm{\theta})\right\},\left\{\bm{x}^{sim}_{2,i}(\bm{\theta}),\dots,\bm{x}^{sim}_{L+1,i}(\bm{\theta})\right\},\dots,\\ &\left\{\bm{x}^{sim}_{T-L,i}(\bm{\theta}),\dots,\bm{x}^{sim}_{T-1,i}(\bm{\theta})\right\}\Big{\}},\end{split}

\bm{Y}_{i}^{train}(\bm{\theta})=\left\{\bm{x}_{L+1,i}^{sim}(\bm{\theta}),\bm{x}_{L+2,i}^{sim}(\bm{\theta}),\dots,\bm{x}_{T,i}^{sim}(\bm{\theta})\right\}.

\alpha=\mathrm{softmax}(W_{\alpha}h_{n}+b_{\alpha}),

\bm{\mu}_{k}=W_{\mu_{k}}h_{n}+b_{\mu_{k}},

\bm{\Sigma}_{k}=\mathrm{diag}(\sigma_{k}^{2}),

\log\sigma_{k}^{2}=W_{\sigma_{k}}h_{n}+b_{\sigma_{k}}.

\bm{\phi}=\{\psi,W_{\alpha},b_{\alpha},W_{\mu_{k}},b_{\mu_{k}},W_{\sigma_{k}},b_{\sigma_{k}}\}

Full text

Bayesian Estimation of Economic Simulation Models using Neural Networks

Donovan Platt (corresponding author, [email protected])

Mathematical Institute, University of Oxford

Institute for New Economic Thinking (INET) at the Oxford Martin School

Abstract

Recent advances in computing power and the potential to make more realistic assumptions due to increased flexibility have led to the increased prevalence of simulation models in economics. While models of this class, and particularly agent-based models, are able to replicate a number of empirically-observed stylised facts not easily recovered by more traditional alternatives, such models remain notoriously difficult to estimate due to their lack of tractable likelihood functions. While the estimation literature continues to grow, existing attempts have approached the problem primarily from a frequentist perspective, with the Bayesian estimation literature remaining comparatively less developed. For this reason, we introduce a Bayesian estimation protocol that makes use of deep neural networks to construct an approximation to the likelihood, which we then benchmark against a prominent alternative from the existing literature. Overall, we find that our proposed methodology consistently results in more accurate estimates in a variety of settings, including the estimation of financial heterogeneous agent models and the identification of changes in dynamics occurring in models incorporating structural breaks.

Keywords: Agent-based modelling, Simulation modelling, Bayesian estimation, Machine learning, Neural networks

JEL Classification: C13 $\cdot$ C52

1 Introduction

Recent years have, to some extent, seen the emergence of a paradigm shift in how economic models are constructed. Traditionally, a need to facilitate mathematical tractability and limited computational resources have led to a dependence on strong assumptions [Footnote 1: These include, but are not limited to, assumptions of perfect rationality and the existence of representative agents.], many of which are inconsistent with the heterogeneity and non-linearity that characterise real economic systems (Geanakoplos and Farmer 2008; Farmer and Foley 2009; Fagiolo and Roventini 2017). Ultimately, the Great Recession of the late 2000s and the perceived failings of traditional approaches, particularly those built on general equilibrium theory, would lead to the birth of a growing community arguing that the adoption of new paradigms harnessing contemporary advances in computing power could lead to richer and more robust insights (Farmer and Foley 2009; Fagiolo and Roventini 2017).

Perhaps the most prominent examples of this new wave of computational approaches are agent-based models (ABMs), which attempt to model systems by directly simulating the actions of and interactions between their microconstituents (Macal and North 2010). In theory, the flexibility offered by simulation should allow for more empirically-motivated assumptions and this, in turn, should result in a more principled approach to the modelling of the economy (Chen 2003; LeBaron 2006). The extent to which this has been achieved in practice, however, remains open for debate (Hamill and Gilbert 2016).

While ABMs initially found success by demonstrating an ability to replicate a wide array of stylised facts not recovered by more traditional approaches (LeBaron 2006; Barde 2016), their simulation-based nature makes their estimation nontrivial (Fagiolo et al. 2017). Therefore, while the last decade has seen the emergence of increasingly larger and more realistic macroeconomic models, such as the Eurace (Cincotti et al. 2010) and Schumpeter Meeting Keynes (Dosi et al. 2010) models, their acceptance in mainstream policy-making circles remains limited due to these and other challenges.

The aforementioned estimation difficulties largely stem from the simulation-based nature of ABMs, which, in all but a few exceptional cases [Footnote 2: See, for example, the work of Alfarano et al. (2005), Alfarano et al. (2006) and Alfarano et al. (2007).], renders it impossible to obtain a tractable expression for the likelihood function. As a result, most existing approaches have attempted to circumvent these difficulties by directly comparing model-simulated and empirically-measured data using measures of dissimilarity (or similarity) and searching the parameter space for appropriate values that minimise (or maximise) these metrics (Grazzini et al. 2017; Lux 2018). The most pervasive of these approaches, which Grazzini and Richiardi (2015) call simulated minimum distance (SMD) methods, is the method of simulated moments (MSM), which constructs an objective function by considering weighted sums of the squared errors between simulated and empirically-measured moments (or summary statistics).

Though MSM has been widely applied in a number of different contexts [Footnote 3: See Franke (2009), Franke and Westerhoff (2012), Fabretti (2013), Grazzini and Richiardi (2015), Chen and Lux (2016) and Platt and Gebbie (2018) for examples.] and has desirable mathematical properties [Footnote 4: The estimator is both consistent and asymptotically normal (McFadden 1989).], it suffers from a critical weakness. Specifically, the choice of moments or summary statistics is entirely arbitrary and the quality of the associated parameter estimates depends critically on selecting a sufficiently comprehensive set of moments, which has proven to be nontrivial in practice. In response, recent years have seen the development of a new generation of SMD methods that largely eliminate the need to transform data into a set of summary statistics and instead harness its full informational content (Grazzini et al. 2017).

These new methodologies vary substantially in their sophistication and theoretical underpinnings. Among the simplest of these approaches is attempting to match time series trajectories directly, as suggested by Recchioni et al. (2015). More sophisticated alternatives include information-theoretic approaches (Barde 2017; Lamperti 2017), simulated maximum likelihood estimation (Kukacka and Barunik 2017), and comparing the causal mechanisms underlying real and simulated data through the use of SVAR regressions (Guerini and Moneta 2017). In addition to the development of similarity metrics, attempts have also been made to reduce the large computational burden imposed by SMD methods by replacing the costly model simulation process with computationally efficient surrogates (Salle and Yildizoglu 2014; Lamperti et al. 2018).

Interestingly, the aforementioned approaches are all frequentist in nature, with Bayesian estimation being significantly less prevalent [Footnote 5: There is a rather substantial literature on what are called approximate Bayesian computation methods that has gained a significant following in biology and ecology (Sisson et al. 2018). Unfortunately, the vast majority of these methods rely on converting data to a set of summary statistics and their appeal for estimating economic ABMs is therefore limited.]. As it currently stands, only one study in the literature (Grazzini et al. 2017) has focused extensively on the use of Bayesian techniques. Recent work by Lux (2018) involving sequential Monte Carlo methods includes attempts at Bayesian estimation, though the work as a whole focuses more on a frequentist approach.

While the estimation literature has certainly been growing, it still suffers from a number of key weaknesses. Perhaps the most significant of these is the lack of a standard benchmark against which to compare the performance of new methods. As a result, most new approaches have traditionally only been tested in isolation and comparative exercises have been relatively rare. To address this, we compared a number of prominent estimation techniques in a previous investigation (Platt 2019) and found, rather surprisingly, that the Bayesian estimation procedure proposed by Grazzini et al. (2017) consistently outperformed a number of prominent frequentist alternatives in a series of head-to-head tests, despite its relative simplicity. We therefore argued that more interest in Bayesian methods is warranted and suggested that increased emphasis should be placed on their development.

In line with this recommendation, we introduce a method for the Bayesian estimation of economic simulation models [Footnote 6: It is worth noting that while we focus on ABMs, the proposed methodology is applicable to any model capable of simulating time series or panel data. For this reason, the methodology would be equally applicable to competing modelling approaches.] that relaxes a number of the assumptions made by the approach of Grazzini et al. (2017) through the use of a neural network-based likelihood approximation. We then benchmark our proposed methodology through a series of computational experiments and finally conclude with discussions related to practical considerations, such as the setting of the method’s hyperparameters and the associated computational costs.

2 Estimation and Experimental Procedures

In this section, we introduce the reader to a number of the essential elements of our investigation, including a brief discussion of the fundamentals of Bayesian estimation, a description of the approach of Grazzini et al. (2017) (our chosen benchmark), and an introduction to our proposed estimation methodology.

2.1 Bayesian Estimation of Simulation Models

For our purposes, we consider a simulation model to be any mathematical or algorithmic representation of a real world system capable of producing time series (panel) data of the form

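$$\bm{X}^{sim}(\bm{\theta},T,i)=\left[\bm{x}_{1,i}^{sim}(\bm{\theta}),\bm{x}_{2,i}^{sim}(\bm{\theta}),\dots,\bm{x}_{T,i}^{sim}(\bm{\theta})\right],$$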

where $\bm{\theta}$ is a model parameter set in the space of feasible parameter values, $T$ is the length of the simulation, $i$ represents the seed used to initialise the model’s random number generators, and $\bm{x}^{sim}_{t,i}(\bm{\theta})\in\mathbb{R}^{n}$ for all $t=1,2,\dots,T$ .

In general, estimation or calibration procedures aim to determine appropriate values for $\bm{\theta}$ such that $\bm{X}^{sim}(\bm{\theta},T,i)$ produces dynamics that are as close as possible to those observed in an empirically-measured equivalent,

$$\bm{X}=\left[\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{T}\right],$$

where $\bm{x}_{t}\in\mathbb{R}^{n}$ for all $t=1,2,\dots,T$ .

Bayesian estimation attempts to achieve the above by first assuming that the parameter values follow a given distribution, $p(\bm{\theta})$ , which is chosen to reflect one’s prior knowledge or beliefs regarding the parameter values. This is then updated in light of empirically-measured data, yielding a modified distribution, $p(\bm{\theta}|\bm{X})$ , called the posterior. Bayesian estimation can therefore be framed in terms of Bayes’ theorem as follows:

$$p(\bm{\theta}|\bm{X})=\frac{p(\bm{X}|\bm{\theta})\,p(\bm{\theta})}{p(\bm{X})}.$$

Unfortunately, obtaining an analytical expression for the posterior is typically not feasible. Firstly, the normalisation constant, $p(\bm{X})$ , is unknown or determining it is nontrivial. Secondly, the likelihood, $p(\bm{X}|\bm{\theta})$ , is intractable for most simulation models, particularly large-scale macroeconomic ABMs. Nevertheless, these limitations can be overcome to some extent. Grazzini et al. (2017) provide a method for approximating $p(\bm{X}|\bm{\theta})$ for a particular value of $\bm{\theta}$ , which then allows us to evaluate the right-hand side of

$$p(\bm{\theta}|\bm{X})\propto p(\bm{X}|\bm{\theta})\,p(\bm{\theta}).$$

The above may then be used along with Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm, to sample the posterior. This is possible since most MCMC techniques only require that we are able to determine the value of a function proportional to the density function of interest rather than the density function itself. It should be apparent, however, that the overall estimation error will depend critically on the method used to approximate the likelihood.
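To make the role of the unnormalised posterior concrete, the following is a minimal sketch of a plain random-walk Metropolis-Hastings sampler. It is illustrative only: the paper reports using an adaptive scheme rather than this basic variant, and the names `metropolis_hastings` and `toy_log_post`, the step size, and the toy Gaussian target are all assumptions introduced here. In a real application `toy_log_post` would be replaced by an approximation of $\log p(\bm{X}|\bm{\theta})+\log p(\bm{\theta})$.

```python
import numpy as np

def metropolis_hastings(log_unnorm_post, theta0, n_samples, step, rng):
    """Random-walk Metropolis-Hastings on an unnormalised log-posterior."""
    theta = np.asarray(theta0, dtype=float)
    log_p = log_unnorm_post(theta)
    samples = np.empty((n_samples, theta.size))
    for s in range(n_samples):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = theta + step * rng.standard_normal(theta.size)
        log_p_new = log_unnorm_post(proposal)
        # Accept with probability min(1, ratio); p(X) cancels in the ratio,
        # so only a function proportional to the posterior is needed.
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples[s] = theta
    return samples

def toy_log_post(theta):
    """Hypothetical stand-in: Gaussian 'likelihood' with a uniform prior on [-5, 5]."""
    if np.any(np.abs(theta) > 5.0):
        return -np.inf  # outside the prior support
    return -0.5 * np.sum((theta - 1.0) ** 2) / 0.25

rng = np.random.default_rng(0)
draws = metropolis_hastings(toy_log_post, [0.0], 5000, 0.5, rng)
posterior_mean = draws[1000:].mean(axis=0)  # discard an assumed burn-in of 1000
```

Working in log space avoids numerical underflow when the likelihood is a product over many observations.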

2.2 The Approach of Grazzini et al. (2017)

As previously stated, Grazzini et al. (2017) provide a method to approximate the likelihood for simulation models, which we now discuss in more detail.

In essence, the approach is based on the assumption that, for all $t\geq\tilde{T}$ , we reach a statistical equilibrium such that $\bm{x}_{t,i}^{sim}(\bm{\theta})$ fluctuates around a stationary level, $\mathbb{E}[\bm{x}_{t,i}^{sim}(\bm{\theta})|t\geq\tilde{T}]$ , which allows us to further assume that $\bm{x}_{\tilde{T},i}^{sim}(\bm{\theta}),\bm{x}_{\tilde{T}+1,i}^{sim}(\bm{\theta}),\dots,\bm{x}_{T,i}^{sim}(\bm{\theta})$ constitutes a random sample from a given distribution [Footnote 7: The samples need not all be drawn from a single Monte Carlo replication and may instead be drawn from the statistical equilibria reached by each replication in an ensemble generated using various random seeds. In practice, we simulate an ensemble of $R$ such Monte Carlo replications for each candidate set of $\bm{\theta}$ values and combine the samples from each replication into a single random sample.]. It is then possible to determine a density function that describes this distribution, which we denote by $\tilde{f}(\bm{x}|\bm{\theta})$ , using kernel density estimation (KDE), finally allowing us to approximate the likelihood of the empirically-sampled data [Footnote 8: Note that we have assumed, as in the case of the simulated data, that the empirically-sampled data fluctuates around a stationary level.] for a given value of $\bm{\theta}$ as follows:

$$p(\bm{X}|\bm{\theta})=\prod_{t=1}^{T}\tilde{f}(\bm{x}_{t}|\bm{\theta}).$$

It should be apparent that the above results in a simple strategy that is easy to apply in most contexts. It must be noted, however, that this is largely made possible through strong assumptions that seldom hold in practice. In more detail, notice that ordered time series (panel) data is essentially being treated as an i.i.d. random sample, implying that $\bm{x}_{t,i}^{sim}(\bm{\theta})\perp\bm{x}_{1,i}^{sim}(\bm{\theta}),\dots,\bm{x}_{t-1,i}^{sim}(\bm{\theta})$ for all $t=2,3,\dots,T$ . Unfortunately, such independence assumptions do not hold for most simulation models, since $\bm{x}_{t,i}^{sim}(\bm{\theta})$ is likely to be dependent on at least some of the previously realised values, whether this dependence is explicit or mediated through latent variables. Additionally, such assumptions result in a likelihood function that makes no distinction between $\bm{\theta}$ values that result in identical unconditional distributions but differing temporal trends. Since many economic simulation models and particularly large-scale macroeconomic ABMs produce datasets that are characterised by seasonality or structural breaks, there is likely to be some impact on the quality of the resultant parameter estimates.

Nevertheless, Platt (2019) demonstrates that despite the above shortcomings, the method of Grazzini et al. (2017) is able to provide reasonable parameter estimates in many contexts, while also outperforming several more sophisticated frequentist approaches. This warrants further investigation and naturally leads one to ask whether relaxing the required independence assumptions would allow for the construction of a superior Bayesian estimation method.
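The KDE-based approximation discussed in this subsection can be sketched as follows for scalar observations. This is an illustration under stated assumptions rather than the implementation of Grazzini et al. (2017): the function name `kde_log_likelihood`, the Gaussian kernel, the normal-reference rule-of-thumb bandwidth, and the toy Gaussian "model" are all choices made here for the example. The simulated ensemble is pooled into one sample, and the log-likelihood of the observed series is the sum of the log-densities of its points, which is exactly where the i.i.d. treatment enters.

```python
import numpy as np

def kde_log_likelihood(x_emp, x_sim):
    """Sum over t of log f~(x_t | theta), with f~ a 1-D Gaussian KDE fitted to x_sim."""
    x_emp = np.asarray(x_emp, dtype=float)
    x_sim = np.asarray(x_sim, dtype=float).ravel()  # pool all replications
    n = x_sim.size
    # Assumption: normal-reference rule-of-thumb bandwidth.
    h = 1.06 * x_sim.std(ddof=1) * n ** (-1 / 5)
    z = (x_emp[:, None] - x_sim[None, :]) / h
    log_kernels = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    # Numerically stable log of the mean kernel value at each observed point.
    m = log_kernels.max(axis=1, keepdims=True)
    log_density = m[:, 0] + np.log(np.exp(log_kernels - m).mean(axis=1))
    return float(log_density.sum())

rng = np.random.default_rng(1)
x_emp = rng.normal(0.0, 1.0, size=200)          # proxy for observed data
sim_near = rng.normal(0.0, 1.0, size=(10, 500))  # toy ensemble at a good theta
sim_far = rng.normal(3.0, 1.0, size=(10, 500))   # toy ensemble at a poor theta
ll_near = kde_log_likelihood(x_emp, sim_near)
ll_far = kde_log_likelihood(x_emp, sim_far)
```

Because the observations are only compared with the pooled unconditional density, shuffling `x_emp` in time leaves the result unchanged, which illustrates the loss of temporal information described above.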

2.3 Likelihood Approximation using Neural Networks

We now begin our discussion of a relatively simple extension to the likelihood approximation procedure proposed by Grazzini et al. (2017) that is capable of capturing some of the dependence of $\bm{x}_{t,i}^{sim}(\bm{\theta})$ on past realised values. As a starting point, we assume that

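$$p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{1,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right)=p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right)$$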

for all $L<t\leq T$ , implying that $\bm{x}_{t,i}^{sim}(\bm{\theta})$ depends only on the past $L$ realised values. Our task, therefore, is the estimation of the above conditional densities,

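$$\tilde{f}\left(\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim},\bm{x}_{t,i}^{sim},\bm{\phi}\right)\simeq p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right),$$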

for all $L<t\leq T$ , where $\bm{\phi}=\bm{\phi}(\bm{\theta})$ are parameters associated with the density estimation procedure.

In our context, we make use of a mixture density network (MDN), a neural network-based approach to conditional density estimation introduced by Bishop (1994). The aforementioned scheme consists of two primary components [Footnote 9: Note that these discussions are primarily illustrative and serve to briefly describe and motivate our approach. A detailed technical description of its implementation is provided in Appendix A.], a mixture of $K$ Gaussian random variables,

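$$\tilde{f}\left(\bm{x},\bm{y},\bm{\phi}\right)=\sum_{k=1}^{K}\alpha_{k}\left(\bm{x}\right)\mathcal{N}\left(\bm{y}\big{|}\bm{\mu}_{k}\left(\bm{x}\right),\bm{\Sigma}_{k}\left(\bm{x}\right)\right),$$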

where we denote $\bm{x}_{t,i}^{sim}$ by $\bm{y}$ and $\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}$ by $\bm{x}$ , and functions $\alpha_{k}$ , $\bm{\mu}_{k}$ and $\bm{\Sigma}_{k}$ of $\bm{x}$ which determine the mixture parameters. Here, $\alpha_{k}$ , $\bm{\mu}_{k}$ and $\bm{\Sigma}_{k}$ are the outputs of a feedforward neural network taking $\bm{x}$ as input and having weights and biases $\bm{\phi}(\bm{\theta})$ , which are determined by training the network on an ensemble of $R$ Monte Carlo replications simulated by the candidate model for parameter set $\bm{\theta}$ . Using the trained MDN, it is then possible to approximate the likelihood of the empirically-sampled data for a given value of $\bm{\theta}$ as follows:

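$$p(\bm{X}|\bm{\theta})=\prod_{t=1}^{T-L}\tilde{f}\left(\bm{x}_{t},\dots,\bm{x}_{t+L-1},\bm{x}_{t+L},\bm{\phi}\right).$$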

While alternative density estimation procedures could potentially have been employed, our consideration of MDNs is motivated primarily by their desirable properties. Specifically, MDNs are, in theory, capable of approximating fairly complex conditional distributions. This follows directly from the fact that mixtures of normal random variables are universal density approximators for sufficiently large $K$ (Scott 2015) and the fact that neural networks are universal function approximators (Hornik et al. 1989), provided they are sufficiently expressive. Therefore, provided that $K$ is sufficiently large and the constructed neural network sufficiently deep (and wide), the above methodology should result in accurate conditional density estimates.
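The conditional likelihood described in this subsection can be illustrated with a short sketch for scalar observations. It does not train a network: `toy_mixture_params` is a hypothetical stand-in for a trained MDN that maps each window of $L$ lagged values to mixture weights, means, and standard deviations, and all function names here are assumptions introduced for the example. The sketch shows how the series is cut into lagged windows and how the mixture density is evaluated and accumulated in log space.

```python
import numpy as np

def lagged_windows(series, L):
    """Split a series into inputs (L consecutive values) and targets (the next value)."""
    series = np.asarray(series, dtype=float)
    T = series.size
    inputs = np.stack([series[t:t + L] for t in range(T - L)])  # shape (T - L, L)
    targets = series[L:]                                        # shape (T - L,)
    return inputs, targets

def mixture_log_likelihood(series, L, mixture_params):
    """Sum over t of log sum_k alpha_k(x) N(y | mu_k(x), sigma_k(x)^2) for scalar y."""
    x, y = lagged_windows(series, L)
    alpha, mu, sigma = mixture_params(x)  # each of shape (T - L, K)
    log_comp = (np.log(alpha) - 0.5 * np.log(2 * np.pi) - np.log(sigma)
                - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
    # Numerically stable log-sum-exp over the K mixture components.
    m = log_comp.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum())

def toy_mixture_params(x):
    """Hypothetical stand-in for a trained MDN: two identical unit-variance
    components centred on the most recent lagged value."""
    n = x.shape[0]
    alpha = np.tile([0.3, 0.7], (n, 1))
    mu = np.repeat(x[:, -1:], 2, axis=1)
    sigma = np.ones((n, 2))
    return alpha, mu, sigma

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(50))  # toy random-walk series
x, y = lagged_windows(series, 3)
ll = mixture_log_likelihood(series, 3, toy_mixture_params)
```

In a full implementation, the stand-in would be replaced by a feedforward network trained on windows drawn from the simulated ensemble for the candidate $\bm{\theta}$, and the same evaluation would then be applied to the empirical series.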

2.4 Method Comparison and Benchmarking

Given that we have now described our proposed estimation methodology, we proceed to discuss our strategy for benchmarking it against the approach of Grazzini et al. (2017), where we follow a similar strategy to that employed in Platt (2019).

We begin by letting $\bm{X}^{sim}(\bm{\theta},T,i)$ be the output of a candidate model, $M$ . Since empirically-observed data is nothing more than a single realisation of the true data-generating process, which may itself be viewed as a model with its own set of parameters, it follows that we may consider $\bm{X}=\bm{X}^{sim}(\bm{\theta}^{true},T^{emp},i^{*})$ as a proxy for real data to which $M$ may be calibrated.

In this case, we are essentially estimating a perfectly-specified model using data for which the true parameter values, $\bm{\theta}^{true}$ , are known. It can be argued that a good estimation method would, in this idealised setting, be able to recover these true values to some extent and that methods which produce estimates closer to $\bm{\theta}^{true}$ would be considered superior. This leads us to define the following loss function

$$LS(\bm{\theta}^{true},\hat{\bm{\theta}})=\left\|\bm{\theta}^{true}-\hat{\bm{\theta}}\right\|_{2},$$

where $\hat{\bm{\theta}}$ is the parameter estimate (posterior mean) produced by a given Bayesian estimation method.

In practice, it is important that both $\hat{\bm{\theta}}$ and $\bm{\theta}^{true}$ are normalised to take values in the interval $[0,1]$ before the loss function value is calculated. This is because even relatively small estimation errors associated with parameters that typically take on larger values will increase the loss function value substantially more than relatively large estimation errors associated with parameters that typically take on smaller values if no normalisation is performed. Therefore, for each free parameter, $\theta_{j}\in[a,b]$ , we set

$$\hat{\theta}_{j}^{[0,1]}=\frac{\hat{\theta}_{j}-a}{b-a},$$

with an analogous transformation being applied to $\theta^{true}_{j}$ .
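As a worked illustration of the loss function and normalisation, the sketch below rescales both the estimate and the true values to $[0,1]$ and then takes the Euclidean distance. The function name `normalised_loss` and the parameter bounds are assumptions made here; the explored ranges are not stated in this section, so unit-width bounds are used purely for the example. The true values and posterior means are taken from the MDN rows for Param Set 1 in Table 1.

```python
import numpy as np

def normalised_loss(theta_true, theta_hat, lower, upper):
    """Euclidean distance between estimate and truth after rescaling to [0, 1]."""
    theta_true, theta_hat, lower, upper = (
        np.asarray(v, dtype=float) for v in (theta_true, theta_hat, lower, upper)
    )
    width = upper - lower
    true01 = (theta_true - lower) / width
    hat01 = (theta_hat - lower) / width
    return float(np.linalg.norm(true01 - hat01))

theta_true = [-0.7, -0.4, 0.5, 0.3]
theta_hat = [-0.6931, -0.4048, 0.5505, 0.3160]
# Hypothetical unit-width bounds [a, b] for each parameter (an assumption).
lower = [-1.0, -1.0, 0.0, 0.0]
upper = [0.0, 0.0, 1.0, 1.0]
loss = normalised_loss(theta_true, theta_hat, lower, upper)
```

Under these assumed unit-width bounds the normalisation leaves the differences unchanged, and the result is roughly 0.0536, the same value reported for that case in Table 1. With wider ranges, each parameter's error would be divided by its range width before entering the norm.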

The above allows us to develop a series of benchmarking exercises in which we compare the loss function values associated with our proposed method and that of Grazzini et al. (2017) for a number of different models, free parameter sets, and $\bm{\theta}^{true}$ values [Footnote 10: While the constructed loss function will act as our primary metric, we will also consider a number of other relevant criteria, such as the standard deviation of the obtained posteriors.]. In all of these comparative exercises, we aim to ensure that the overall conditions of the experiments are consistent throughout, regardless of the method used to approximate the likelihood. Therefore, in all cases, we set the length of the proxy for real data to be $T_{emp}=1000$ , the number of Monte Carlo replications in the simulated ensembles to be $R=100$ , the length of each series in the simulated ensembles to be $T_{sim}=1000$ , and the priors for all free parameters to be uniform over the explored parameter ranges. Additionally, we have also used the same lag length, $L=3$ , for all estimation attempts involving our neural network-based method. While seemingly arbitrary, this choice has very clear motivations that are discussed in detail in Section 5.1.

Finally, the MCMC algorithm used to sample the posterior and its associated hyperparameters remain unchanged in all experiments. Rather than using a standard random walk Metropolis-Hastings algorithm, we have instead employed the adaptive scheme proposed by Griffin and Walker (2013), which allows for more effective initialisation, faster convergence, and better handling of multimodal posteriors. [Footnote 11: A complete description of the procedure is presented in Appendix B.]

3 Candidate Models

With our estimation and benchmarking strategies now described, we introduce the candidate models that we attempt to estimate. Their selection is primarily justified by their ubiquity; each has appeared in a number of calibration and estimation studies [Footnote 12: For example, the Brock and Hommes (1998) model is considered by Recchioni et al. (2015), Lamperti et al. (2018), and Kukacka and Barunik (2017) and the Franke and Westerhoff (2012) model is considered by Franke and Westerhoff (2012) and Lux (2018).], leading them to become standard test cases in the field. While computationally inexpensive to simulate, most are capable of producing nuanced dynamics and thus still prove to be a reasonable challenge for most contemporary estimation approaches. Since our focus here is the benchmarking of the proposed estimation procedure as opposed to estimating the candidate models using empirical data, our discussion will be relatively brief. In empirical investigations, however, it would be necessary to provide some justification that the chosen models were reasonable representations of the considered systems.

3.1 Brock and Hommes (1998) Model

The first model we introduce, and by far the most popular in the existing literature, is the Brock and Hommes (1998) model, an early example of a class of simulation models that attempt to model the trading of assets on an artificial stock market by simulating the interactions of heterogeneous traders that follow various trading strategies.

We focus on a particular version of the model that can be expressed as a system of coupled equations [Footnote 13: The interested reader should refer to Brock and Hommes (1998) for a detailed discussion of the model’s underlying assumptions and the derivation of the above system of equations.],

[TABLE]

where $y_{t}$ is the asset price at time $t$ (in deviations from the fundamental value $p_{t}^{*}$ ), $n_{h,t}$ is the fraction of trader agents following strategy $h\in\left\{1,2,\dots,H\right\}$ at time $t$ , and $R=1+r$ .

Each strategy, $h$ , has an associated trend following component, $g_{h}$ , and bias, $b_{h}$ , both of which are real-valued parameters. The model also includes positive-valued parameters that affect all trader agents, regardless of the strategy they are currently employing, specifically $\beta$ , which controls the rate at which agents switch between various strategies, and the prevailing market interest rate, $r$ .

Finally, assuming an i.i.d. dividend process, the fundamental value $p_{t}^{*}=p^{*}$ is constant, allowing us to obtain the asset price at time $t$ ,

[TABLE]

3.2 Random Walks with Structural Breaks

The second model we consider is a random walk capable of replicating simple structural breaks, defined according to

[TABLE]

where

[TABLE]

Unlike the Brock and Hommes (1998) model, the above is not a representation of a real-world system, but rather an artificially-constructed test example designed to challenge estimation methodologies. [Footnote 14: This particular instantiation of the model was first used by Lamperti (2017) to test an information-theoretic criterion called the GSL-div.] Its inclusion is justified on the grounds that, as previously discussed, many large-scale ABMs produce dynamics that are characterised by structural breaks and the fact that it allows us to compare our approach against that of Grazzini et al. (2017) in cases where the considered data demonstrates clear temporal changes in the prevailing dynamics.
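As an illustration, a random walk with a single structural break can be simulated in a few lines. The sketch below assumes the form $x_{t}=x_{t-1}+d_{t}+\epsilon_{t}$ with $\epsilon_{t}\sim\mathcal{N}(0,\sigma_{t}^{2})$, where the drift and volatility switch from $(d_{1},\sigma_{1})$ to $(d_{2},\sigma_{2})$ after period $\tau$. The function name, the convention that period $\tau$ itself belongs to the first regime, and the parameter values in the example are assumptions made purely for illustration.

```python
import numpy as np

def simulate_break_walk(d1, d2, sigma1, sigma2, tau=700, T=1000, x0=0.0, seed=0):
    """Simulate x_t = x_{t-1} + d_t + eps_t with one regime change after tau."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    drift = np.where(t <= tau, d1, d2)          # pre- and post-break drift
    vol = np.where(t <= tau, sigma1, sigma2)    # pre- and post-break volatility
    increments = drift + vol * rng.standard_normal(T)
    return x0 + np.cumsum(increments)

# Hypothetical parameter values, chosen only to produce a visible break.
x = simulate_break_walk(d1=0.1, d2=0.4, sigma1=1.0, sigma2=2.0)
dx = np.diff(x)  # first differences, which remove the unit root
```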

3.3 Franke and Westerhoff (2012) Model

The final model we discuss shares a number of conceptual similarities with the previously described Brock and Hommes (1998) model, being a heterogeneous agent model that simulates the interactions of traders following a number of trading strategies. It is, however, different in a number of key areas, particularly in how the probability of an agent switching from one strategy to another is determined and in its incorporation of only two trader types, chartists and fundamentalists.

As in the case of the Brock and Hommes (1998) model, the core elements of the model can be expressed as a system of coupled equations

[TABLE]

where $p_{t}$ is the log asset price at time $t$ , $p^{*}$ is the log of the (constant) fundamental value, $n_{t}^{f}$ and $n_{t}^{c}$ are the market fractions of fundamentalists and chartists respectively at time $t$ , $d_{t}^{f}$ and $d_{t}^{c}$ are the corresponding average demands, and the remaining symbols all correspond to positive-valued parameters.

At this point, it is worth pointing out that Franke and Westerhoff (2012) do not introduce a single model, but rather a family of related formulations built on the same foundation (Eqns. 18-22). These models differ in how they define $a_{t}$, the attractiveness of fundamentalism relative to chartism at the end of period $t$, and incorporate a number of different mechanisms, including wealth, herding and price misalignment. This makes the consideration of multiple versions of the model worthwhile and we thus consider two of the proposed versions [Footnote 15: $\alpha_{n}$, $\alpha_{w}$, and $\alpha_{p}$ are strictly positive while $\alpha_{0}$ may take on any real value.]:

[TABLE]

referred to as herding, predisposition and misalignment (HPM), and

[TABLE]

referred to as wealth and predisposition (WP).

As a final remark, we consider $r_{t}=p_{t}-p_{t-1}$ , the log return process, rather than $p_{t}$ in our estimation attempts.

4 Results and Discussion

4.1 Brock and Hommes (1998) Model

We now proceed with the presentation of the results of our comparative experiments, beginning with the Brock and Hommes (1998) model. [Footnote 16: From this point onwards, we use KDE to refer to the method of Grazzini et al. (2017) and MDN to refer to our proposed method in all tables and figures.]

In these experiments, we consider a market with $H=4$ trading strategies and focus on estimating $g_{2}$, $b_{2}$, $g_{3}$, and $b_{3}$, the trend following and bias components for two of these strategies. For the first free parameter set, we consider $g_{2},b_{2}\in[-1,0]$ and $g_{3},b_{3}\in[0,1]$, corresponding to a contrarian strategy with a negative bias and a trend following strategy with a positive bias respectively. For the second free parameter set, we instead consider $g_{2},b_{2},g_{3}\in[0,1]$ and $b_{3}\in[-1,0]$, corresponding to trend following strategies with positive and negative biases respectively.

Referring to Figure 1, we observe that, for the first free parameter set, there is a pronounced difference in performance between our proposed methodology and that of Grazzini et al. (2017). While both approaches perform similarly when estimating the bias components, our proposed procedure results in marginal posteriors for $g_{2}$ and $g_{3}$ that not only have means noticeably closer to the true parameter values, but are also significantly narrower and more peaked, with their density concentrated in a smaller region of the parameter space. This can be seen as indicative of reduced estimation uncertainty.

Table 1 elaborates on these findings and reveals that similar behaviours also emerge in the case of the second free parameter set. Specifically, we find that the posterior means ( $\bm{\mu}_{posterior}$ ) for both methods result in more or less equivalent estimates for $b_{2}$ and $b_{3}$ , while the posterior mean for our proposed method appears to result in noticeably superior estimates for $g_{2}$ and $g_{3}$ in both cases, ultimately leading to lower loss function values. We also observe that our approach results in reduced posterior standard deviations ( $\bm{\sigma}_{posterior}$ ) consistently for all free parameters, in line with our observation of reduced estimation uncertainty in Figure 1.

In Appendix B, where we describe the method used to sample the posteriors, we indicate that we run the procedure multiple times with different initial conditions and combine the obtained samples into a single, larger sample from which we estimate $\bm{\mu}_{posterior}$ and $\bm{\sigma}_{posterior}$. We can, however, estimate the posterior mean for each of these runs individually and determine the standard deviation of $\bm{\mu}_{posterior}$ across the instantiations of the algorithm, which we call $\bm{\sigma}_{sample}$. As shown in Table 1, this standard deviation is generally very small for both methods, suggesting that the posterior mean estimates are generally robust. [Footnote 17: This is true for all free parameter sets and models considered in this investigation.]
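The three summary statistics discussed above can be computed directly from the per-run MCMC output. The following is a minimal sketch, assuming each run is stored as an array with one row per posterior draw and one column per free parameter; the function name, the use of the sample (rather than population) standard deviation, and the synthetic draws in the example are all assumptions.

```python
import numpy as np

def summarise_runs(runs):
    """runs: list of arrays of shape (n_draws, n_params), one per MCMC run."""
    pooled = np.concatenate(runs, axis=0)
    mu_posterior = pooled.mean(axis=0)               # mean of the combined sample
    sigma_posterior = pooled.std(axis=0, ddof=1)     # spread of the combined sample
    run_means = np.stack([r.mean(axis=0) for r in runs])
    sigma_sample = run_means.std(axis=0, ddof=1)     # spread of per-run means
    return mu_posterior, sigma_posterior, sigma_sample

# Synthetic stand-in for five independent runs over four free parameters.
rng = np.random.default_rng(0)
runs = [rng.normal(loc=0.5, scale=0.1, size=(2000, 4)) for _ in range(5)]
mu_posterior, sigma_posterior, sigma_sample = summarise_runs(runs)
```

A small $\bm{\sigma}_{sample}$ relative to $\bm{\sigma}_{posterior}$ indicates that the individual runs agree on the location of the posterior mean.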

4.2 Random Walks with Structural Breaks

Moving on from the Brock and Hommes (1998) model, we now discuss the estimation of a random walk incorporating a structural break. In these experiments, we consider a fixed structural break location, $\tau=700$ [Footnote 18: This induces a degree of asymmetry in the data and results in a more challenging and realistic estimation problem than $\tau=500$.], and determine the extent to which both methods are capable of estimating the pre- and post-break drift, $d_{1},d_{2}\in[0,1]$, and volatility, $\sigma_{1},\sigma_{2}\in[0,10]$, for differing underlying changes in the dynamics. While the loss function described in Section 2.4 will still be used as our primary metric, we note that since the considered free parameters directly define the dynamics that characterise the different regimes of the data, it would also be worthwhile to assess the extent to which the competing approaches are able to correctly identify the relationships between the parameters and hence the shift in the pre- and post-break dynamics ($\Delta_{d}$ and $\Delta_{\sigma}$).

Before proceeding, however, there are a number of nuances that should be highlighted. Being a random walk, the model clearly produces non-stationary time series and therefore violates a key assumption of the method of Grazzini et al. (2017). For this reason, it is necessary to consider the series of first differences, $x_{t}-x_{t-1}$, rather than $x_{t}$ itself. While our approach does not make stationarity assumptions, we have nonetheless considered the series of first differences when applying both methods to make the comparison as fair as possible. It should also be noted that we have assumed the location of the structural break to be unknown or difficult to determine a priori (as is the case in most practical problems), meaning that we apply both estimation approaches to the full time series data to estimate both the pre- and post-break parameters simultaneously. If, however, the location of the structural break were known, it would be possible to estimate the relevant parameters separately using appropriate subsets of the data, a less challenging undertaking that we do not consider here.

Now, referring to Table 2, we see that both our proposed estimation methodology and that of Grazzini et al. (2017) perform similarly well when attempting to estimate the pre- and post-break volatility, with both producing reasonable estimates for the free parameters and both being able to identify the correct shift in the dynamics. Referring to Tables 3 and 4, however, we see that more pronounced differences emerge when attempting to estimate the pre- and post-break drift. While this is clearly evident from the fact that the loss function values associated with our proposed methodology are noticeably lower in all cases, a more detailed analysis reveals further distinctions worth mentioning. Table 3, which presents the results for cases involving an increasing drift, reveals that our proposed methodology has correctly identified an increasing trend in both cases and has also correctly identified that the increase in drift for parameter set $4$ is three times that of parameter set $3$. In contrast to this, the method of Grazzini et al. (2017) incorrectly suggests a decreasing trend in both cases. Table 4, which presents the results for cases involving a decreasing drift, similarly shows that our proposed methodology delivers superior performance when attempting to identify the change in drift.

This change in the relative performances of each method when estimating the drift rather than the volatility is a direct consequence of the relationship between the deterministic and stochastic components of the model. For the selected parameter ranges, the random fluctuations, $\epsilon_{t}$ , dominate the evolution of the model, with the drift producing a more subtle effect, particularly after the structural break occurs. For this reason, correctly estimating the pre- and post-break volatility is a far less challenging task than estimating the pre- and post-break drift. Therefore, while both methods perform well when estimating parameters associated with dominant effects like volatility, our method’s incorporation of dependence on previously observed values seems to be important when estimating parameters related to more nuanced and less dominant aspects of a model.

4.3 Franke and Westerhoff (2012) Model

As stated in Section 3.3, the final model we consider has a number of alternate configurations differing in how the attractiveness of fundamentalism relative to chartism, $a_{t}$, is determined during each period. For this reason, we consider two of these configurations, HPM and WP, and focus on estimating the parameters associated with the rules governing $a_{t}$: $\alpha_{n}\in[0,2]$, $\alpha_{0}\in[-1,1]$, $\alpha_{p}\in[0,20]$, $\alpha_{w}\in[0,15000]$, and $\eta\in[0,1]$, while also estimating the standard deviation of the noise term appearing in the chartist demand equation, $\sigma_{c}\in[0,5]$. [Footnote 19: We originally attempted to estimate $\sigma_{f}$ as well, but found this to exhibit a degree of collinearity with $\sigma_{c}$.]

Referring to Table 5, we see that our proposed estimation methodology appears slightly more effective than that of Grazzini et al. (2017) for the HPM parameter set, producing superior estimates for all but one of the considered free parameters and resulting in a lower loss function value. Nevertheless, the estimates produced by the two methods do not differ substantially. By contrast, in a trend seemingly analogous to that observed in the random walk experiments, the differences in performance are more pronounced for the WP parameter set. In particular, we see a substantial difference in the loss function values associated with each method, brought about by differences in the quality of estimates produced for $\eta$.

As illustrated in Figure 2, the method of Grazzini et al. (2017) produces a wide posterior for $\eta$ that is dispersed across the entirety of the explored parameter range, which results in a relatively poor estimate. In contrast to this, we see that the proposed methodology fares better, producing a far narrower posterior and a significantly more accurate estimate. While it is nontrivial to identify any definitive causes for the observed behaviours due to the nonlinear nature of heterogeneous agent models, it is worth pointing out that the inclusion of wealth dynamics in the WP version of the model introduces a dependence of $a_{t}$ on the previous return via Eqns. 24-26, which may in turn increase the strength of the relationship between the current and previously observed values in the log return time series.

As a final remark, notice that for the vast majority of the free parameters considered, the proposed methodology also results in lower posterior standard deviations, as was the case for the Brock and Hommes (1998) model.

4.4 Overall Summary

In the preceding subsections, we have focused primarily on analysing the results on a case-by-case basis. Here, however, we provide a summative comparison across all of the considered models. This is achieved through the consideration of a number of key performance metrics, presented in Table 6, which compare the approaches at both a global and individual parameter level.

The first of the aforementioned metrics, and the most important, $LS_{mdn}<LS_{kde}$ , indicates how often the proposed methodology results in lower loss function values, and hence measures its relative ability to recover the true parameter set. We observe that in all cases considered, our methodology results in lower loss function values, which can be seen as indicative of dominance at the global level.

The second metric, $|\mu_{mdn}^{i}-\theta_{true}^{i}|<|\mu_{kde}^{i}-\theta_{true}^{i}|$ , determines how often our proposed methodology produces superior estimates for individual parameters in a free parameter set. In some situations, one might find that the estimates obtained for a subset of the free parameters by the method of Grazzini et al. (2017) are superior, even if the overall estimate for the entire free parameter set is not as good. Nevertheless, we find that in over $80\%$ of cases, our methodology also results in superior estimates at the level of individual parameters, a comfortable majority. It should also be noted that in virtually all situations where $|\mu_{mdn}^{i}-\theta_{true}^{i}|>|\mu_{kde}^{i}-\theta_{true}^{i}|$ , such as some cases of $b_{2}$ and $b_{3}$ in the Brock and Hommes (1998) model, and $\sigma_{1}$ and $\sigma_{2}$ in the random walk model, the differences in the estimates produced by both methods are incredibly small. In contrast to this, a sizeable number of cases where $|\mu_{mdn}^{i}-\theta_{true}^{i}|<|\mu_{kde}^{i}-\theta_{true}^{i}|$ , such as $g_{2}$ and $g_{3}$ in the Brock and Hommes (1998) model, and $\eta$ in the Franke and Westerhoff (2012) model, are characterised by comparatively large differences in the estimates obtained by the competing approaches. This suggests that our proposed methodology also demonstrates a degree of dominance at the level of individual parameters.

The final metric, $\sigma_{mdn}^{i}<\sigma_{kde}^{i}$, indicates how frequently our proposed methodology results in reduced posterior standard deviations for individual parameters, which occurs in slightly below $80\%$ of the considered cases, again a comfortable majority. [Footnote 20: On closer inspection, it appears that our methodology results in reduced posterior standard deviations more often for parameter sets consisting of more than $2$ free parameters, which may hint at the possibility of the uncertainty of estimation increasing less rapidly for our approach than for the method of Grazzini et al. (2017) as the number of free parameters is increased. Ultimately, further investigation would be required to verify this hypothesis.]

Based on the evidence presented by the above metrics as a whole, it would appear that our proposed methodology does indeed compare favourably to that of Grazzini et al. (2017), which was itself already shown to dominate a number of other contemporary approaches in the literature by Platt (2019). This ultimately validates our method as a worthwhile addition to the growing toolbox of estimation methods for economic simulation models.

5 Practical Considerations

5.1 Choosing the Lag Length

As previously stated, we set $L=3$ in all estimation experiments involving our proposed method. Naturally, one may wonder whether this is an arbitrary choice or if there is a systematic way of choosing $L$ . Similarly, one may also wonder if the obtained results are robust to this choice, even if only to some extent. We now address both issues.

When applying the proposed methodology, we observed a phenomenon that appeared to be relatively consistent throughout the experiments. In more detail, we observe that while increasing $L$ initially has a pronounced effect on the estimated conditional densities, there exists some $L^{*}\geq 0$ such that for $L\geq L^{*}$ ,

[TABLE]

or, in other words, the MDN essentially ignores the additional lags.

We illustrate this graphically in Figure 3. Here, we train an MDN on $100$ realisations of length $1000$ generated using the Brock and Hommes (1998) model initialised using parameter set $1$ . We then randomly draw an arbitrary sequence of $6$ consecutive values from a time series of length $1000$ , also generated by the Brock and Hommes (1998) model. This then allows us to use the MDN to plot the conditional density functions for differing choices of $L$ , conditioned on the values generated in the previous step, and observe the aforementioned trend.

Repeating this exercise on models for which the true lag, $L_{true}$, is known a priori (see Figures 4 and 5), we see that $L^{*}=L_{true}$. This has a number of important implications. Firstly, it implies that plots of the type we have constructed here can be used as a means to systematically inform the choice of $L$ for arbitrary models. Secondly, and perhaps more importantly, it implies that if $L\geq L_{true}$, the procedure should demonstrate at least some robustness to the choice of lag, provided that the MDN is sufficiently expressive and sufficiently well-trained. This explains why simply setting $L=3$ resulted in a high level of estimation performance in our experiments, regardless of the considered model, since the models considered are not characterised by long-range dependencies. [Footnote 21: The interested reader should refer to Appendix C for additional discussions.]

5.2 Computational Costs

At this point, one may ask whether the proposed estimation routine compares favourably to other contemporary alternatives in terms of computational costs. As stated by Grazzini et al. (2017), the cost of generating simulated data using a candidate model is generally dominant, particularly for large-scale models that may need to be run for several minutes in order to generate a single realisation. It is therefore imperative that any estimation methodology keep the simulated ensemble size, which we call $R$ , to a minimum.

As previously stated, we have selected $R=100$, which results in a relatively large training set of $R(T_{sim}-L)=99700$ training examples. This compares favourably to most alternatives in the literature on a number of grounds. Firstly, most studies which have attempted to estimate models of similar complexity make use of ensembles consisting of a far greater number of realisations, typically in excess of $R=1000$ (Barde 2017; Lamperti 2017; Lux 2018). Secondly, the training set associated with $R=100$ is already large relative to the complexity of the network architecture we employ. [Footnote 22: See Appendix A.4.]

To illustrate this point, we repeat the experiments associated with parameter set $1$ of the Brock and Hommes (1998) model, changing only the simulated ensemble size, which has been halved to $R=50$. We find that even with this drastic decrease in the number of Monte Carlo replications, the proposed methodology still performs well and results in a lower loss function value than was obtained using the method of Grazzini et al. (2017) in the original experiments, with a ratio of $LS_{MDN}/LS_{KDE}=0.7249$. [Footnote 23: Here $LS_{KDE}$ is determined from the results of the original experiment involving the method of Grazzini et al. (2017) with $R=100$, while $LS_{MDN}$ is determined from the results of the supplementary experiment involving our proposed methodology with $R=50$.] This provides some evidence that even for greatly reduced ensemble sizes, our approach remains viable, and implies that the complexity of the candidate model and hence the employed neural network would likely need to be increased substantially before any increase in $R$ beyond $100$ is required.

In addition to concerns related to the size of the simulated ensemble, it is also worthwhile to consider the actual computational costs of the neural network training procedure relative to those associated with the generation of a single model realisation. For this reason, Figure 6 demonstrates the total training time required by various neural network configurations, most of which are larger than that of the network employed in this investigation, which typically takes $\sim 5$ seconds to be completely trained. We find that even for substantially more complex neural networks than those considered in our investigation, the overall training time is still typically less than $40$ seconds, which compares favourably to the simulation time of large-scale models, and we additionally find that the increase in computational time is linear for both increases in the lag length and network width.

Further, it should be noted that GPU parallelisation was not employed when generating the aforementioned computational cost diagrams. Given the significant speedup that could be expected with the use of such hardware, typically in the region of $20\times$ (Oh and Jung 2004), we find there to be at least some evidence that the time taken to train the neural network will generally be negligible in comparison to the time taken to generate a single model realisation, even for far more sophisticated neural networks and candidate models. This would, however, require further testing that is beyond the scope of this investigation and we thus suggest that the proposed routine be applied to more sophisticated models in future work.

6 Conclusion

In the preceding sections, we have introduced a neural network-based protocol for the Bayesian estimation of economic simulation models (with a particular focus on ABMs) and demonstrated its estimation capabilities relative to a leading method in the existing literature.

Overall, we find that our method delivers compelling performance in a number of scenarios, including the estimation of heterogeneous agent models typically used to test estimation procedures, and less orthodox examples, such as identifying dynamic shifts in data generated by a random walk model. In all of the cases tested, we find that our proposed methodology produces estimates closer to known ground truth values than the approach proposed by Grazzini et al. (2017) and also find that it typically results in narrower and more sharply peaked posteriors for larger free parameter sets.

In addition to our primary findings, we also discuss practical issues related to the applicability of the proposed routine. We demonstrate that the lag length, which can be viewed as our approach’s primary hyperparameter, can be systematically chosen and that the overall estimation performance demonstrates at least some robustness to this choice. Further, we provide a number of arguments as to the protocol’s computational efficiency relative to a number of prominent alternatives in the literature and therefore suggest that attempts be made to apply it to models of a larger scale in future research.

Acknowledgements

The author would like to thank J. Doyne Farmer for helpful discussions that greatly aided the process of preparing this manuscript and the UK government for the award of a Commonwealth Scholarship. Responsibility for the conclusions herein lies entirely with the author.

Appendix A Technical Details of the Proposed Estimation Procedure

While we presented an overview of our estimation procedure in Section 2, the associated discussions were primarily illustrative and omitted several key details. We thus provide a more technical, step-by-step discussion of our approach in this section.

A.1 Training Set Construction

The primary aim of our methodology is the construction of an approximation to the likelihood function for a given set of parameter values, $\bm{\theta}$ . In order to facilitate this process, we make the simplifying assumption that $\bm{x}^{sim}_{t,i}(\bm{\theta})$ depends only on $\bm{x}^{sim}_{t-L,i}(\bm{\theta}),\dots,\bm{x}^{sim}_{t-1,i}(\bm{\theta})$ , for all $L<t\leq T$ . Our problem therefore reduces to the estimation of conditional densities of the form $p\left(\bm{x}_{t,i}^{sim}\big{|}\bm{x}_{t-L,i}^{sim},\dots,\bm{x}_{t-1,i}^{sim}:\bm{\theta}\right)$ .

In order to estimate the above conditional densities, we will require an appropriate dataset, which is constructed in a number of stages. The first of these stages involves the use of the candidate model to generate an ensemble of $R$ Monte Carlo replications, $\bm{X}^{sim}(\bm{\theta},T^{sim},i),i=i_{0},i_{0}+1,\dots,i_{0}+R-1$ , for a given value of $\bm{\theta}$ . This is then followed by the construction of two ordered sets for each Monte Carlo replication $i$ in the ensemble,

[TABLE]

and

[TABLE]

Finally, the sets $\bm{X}^{train}_{i}(\bm{\theta}),i=i_{0},i_{0}+1,\dots,i_{0}+R-1$ are concatenated, in order, to produce a single, larger ordered set, $\bm{X}^{train}(\bm{\theta})$ , with an analogous procedure being applied to $\bm{Y}^{train}_{i}(\bm{\theta})$ to yield $\bm{Y}^{train}(\bm{\theta})$ .

In essence, $\bm{X}^{train}(\bm{\theta})$ consists of rolling windows of length $L$ drawn from the ensemble of Monte Carlo replications, while $\bm{Y}^{train}(\bm{\theta})$ consists of the $\bm{x}^{sim}_{t,i}(\bm{\theta})$ values that directly follow each window in $\bm{X}^{train}(\bm{\theta})$ . Together, they form a training set of size $R(T-L)$ that can be used to approximate the required conditional densities.
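The rolling-window construction described above can be sketched as follows. This is a minimal illustration that assumes a univariate model output and an ensemble stored as an array with one row per Monte Carlo replication; the function name is hypothetical and the random ensemble merely stands in for simulated model output.

```python
import numpy as np

def build_training_set(ensemble, L):
    """ensemble: array of shape (R, T). Returns X of shape (R*(T-L), L) and
    Y of shape (R*(T-L),), concatenated replication by replication, in order."""
    ensemble = np.asarray(ensemble, dtype=float)
    R, T = ensemble.shape
    # Each window holds L consecutive values; its target is the value that follows.
    X = np.concatenate(
        [np.stack([series[t - L:t] for t in range(L, T)]) for series in ensemble]
    )
    Y = np.concatenate([series[L:] for series in ensemble])
    return X, Y

# Placeholder ensemble with R = 100 replications of length T = 1000.
rng = np.random.default_rng(0)
ensemble = rng.standard_normal((100, 1000))
X_train, Y_train = build_training_set(ensemble, L=3)
print(X_train.shape, Y_train.shape)
```

With $R=100$, $T=1000$ and $L=3$, this yields the $R(T-L)=99700$ training examples used in the experiments.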

A.2 Neural Network Specification and Training

With an appropriate dataset now constructed, we proceed with a more detailed discussion of the MDN itself.

As a starting point, let $H$ be a feedforward neural network with input layer $\bm{h}_{0}$ (taking in windows of length $L$ ), hidden layers $\bm{h}_{1},\bm{h}_{2},\dots,\bm{h}_{n-1}$ , output layer $\bm{h}_{n}$ , and weights and biases $\bm{\psi}$ . The mixture parameters are then defined as

[TABLE]

and

[TABLE]

where $diag(\bm{x})$ is a diagonal matrix with diagonal $\bm{x}$ and

[TABLE]

This results in an expanded neural network with weights and biases

[TABLE]

that takes windows of length $L$ as input and outputs $\bm{\alpha}$ , $\bm{\mu}_{k}$ , and $\bm{\Sigma}_{k}$ as defined above.

At this stage, there are a number of nuances worth highlighting. In Eqn. 30, notice that we make use of the $softmax$ function. This ensures that the mixture weights, $\bm{\alpha}$, are strictly positive and sum to one, as required. Additionally, notice that in Eqn. 32 we consider a diagonal rather than a full covariance matrix. [Footnote 24: It should be noted that the universal density approximation properties of Gaussian mixtures still apply for diagonal covariance matrices.] If we had not made such an assumption, we would have to ensure that the covariance matrices returned by our neural network were positive definite. Though possible in principle, this would significantly increase the number of network parameters and have a potentially detrimental effect on computational performance [Rothfuss et al., 2019]. Finally, it should be apparent from Eqn. 33 that the neural network outputs a vector of log variances rather than the diagonal covariance matrix, allowing us to avoid imposing positivity constraints on the network output.
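The way in which unconstrained network outputs are turned into valid mixture parameters can be illustrated with a short sketch. The layout below, in which a single raw output vector is split into $K$ logits, $K$ mean vectors and $K$ log-variance vectors for a $D$-dimensional target, is an assumption made for illustration and need not match the exact parameterisation of the network; the function name is likewise hypothetical.

```python
import numpy as np

def mixture_parameters(raw, K, D):
    """Map a raw output vector of length K + 2*K*D to (alpha, mu, Sigma)."""
    raw = np.asarray(raw, dtype=float)
    if raw.size != K + 2 * K * D:
        raise ValueError("raw output has the wrong length")
    logits, means, log_vars = np.split(raw, [K, K + K * D])
    # Softmax: weights are strictly positive and sum to one.
    logits = logits - logits.max()          # shift for numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum()
    mu = means.reshape(K, D)
    # Exponentiating log variances guarantees positive diagonal entries.
    variances = np.exp(log_vars).reshape(K, D)
    Sigma = np.stack([np.diag(v) for v in variances])
    return alpha, mu, Sigma

alpha, mu, Sigma = mixture_parameters(np.zeros(6), K=2, D=1)
```

Because both transformations are smooth and unconstrained in their inputs, the network can be trained with standard gradient-based optimisers without any projection step.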

Now, all that remains is the training of our constructed network, which is achieved through the application of maximum likelihood estimation to our training set. Denoting by $\bm{X}_{m}^{train}$ the $m$ -th entry in $\bm{X}^{train}(\bm{\theta})$ (with $\bm{Y}_{m}^{train}$ being similarly defined), maximum likelihood estimation is equivalent to solving

[TABLE]

using stochastic gradient descent methods.
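For concreteness, the training objective can be written as the negative log-likelihood of the training targets under the predicted mixtures, whose minimisation is equivalent to maximising the likelihood. The sketch below evaluates this quantity with NumPy for given mixture parameters. It assumes diagonal covariances stored as variance vectors, uses hypothetical function names, and does not perform the gradient-based optimisation itself.

```python
import numpy as np

def mixture_log_density(y, alpha, mu, var):
    """log sum_k alpha_k N(y; mu_k, diag(var_k)), with y of shape (D,),
    alpha of shape (K,), and mu, var of shape (K, D)."""
    y, alpha, mu, var = (np.asarray(v, dtype=float) for v in (y, alpha, mu, var))
    # Log density of each diagonal Gaussian component, shape (K,).
    log_comp = -0.5 * (np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var).sum(axis=1)
    a = np.log(alpha) + log_comp
    m = a.max()                              # log-sum-exp for numerical stability
    return float(m + np.log(np.exp(a - m).sum()))

def negative_log_likelihood(Y, alphas, mus, variances):
    """Sum of -log p(Y_m | X_m) over examples, given per-example mixture parameters."""
    return -sum(
        mixture_log_density(y, a, m, v)
        for y, a, m, v in zip(Y, alphas, mus, variances)
    )
```

In practice, the mixture parameters for each example would be produced by feeding the corresponding window through the network, and an automatic differentiation framework would supply the gradients.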

A.3 Data Normalisation and Regularisation

While the scheme we have just described could be applied as is, it is likely to perform suboptimally in its current form. This is because neural networks, like most machine learning techniques with a large number of free parameters, have a tendency to overfit the training data and thus perform poorly out-of-sample, particularly when the training set is small [Murphy, 2012]. In practice, this is often addressed using early stopping, a technique that requires a percentage of the data to be kept separate from the training set in order to evaluate out-of-sample performance during each epoch [Prechelt, 1998]. Such a solution is, however, undesirable in our context, since it requires the generation of additional data, an expensive undertaking for large-scale simulation models.

Fortunately, Rothfuss et al. [2019] present a set of best practices for conditional density estimation using neural networks that provides an alternative solution for overfitting. In particular, a technique called noise regularisation is employed, in which small random perturbations are applied to the data during the training process. It can be shown that this ultimately results in a complexity penalty that favours smoother density estimates that are less prone to overfitting [Rothfuss et al., 2019]. For this reason, we apply Gaussian perturbations to training examples in $\bm{X}^{train}(\bm{\theta})$ and $\bm{Y}^{train}(\bm{\theta})$ , which we denote by

[TABLE]

respectively.
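A minimal sketch of noise regularisation follows. It assumes zero-mean, independent Gaussian noise with standard deviations $\eta_{x}$ and $\eta_{y}$ on every coordinate, and that fresh noise is drawn each time a batch is used; both the helper names and the choice to resample per batch are assumptions rather than details confirmed by the text.

```python
import random

def perturb(example, eta, rng):
    """Add independent N(0, eta^2) noise to every coordinate of one example."""
    return [v + rng.gauss(0.0, eta) for v in example]

def noisy_batch(xs, ys, eta_x, eta_y, rng):
    """Return perturbed copies of a batch; the stored training data are untouched."""
    noisy_xs = [perturb(x, eta_x, rng) for x in xs]
    noisy_ys = [perturb(y, eta_y, rng) for y in ys]
    return noisy_xs, noisy_ys
```

Because the perturbed copies are created on the fly, no additional simulated data need to be generated, which is the property that makes the technique attractive here.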

It should be apparent that the degree of regularisation depends directly on the magnitudes of the standard deviations $\eta_{x}$ and $\eta_{y}$ relative to the range of variation in the training data. As an example, setting $\eta_{x}=0.5$ would result in a substantial amount of regularisation for training examples that take values in $[0,1]$, while having essentially no effect for training examples taking values in $[0,1000]$. This implies that $\eta_{x}$ and $\eta_{y}$ would have to be adjusted for each candidate model in order to result in the same degree of regularisation. Rothfuss et al. [2019] therefore propose a data normalisation scheme that ensures the training data exhibits zero mean and unit variance, eliminating the need to retune these hyperparameters for each candidate model. This is achieved through the application of a simple transformation to each training example.

Letting $\hat{\bm{\mu}}_{x}$ and $\hat{\bm{\sigma}}_{x}$ be vectors that contain estimates of the mean and standard deviation along each dimension for training examples in $\bm{X}^{train}(\bm{\theta})$ , this transformation is given by

[TABLE]

with $\hat{\bm{\mu}}_{y}$ , $\hat{\bm{\sigma}}_{y}$ and $\tilde{\bm{Y}}_{m}^{train}$ being defined analogously.
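The normalisation step can be sketched as follows, assuming the transformation is the usual elementwise standardisation (subtract the estimated mean and divide by the estimated standard deviation along each dimension). The use of the population standard deviation and the function names are assumptions.

```python
import statistics

def fit_normaliser(examples):
    """Estimate the per-dimension mean and standard deviation of the examples."""
    columns = list(zip(*examples))
    mu = [statistics.fmean(c) for c in columns]
    # Population standard deviation (an assumption); a constant column
    # would give zero here and must be handled before dividing.
    sigma = [statistics.pstdev(c) for c in columns]
    return mu, sigma

def normalise(example, mu, sigma):
    """Standardise one example using previously estimated statistics."""
    return [(v - m) / s for v, m, s in zip(example, mu, sigma)]
```

The statistics are estimated once from the training set and then reused unchanged whenever new points are normalised.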

Once the network has been trained on the normalised dataset, we are required to evaluate $\tilde{f}(\bm{x},\bm{y},\bm{\phi})$ , originally defined in Eqn. 8. This is achieved through a simple procedure. Firstly, the normalisation transform is applied to $\bm{x}$ and $\bm{y}$ using the same $\hat{\bm{\mu}}_{y}$ , $\hat{\bm{\sigma}}_{y}$ , $\hat{\bm{\mu}}_{x}$ and $\hat{\bm{\sigma}}_{x}$ values defined in Eqn. 37, yielding $\tilde{\bm{x}}$ and $\tilde{\bm{y}}$ . $\tilde{\bm{x}}$ is then fed through the trained neural network to yield corresponding mixture parameters, allowing us to evaluate the density at $\tilde{\bm{y}}$ , which we denote by $\tilde{g}(\tilde{\bm{x}},\tilde{\bm{y}},\tilde{\bm{\phi}})$ . It should be noted that $\tilde{g}$ does not directly correspond to $\tilde{f}$ , since we have made a change of variables and the volume of the probability density is not preserved under the normalisation transform for $\hat{\bm{\sigma}}_{y}\neq 1$ . Rothfuss et al. [2019] do, however, prove that

[TABLE]

where $\hat{\sigma}_{y}^{(j)}$ is the $j$ -th element of $\hat{\bm{\sigma}}_{y}$ , allowing us to easily calculate the required density.
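As a hedged illustration of this correction, the sketch below assumes the relation takes the standard change-of-variables form for an elementwise rescaling, in which the density on the normalised scale is divided by the product of the elements of $\hat{\bm{\sigma}}_{y}$. It is written in log space, where the division becomes a subtraction; the function name is illustrative.

```python
import math

def log_density_original_scale(log_g_normalised, sigma_y):
    """Convert a log density on the normalised scale to the original scale.

    Assumes f = g / prod_j sigma_y[j], i.e. log f = log g - sum_j log sigma_y[j].
    """
    return log_g_normalised - sum(math.log(s) for s in sigma_y)
```

A one-dimensional check: if $y\sim\mathcal{N}(5,2^{2})$, then the normalised variable is standard normal, and the density of $y$ at $7$ equals the standard normal density at $1$ divided by $2$.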

A.4 Neural Network Architecture

In essence, we have defined a general neural network-based approach to simulation model estimation that is independent of the specific network architecture (number of hidden layers, number of neurons, type of activation functions, and so on) used. Nevertheless, for the sake of completeness, we briefly introduce the (relatively simple) architecture employed in our study, which is used consistently throughout unless stated otherwise.

For the mixture model itself, we set the number of mixture components to be $K=16$, with the associated mixture parameter network consisting of $3$ hidden layers, each with $32$ neurons and ReLU activations. This was trained using the well-known Adam optimiser [Kingma and Ba, 2015] over $12$ epochs (any improvements in the likelihood for subsequent epochs were generally negligible), with a batch size of $512$ and noise regularisation parameters $\eta_{x}=\eta_{y}=0.2$.
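For reference, the reported settings can be collected in a small configuration object. The class and field names below are hypothetical; only the values come from the text. The helper counts the raw outputs the final layer must produce under the parameterisation described earlier (one weight logit per component, plus a mean and a log variance per component and output dimension).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MDNConfig:
    """Hypothetical container for the hyperparameters reported in the text."""
    n_components: int = 16      # K
    n_hidden_layers: int = 3
    hidden_units: int = 32
    activation: str = "relu"
    optimiser: str = "adam"
    epochs: int = 12
    batch_size: int = 512
    eta_x: float = 0.2
    eta_y: float = 0.2

    def n_outputs(self, dim_y):
        """K weight logits plus K * dim_y means and K * dim_y log variances."""
        return self.n_components * (1 + 2 * dim_y)
```

For a univariate output this gives $16\times 3=48$ raw outputs.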

The above architecture, which performed well for all of the estimation tasks conducted, was, perhaps rather surprisingly, the first architecture we considered and was chosen by hand rather than through an automated optimisation procedure. Attempts to improve performance by increasing the number of hidden layers, neurons, and mixture components seemed to have little effect, suggesting that the proposed network is sufficiently expressive to produce high-quality density estimates for our considered set of problems. We suspect that this will likely hold for other models of similar complexity and therefore make the recommendation that our proposed architecture be used as a baseline for future investigations employing this estimation methodology.

For more complex models, however, more expressive networks may be necessary, and in such cases we would suggest that some form of hyperparameter optimisation be carried out. This is beyond the scope of our investigation and we thus leave it to future research.

Appendix B Technical Details of the Employed Sampling Strategy

In this section, we briefly discuss the adaptive Metropolis-Hastings algorithm that has been employed in all of the conducted estimation experiments. Our discussion here is mainly illustrative and positioned in the context of our investigation. The interested reader should therefore refer to the original contribution by Griffin and Walker [2013] for theoretical justifications and a more general discussion.

In essence, the approach is centred on the idea of maintaining a set of samples, $\bm{\theta}_{s}=\left\{\bm{\theta_{s}}^{(1)},\bm{\theta_{s}}^{(2)},\dots,\bm{\theta_{s}}^{(N)}\right\},s=1,2,\dots,S$ , that is updated for a desired number of iterations. Initially, the set consists of samples drawn uniformly from the space of feasible parameter values, $\bm{\Theta}$ , but eventually converges to be distributed according to $p(\bm{\theta}|\bm{X})$ . This is achieved through the construction of an adaptive proposal distribution that is dependent on the current samples, $\bm{\theta}_{s}$ , which can be summarised algorithmically as follows:

1. Sample $\bm{z}$ according to $\tilde{p}\left(\bm{z}\big{|}\bm{\theta_{s}}^{(1)},\bm{\theta_{s}}^{(2)},\dots,\bm{\theta_{s}}^{(N)}\right)$, which is determined by applying KDE to $\bm{\theta_{s}}^{(1)},\bm{\theta_{s}}^{(2)},\dots,\bm{\theta_{s}}^{(N)}$.

2. Propose the switch of $\bm{z}$ with $\bm{\theta_{s}}^{(n)}$, where $n$ is chosen uniformly from $\left\{1,2,\dots,N\right\}$.

3. Accept the switch with probability

[TABLE]

4. If accepted, set $\bm{\theta}_{s+1}=\bm{\theta}_{s}$ with $\bm{\theta_{s}}^{(n)}$ replaced by $\bm{z}$, otherwise simply set $\bm{\theta}_{s+1}=\bm{\theta}_{s}$.
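The four steps can be sketched in plain Python for a one-dimensional parameter. Several choices here are assumptions: the KDE uses Gaussian kernels with a fixed bandwidth `h`, and the default acceptance rule `swap_accept_prob` is the generic Metropolis-Hastings ratio for a swap move on the augmented space, which may differ from the expression used in the paper. The rule is therefore passed in as an argument so that it can be replaced.

```python
import math
import random

def kde_pdf(z, points, h):
    """Gaussian kernel density estimate at z from 1-D points, bandwidth h."""
    norm = len(points) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((z - p) / h) ** 2) for p in points) / norm

def kde_sample(points, h, rng):
    """Draw from the KDE: pick a stored point, then add kernel noise."""
    return rng.choice(points) + rng.gauss(0.0, h)

def swap_accept_prob(z, thetas, n, target_pdf, h):
    """ASSUMED acceptance rule (generic swap-move ratio), not the paper's equation."""
    swapped = thetas[:n] + [z] + thetas[n + 1:]
    num = target_pdf(z) * kde_pdf(thetas[n], swapped, h)
    den = target_pdf(thetas[n]) * kde_pdf(z, thetas, h)
    if num == 0.0:
        return 0.0
    if den == 0.0:
        return 1.0
    return min(1.0, num / den)

def adaptive_mh_step(thetas, target_pdf, h, rng, accept_prob=swap_accept_prob):
    """One update of the sample set; target_pdf may be unnormalised."""
    z = kde_sample(thetas, h, rng)                 # step 1: sample z from the KDE
    n = rng.randrange(len(thetas))                 # step 2: choose the index to switch
    if rng.random() < accept_prob(z, thetas, n, target_pdf, h):  # step 3
        return thetas[:n] + [z] + thetas[n + 1:]   # step 4: accepted, replace entry n
    return list(thetas)                            # step 4: rejected, set unchanged
```

Starting from a set drawn uniformly over the feasible region and calling `adaptive_mh_step` repeatedly yields the sequence of sample sets described above; only one element of the set can change per iteration.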

Repeating the above for $S$ iterations, we obtain a sequence of sample sets that can be used to compute expectations of the form

[TABLE]

In our investigation, we set $S=5000$ and $N=70$ in all cases, with convergence typically observed at some point before $s=1500$, leading us to discard the first $1500$ sets as part of a burn-in period. When constructing the posterior samples, we repeat this entire sampling process $5$ times and collect the obtained sets to form a larger collection of $5\times 3500\times 70=1225000$ samples. (Note that since we only update a single sample during each step, the Monte Carlo variance still decreases at the standard rate of $\frac{1}{\sqrt{S}}$.)
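The pooling described here can be sketched as follows, under the assumption that expectations are approximated by a plain average over every sample in every retained set of every independent run. The nested-list layout `chains[c][s][n]` and the function name are illustrative.

```python
def posterior_expectation(chains, f, burn_in):
    """Average f over all samples in all post-burn-in sets of all runs.

    chains[c][s][n] is sample n of set s in independent run c.
    Returns the estimate together with the number of samples pooled.
    """
    total, count = 0.0, 0
    for sets in chains:
        for sample_set in sets[burn_in:]:
            for theta in sample_set:
                total += f(theta)
                count += 1
    return total / count, count
```

With $5$ runs, $S=5000$, a burn-in of $1500$ and $N=70$, the count returned is $5\times(5000-1500)\times 70=1225000$, matching the figure quoted above.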

Ultimately, this has become our MCMC algorithm of choice for two main reasons:

1. The number of iterations required to reach convergence in random walk Metropolis-Hastings algorithms depends significantly on the initialisation of the algorithm. If, for example, the initial candidate parameter set has a particularly low posterior density, it could take a substantial period of time before convergence is observed. Since the algorithm proposed by Griffin and Walker [2013] is initialised using a sample of points from a number of areas of the parameter space, this problem is less pronounced.

2. Most random walk Metropolis-Hastings algorithms require careful tuning of the proposal distribution, usually with the aim of obtaining an acceptance rate of roughly $25\%$, in order to ensure a good balance between local exploration of high density areas of the parameter space and global coverage of the parameter space as a whole [Robert and Casella, 2010]. This can be difficult to achieve in practice, making an adaptive approach that determines the proposal distribution automatically particularly appealing.

Appendix C Robustness Tests

In Section 5.1, we provided evidence that our proposed estimation procedure demonstrates some robustness relative to the choice of lag length, $L$. Here, we provide a more complete demonstration by repeating all of the previously conducted estimation experiments involving our approach, changing only the lag length, which we have increased to $L=4$. Referring to the summary presented in Table 7, we find that the overall performance of the procedure relative to our chosen benchmark is virtually unchanged, verifying the robustness of our conclusions. (Since there are a total of $27$ individual parameter cases, the percentage shifts correspond to changes in only a single binary relation for both $|\mu_{mdn}^{i}-\theta_{true}^{i}|<|\mu_{kde}^{i}-\theta_{true}^{i}|$ and $\sigma_{mdn}^{i}<\sigma_{kde}^{i}$.)

Bibliography

Alfarano et al. [2005] S. Alfarano, T. Lux, and F. Wagner. Estimation of agent-based models: The case of an asymmetric herding model. Computational Economics, 26(1):19–49, 2005.
Alfarano et al. [2006] S. Alfarano, T. Lux, and F. Wagner. Estimation of a simple agent-based model of financial markets: An application to Australian stock and foreign exchange data. Physica A: Statistical Mechanics and its Applications, 370(1):38–42, 2006.
Alfarano et al. [2007] S. Alfarano, T. Lux, and F. Wagner. Empirical validation of stochastic models of interacting agents. The European Physical Journal B: Condensed Matter and Complex Systems, 55(2):183–187, 2007.
Barde [2016] S. Barde. Direct comparison of agent-based models of herding in financial markets. Journal of Economic Dynamics and Control, 73:326–353, 2016.
Barde [2017] S. Barde. A practical, accurate, information criterion for Nth order Markov processes. Computational Economics, 50:281–324, 2017.
Bishop [1994] C. Bishop. Mixture density networks. Technical report, Aston University, 1994.
Brock and Hommes [1998] W. Brock and C. Hommes. Heterogeneous beliefs and routes to chaos in a simple asset pricing model. Journal of Economic Dynamics and Control, 22(8-9):1235–1274, 1998.
Chen [2003] S. Chen. Agent-based computational macroeconomics: A survey. In T. Terano, H. Deguchi, and K. Takadama, editors, Meeting the Challenge of Social Problems via Agent-Based Simulation, pages 141–170. Springer-Verlag, 2003.