Efficient inference for stochastic differential equation mixed-effects   models using correlated particle pseudo-marginal algorithms

Samuel Wiqvist; Andrew Golightly; Ashleigh T. McLean; Umberto Picchini

arXiv:1907.09851·stat.CO·January 22, 2021·Comput. Stat. Data Anal.

Efficient inference for stochastic differential equation mixed-effects models using correlated particle pseudo-marginal algorithms

Samuel Wiqvist, Andrew Golightly, Ashleigh T. McLean, Umberto Picchini

PDF

1 Repo

TL;DR

This paper introduces an efficient Bayesian inference method for stochastic differential equation mixed-effects models using correlated particle pseudo-marginal algorithms, significantly improving computational speed for complex hierarchical models.

Contribution

It develops a Gibbs sampling framework with correlated pseudo-marginal steps and blocking strategies, enhancing efficiency for SDEMEMs without requiring bespoke analytical derivations.

Findings

01

Achieves approximately tenfold increase in computational efficiency.

02

Demonstrates effectiveness on tumor growth and neuronal data.

03

Applicable to a broad class of SDEMEMs.

Abstract

Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that are able to account for random variability inherent in the underlying time-dynamics, as well as the variability between experimental units and, optionally, account for measurement error. Fully Bayesian inference for state-space SDEMEMs is performed, using data at discrete times that may be incomplete and subject to measurement error. However, the inference problem is complicated by the typical intractability of the observed data likelihood which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameter values of interest. The algorithm is made computationally efficient through careful use of blocking strategies and correlated pseudo-marginal Metropolis-Hastings steps within the Gibbs…

Tables5

Table 1. Table 1: OU SDEMEM. Correlation ρ 𝜌 \rho , number of particles N 𝑁 N , CPU time (in minutes m 𝑚 m ), minimum ESS (mESS), minimum ESS per minute (mESS/m) and relative minimum ESS per minute (Rel.) as compared to PMMH-naive. All results are based on 50 50 50 k iterations of each scheme, and are medians over 5 independent runs of each algorithm on different data sets. We could only produce 5 runs due to the very high computational cost of PMMH.

Algorithm	$ρ$	$N$	CPU (m)	mESS	mESS/m	Rel.
Kalman	-	-	1.23	443.27	357.61	5140.18
PMMH-naive	0	3000	4601.87	229.01	0.05	1.00
PMMH	0	3000	4086.91	232.94	0.06	1.16
CPMMH-099	0.99	100	200.37	234.54	1.17	23.58
CPMMH-0999	0.999	50	110.88	235.63	2.13	41.48

Table 2. Table 2: Tumor model. Correlation ρ 𝜌 \rho , number of particles N 𝑁 N , CPU time (in minutes m 𝑚 m ), minimum ESS (mESS), minimum ESS per minute (mESS/m) and relative minimum ESS per minute (Rel.) as compared to PMMH-naive. All results are based on 500 500 500 k iterations of each scheme.

Algorithm	$ρ$	$N$	CPU (m)	mESS	mESS/m	Rel.
LNA	-	-	1286	3676	2.858	13
PMMH - naive	0	30	3098	665	0.215	1
PMMH	0	30	2963	2559	0.864	4
CPMMH	0.999	10	957	2311	2.415	11

Table 3. Table 3: Tumor model (Euler-Maruyama). Correlation ρ 𝜌 \rho , number of particles N 𝑁 N , CPU time (in minutes m 𝑚 m ), minimum ESS (mESS), minimum ESS per minute (mESS/m) and relative minimum ESS per minute (Rel.) as compared to PMMH-naive. All results are based on 500 500 500 k iterations of each scheme.

Algorithm	$ρ$	$N$	CPU (m)	mESS	mESS/m	Rel.
PMMH - naive	0	30	7947	990	0.123	1
PMMH	0	30	7651	2240	0.293	2.4
CPMMH	0.999	10	1893	2172	1.15	9.2

Table 4. Table 4: Neuronal model. Correlation ρ 𝜌 \rho , number of particles N 𝑁 N , CPU time (in minutes m 𝑚 m ), minimum ESS (mESS), minimum ESS per minute (mESS/m), and relative minimum ESS per minute (Rel.) as compared to PMMH. All results are based on 100 100 100 k iterations of each scheme.

Algorithm	$ρ$	$N$	CPU (m)	mESS	mESS/m	Rel.
Kalman	-	-	56	630	11.30	18.9
PMMH	-	1	479	287	0.6	1.0
CPMMH-09	0.9	1	655	400	0.61	1.0
CPMMH-0999	0.999	1	653	372	0.57	1.0

Table 5. Table 5: Comparing 100 log-likelihood estimations for the bootstrap and bridge filter.

	Log-likelihood	Std. Dev.	Runtime (sec)
Kalman	62091	-	0.012
Bootstrap	-2594152	119905	21.51
Bridge	62291	0.34	27.50

Equations137

d X_{t}^{i} = α (X_{t}^{i}, κ, ϕ^{i}, D^{i}) d t + β (X_{t}^{i}, κ, ϕ^{i}, D^{i}) d W_{t}^{i}, X_{0}^{i} = x_{0}^{i}, i = 1, \dots, M .

d X_{t}^{i} = α (X_{t}^{i}, κ, ϕ^{i}, D^{i}) d t + β (X_{t}^{i}, κ, ϕ^{i}, D^{i}) d W_{t}^{i}, X_{0}^{i} = x_{0}^{i}, i = 1, \dots, M .

Y_{t}^{i} = h (X_{t}^{i}, S^{i}, ϵ_{t}^{i}), ϵ_{t}^{i} ∣ ξ \sim in d e p p_{ϵ} (ξ), i = 1, ..., M

Y_{t}^{i} = h (X_{t}^{i}, S^{i}, ϵ_{t}^{i}), ϵ_{t}^{i} ∣ ξ \sim in d e p p_{ϵ} (ξ), i = 1, ..., M

π (κ, η, ξ, ϕ, x ∣ y) \propto π (κ, η, ξ) π (ϕ ∣ η) π (x ∣ κ, ϕ) π (y ∣ x, ξ)

π (κ, η, ξ, ϕ, x ∣ y) \propto π (κ, η, ξ) π (ϕ ∣ η) π (x ∣ κ, ϕ) π (y ∣ x, ξ)

π (ϕ ∣ η) = i = 1 \prod M π (ϕ^{i} ∣ η),

π (ϕ ∣ η) = i = 1 \prod M π (ϕ^{i} ∣ η),

π (y ∣ x, ξ) = i = 1 \prod M j = 1 \prod n π (y_{j}^{i} ∣ x_{j}^{i}, ξ)

π (y ∣ x, ξ) = i = 1 \prod M j = 1 \prod n π (y_{j}^{i} ∣ x_{j}^{i}, ξ)

π (x ∣ κ, ϕ) = i = 1 \prod M π (x_{1}^{i}) j = 2 \prod n π (x_{j}^{i} ∣ x_{j - 1}^{i}, κ, ϕ^{i}) .

π (x ∣ κ, ϕ) = i = 1 \prod M π (x_{1}^{i}) j = 2 \prod n π (x_{j}^{i} ∣ x_{j - 1}^{i}, κ, ϕ^{i}) .

Δ X_{t}^{i} \equiv X_{t + Δ t}^{i} - X_{t}^{i} = α (X_{t}^{i}, κ, ϕ^{i}) Δ t + β (X_{t}^{i}, κ, ϕ^{i}) Δ W_{t}^{i}

Δ X_{t}^{i} \equiv X_{t + Δ t}^{i} - X_{t}^{i} = α (X_{t}^{i}, κ, ϕ^{i}) Δ t + β (X_{t}^{i}, κ, ϕ^{i}) Δ W_{t}^{i}

X_{t + Δ t}^{i} = X_{t}^{i} + α (X_{t}^{i}, κ, ϕ^{i}) Δ t + β (X_{t}^{i}, κ, ϕ^{i}) Δ W_{t}^{i}

X_{t + Δ t}^{i} = X_{t}^{i} + α (X_{t}^{i}, κ, ϕ^{i}) Δ t + β (X_{t}^{i}, κ, ϕ^{i}) Δ W_{t}^{i}

π (κ, η, ξ, ϕ ∣ y)

π (κ, η, ξ, ϕ ∣ y)

\propto π (κ) π (η) π (ξ) i = 1 \prod M π (ϕ^{i} ∣ η) π (y^{i} ∣ κ, ξ, ϕ^{i}) .

\overset{π}{^}_{u} (y ∣ κ, ξ, ϕ) = i = 1 \prod M \overset{π}{^}_{u^{i}} (y^{i} ∣ κ, ξ, ϕ^{i})

\overset{π}{^}_{u} (y ∣ κ, ξ, ϕ) = i = 1 \prod M \overset{π}{^}_{u^{i}} (y^{i} ∣ κ, ξ, ϕ^{i})

π (κ, η, ξ, ϕ, u ∣ y)

π (κ, η, ξ, ϕ, u ∣ y)

\int π (κ, η, ξ, ϕ, u ∣ y) d u

\int π (κ, η, ξ, ϕ, u ∣ y) d u

\propto π (κ, η, ξ, ϕ ∣ y) .

π (ϕ, u ∣ κ, η, ξ, y)

π (ϕ, u ∣ κ, η, ξ, y)

min {1, \frac{π ( ϕ ^{i *} ∣ \cdot )}{π ( ϕ ^{i} ∣ \cdot )} \times \frac{π ^ _{u^{i *}} ( y ^{i} ∣ ϕ ^{i *} , \cdot )}{π ^ _{u^{i}} ( y ^{i} ∣ ϕ ^{i} , \cdot )} \times \frac{q ( ϕ ^{i} ∣ ϕ ^{i *} )}{q ( ϕ ^{i *} ∣ ϕ ^{i} )}} .

min {1, \frac{π ( ϕ ^{i *} ∣ \cdot )}{π ( ϕ ^{i} ∣ \cdot )} \times \frac{π ^ _{u^{i *}} ( y ^{i} ∣ ϕ ^{i *} , \cdot )}{π ^ _{u^{i}} ( y ^{i} ∣ ϕ ^{i} , \cdot )} \times \frac{q ( ϕ ^{i} ∣ ϕ ^{i *} )}{q ( ϕ ^{i *} ∣ ϕ ^{i} )}} .

\overset{π}{^} (x_{t}^{i} ∣ y_{1 : t}^{i}, κ, ξ, ϕ^{i}) \propto π (y_{t}^{i} ∣ x_{t}^{i}, ξ) k = 1 \sum N_{i} π (x_{t}^{i} ∣ x_{t - 1, k}^{i}, κ, ϕ^{i}) w (u_{t - 1, k}^{i}) .

\overset{π}{^} (x_{t}^{i} ∣ y_{1 : t}^{i}, κ, ξ, ϕ^{i}) \propto π (y_{t}^{i} ∣ x_{t}^{i}, ξ) k = 1 \sum N_{i} π (x_{t}^{i} ∣ x_{t - 1, k}^{i}, κ, ϕ^{i}) w (u_{t - 1, k}^{i}) .

f_{t} (u_{t, k}^{i}) = x_{t - 1, k}^{i} + α (x_{t - 1, k}^{i}, κ, ϕ^{i}) Δ t + β (x_{t - 1, k}^{i}, κ, ϕ^{i}) Δ t \times u_{t, k}^{i}

f_{t} (u_{t, k}^{i}) = x_{t - 1, k}^{i} + α (x_{t - 1, k}^{i}, κ, ϕ^{i}) Δ t + β (x_{t - 1, k}^{i}, κ, ϕ^{i}) Δ t \times u_{t, k}^{i}

\tilde{w} (u_{1, k}^{i}) = π (y_{1}^{i} ∣ x_{1, k}^{i}, ξ), w (u_{1, k}^{i}) = \frac{w ~ ( u _{1, k}^{i} )}{\sum _{j = 1}^{N_{i}} w ~ ( u _{1, j}^{i} )} .

\tilde{w} (u_{1, k}^{i}) = π (y_{1}^{i} ∣ x_{1, k}^{i}, ξ), w (u_{1, k}^{i}) = \frac{w ~ ( u _{1, k}^{i} )}{\sum _{j = 1}^{N_{i}} w ~ ( u _{1, j}^{i} )} .

\tilde{w} (u_{t, k}^{i}) = π (y_{t}^{i} ∣ x_{t, k}^{i}, ξ), w (u_{t, k}^{i}) = \frac{w ~ ( u _{t, k}^{i} )}{\sum _{j = 1}^{N_{i}} w ~ ( u _{t, j}^{i} )} .

\tilde{w} (u_{t, k}^{i}) = π (y_{t}^{i} ∣ x_{t, k}^{i}, ξ), w (u_{t, k}^{i}) = \frac{w ~ ( u _{t, k}^{i} )}{\sum _{j = 1}^{N_{i}} w ~ ( u _{t, j}^{i} )} .

\overset{π}{^}_{u_{1 : t}^{i}} (y_{1 : t}^{i} ∣ κ, ξ, ϕ^{i}) = \overset{π}{^}_{u_{1 : t - 1}^{i}} (y_{1 : t - 1}^{i} ∣ κ, ξ, ϕ^{i}) \overset{π}{^}_{u_{t}^{i}} (y_{t}^{i} ∣ y_{1 : t - 1}^{i}, κ, ξ, ϕ^{i})

\overset{π}{^}_{u_{1 : t}^{i}} (y_{1 : t}^{i} ∣ κ, ξ, ϕ^{i}) = \overset{π}{^}_{u_{1 : t - 1}^{i}} (y_{1 : t - 1}^{i} ∣ κ, ξ, ϕ^{i}) \overset{π}{^}_{u_{t}^{i}} (y_{t}^{i} ∣ y_{1 : t - 1}^{i}, κ, ξ, ϕ^{i})

\overset{π}{^}_{u^{i}} (y^{i} ∣ κ, ξ, ϕ^{i}) = N_{i}^{- n} t = 1 \prod n k = 1 \sum N_{i} \tilde{w} (u_{t, k}^{i}) .

\overset{π}{^}_{u^{i}} (y^{i} ∣ κ, ξ, ϕ^{i}) = N_{i}^{- n} t = 1 \prod n k = 1 \sum N_{i} \tilde{w} (u_{t, k}^{i}) .

min {1, \frac{π ( ϕ ^{i *} ∣ η )}{π ( ϕ ^{i, (j - 1)} ∣ η )} \times \frac{π ^ _{u^{i *}} ( y ^{i} ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} , ϕ ^{i *} )}{π ^ _{u^{i, (j - 1)}} ( y ^{i} ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} , ϕ ^{i, (j - 1)} )} \times \frac{q ( ϕ ^{i, (j - 1)} ∣ ϕ ^{i *} )}{q ( ϕ ^{i *} ∣ ϕ ^{i, (j - 1)} )}}

min {1, \frac{π ( ϕ ^{i *} ∣ η )}{π ( ϕ ^{i, (j - 1)} ∣ η )} \times \frac{π ^ _{u^{i *}} ( y ^{i} ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} , ϕ ^{i *} )}{π ^ _{u^{i, (j - 1)}} ( y ^{i} ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} , ϕ ^{i, (j - 1)} )} \times \frac{q ( ϕ ^{i, (j - 1)} ∣ ϕ ^{i *} )}{q ( ϕ ^{i *} ∣ ϕ ^{i, (j - 1)} )}}

min {1, \frac{π ( κ ^{*} ) π ( ξ ^{*} )}{π ( κ ^{(j - 1)} ) π ( ξ ^{(j - 1)} )} \times \frac{π ^ _{u^{(j)}} ( y ∣ κ ^{*} , ξ ^{*} , ϕ ^{(j)} )}{π ^ _{u^{(j)}} ( y ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} , ϕ ^{(j)} )} \times \frac{q ( κ ^{(j - 1)} , ξ ^{(j - 1)} ∣ κ ^{*} , ξ ^{*} )}{q ( κ ^{*} , ξ ^{*} ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} )}}

min {1, \frac{π ( κ ^{*} ) π ( ξ ^{*} )}{π ( κ ^{(j - 1)} ) π ( ξ ^{(j - 1)} )} \times \frac{π ^ _{u^{(j)}} ( y ∣ κ ^{*} , ξ ^{*} , ϕ ^{(j)} )}{π ^ _{u^{(j)}} ( y ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} , ϕ ^{(j)} )} \times \frac{q ( κ ^{(j - 1)} , ξ ^{(j - 1)} ∣ κ ^{*} , ξ ^{*} )}{q ( κ ^{*} , ξ ^{*} ∣ κ ^{(j - 1)} , ξ ^{(j - 1)} )}}

g (u^{i}) K (u^{i *} ∣ u^{i}) = g (u^{i *}) K (u^{i} ∣ u^{i *}) .

g (u^{i}) K (u^{i *} ∣ u^{i}) = g (u^{i *}) K (u^{i} ∣ u^{i *}) .

g (u^{i}) = N (u^{i}; 0, I_{d}) and K (u^{i *} ∣ u^{i}) = N (u^{i *}; ρ u^{i}, (1 - ρ^{2}) I_{d})

g (u^{i}) = N (u^{i}; 0, I_{d}) and K (u^{i *} ∣ u^{i}) = N (u^{i *}; ρ u^{i}, (1 - ρ^{2}) I_{d})

\displaystyle\left\{\begin{array}[]{ll}Y^{i}_{t}&=X_{t}^{i}+\epsilon_{t}^{i},\quad\epsilon^{i}_{t}\overset{indep}{\sim}\textrm{N}(0,\sigma_{\epsilon}^{2}),\qquad i=1,...,M\\ dX^{i}_{t}&=\theta_{1}^{i}(\theta^{i}_{2}-X_{t}^{i})dt+\theta_{3}^{i}dW_{t}^{i}.\end{array}\right.

\displaystyle\left\{\begin{array}[]{ll}Y^{i}_{t}&=X_{t}^{i}+\epsilon_{t}^{i},\quad\epsilon^{i}_{t}\overset{indep}{\sim}\textrm{N}(0,\sigma_{\epsilon}^{2}),\qquad i=1,...,M\\ dX^{i}_{t}&=\theta_{1}^{i}(\theta^{i}_{2}-X_{t}^{i})dt+\theta_{3}^{i}dW_{t}^{i}.\end{array}\right.

ϕ_{j}^{i} ∣ η \sim in d e p N (μ_{j}, τ_{j}^{- 1}), j = 1, 2, 3

ϕ_{j}^{i} ∣ η \sim in d e p N (μ_{j}, τ_{j}^{- 1}), j = 1, 2, 3

X_{t + Δ t}^{i} = θ_{2}^{i} + (X_{t}^{i} - θ_{2}^{i}) e^{- θ_{1}^{i} Δ t} + \frac{θ _{3}^{i^{2}}}{2 θ _{1}^{i}} (1 - e^{- 2 θ_{1}^{i} Δ t}) \times u_{t}^{i},

X_{t + Δ t}^{i} = θ_{2}^{i} + (X_{t}^{i} - θ_{2}^{i}) e^{- θ_{1}^{i} Δ t} + \frac{θ _{3}^{i^{2}}}{2 θ _{1}^{i}} (1 - e^{- 2 θ_{1}^{i} Δ t}) \times u_{t}^{i},

{μ_{j} ∣ τ_{j} \sim in d e p N (μ_{0_{j}}, M_{0_{j}} τ_{j}), j = 1, 2, 3, τ_{j} \sim in d e p G a (α_{j}, β_{j}),

{μ_{j} ∣ τ_{j} \sim in d e p N (μ_{0_{j}}, M_{0_{j}} τ_{j}), j = 1, 2, 3, τ_{j} \sim in d e p G a (α_{j}, β_{j}),

(μ_{0_{1}}, M_{0_{1}}, α_{1}, β_{1})

(μ_{0_{1}}, M_{0_{1}}, α_{1}, β_{1})

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

SamuelWiqvist/efficient_SDEMEM
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Efficient inference for stochastic differential equation mixed-effects models using correlated particle pseudo-marginal algorithms

Samuel Wiqvist

[email protected]

Andrew Golightly

Ashleigh T. McLean

Umberto Picchini

Centre for Mathematical Sciences, Lund University, Sweden

School of Mathematics, Statistics and Physics, Newcastle University, UK

Mathematical Sciences, Chalmers University of Technology and the University of Gothenburg, Sweden

Abstract

Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that are able to account for random variability inherent in the underlying time-dynamics, as well as the variability between experimental units and, optionally, account for measurement error. Fully Bayesian inference for state-space SDEMEMs is performed, using data at discrete times that may be incomplete and subject to measurement error. However, the inference problem is complicated by the typical intractability of the observed data likelihood which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameter values of interest. The algorithm is made computationally efficient through careful use of blocking strategies and correlated pseudo-marginal Metropolis-Hastings steps within the Gibbs scheme. The resulting methodology is flexible and is able to deal with a large class of SDEMEMs. The methodology is demonstrated on three case studies, including tumor growth dynamics and neuronal data. The gains in terms of increased computational efficiency are model and data dependent, but unless bespoke sampling strategies requiring analytical derivations are possible for a given model, we generally observe an efficiency increase of one order of magnitude when using correlated particle methods together with our blocked-Gibbs strategy.

keywords:

Bayesian inference; random effects; sequential Monte Carlo; state-space model

††journal: Computational Statistics & Data Analysis

1 Introduction

Stochastic differential equations (SDEs) are arguably the most used and studied stochastic dynamic models. SDEs allow the representation of stochastic time-dynamics, and are ubiquitous in applied research, most notably in finance [1], systems biology [2], pharmacokinetic/pharmacodynamic modelling [3] and neuronal modelling. SDEs extend the possibilities offered by ordinary differential equations (ODEs), by allowing random dynamics. As such, they can in principle replace ODEs in practical applications, to offer a richer mathematical representation for complex phenomena that are intrinsically non-deterministic. However, in practice switching from ODEs to SDEs is usually far from trivial, due to the absence of closed form solutions to SDEs (except for the simplest toy problems), implying the need for numerical approximation procedures [4]. Numerical approximation schemes, while useful for simulation purposes, considerably complicate statistical inference for model parameters. For reviews of inference strategies for SDE models, see e.g. [5] (including Bayesian approaches) and [6] (classical approaches). Generally, in the non-Bayesian framework, the literature for parametric inference approaches for SDEs is vast, however there is no inference procedure that is applicable to general nonlinear SDEs and that is also easy to implement on a computer. This is due to the lack of explicit transition densities for most SDE models. The problem is particularly difficult for measurements that are observed without error, i.e. Markovian observations. On the other hand, the Bayesian literature offers powerful solutions to the inference problem, when observations arise from state-space models. In our case, this means that if we assume that observations are observed with error, and that the latent process is a Markov process, then the literature based on sequential Monte Carlo (particle filters) is readily available in the form of pseudo-marginal methods [7], and closely related particle MCMC methods [8], which we introduce in Section 4.

Our goal is to produce novel Gibbs samplers embedding special types of pseudo-marginal algorithms allowing for exact Bayesian inference in a specific class of state-space SDE models. In this paper, we consider “repeated measurement experiments”, modeled via mixed-effects, where the dynamics are Markov processes expressed via stochastic differential equations. These dynamics are assumed directly unobservable, i.e. are only observable up to measurement error. The practical goal is to fit observations pertaining to several “units” (i.e. independent experiments, such as measurements on different subjects) simultaneously, by formulating a state-space model having parameters randomly varying between the several individuals. The resulting model is typically referred to as a stochastic differential equation mixed-effects model (SDEMEM). SDEMEMs are interesting because, in addition to explaining intrinsic stochasticity in the time-dynamics, they also take into account random variation between experimental units. The latter variation permits the understanding of between-subjects variability within a population. When considered in a state-space model, these two types of variability (population variation and intrinsic stochasticity) are separated from the third source of variation, namely residual variation (measurement error). Thanks to their generality, and the ability to separate the three levels of variation, SDEMEMs have attracted attention, see e.g. [9] for a review and [10] for a more recent account. See also section 2 for a discussion on previous literature.

In the present work, we mainly focus on a general, plug-and-play approach for exact Bayesian inference in SDEMEMs, meaning that analytic calculations are not necessary thanks to the flexibility of the underlying sequential Monte Carlo (SMC) algorithms. We also describe a non plug-and-play approach to handle specific situations. As in [11], our random effects and measurement error can have arbitrary distributions, provided that the measurement error density can be evaluated point-wise. Unlike [11], we use a Gibbs sampler to target the marginal parameter posterior. Subject specific, common and random effect population parameters are updated in separate blocks, with pseudo-marginal Metropolis-Hastings (PMMH) steps used to update the subject specific and common parameters, and Metropolis-Hastings (MH) steps used to update the random effect population parameters. We believe that, to date, our work results in the most general plug-and-play approach to inference for state-space SDEMEMs (a similar method has been concurrently and independently introduced (July 25 2019 on arXiv), in [12]; see the discussion in Section 6). However, the price to pay for such generality is that the use of pseudo-marginal methods guided by SMC algorithms is computationally consuming. In order to make pseudo-marginal methods scale better as the number of observations is increased, we exploit recent advances based on correlated PMMH (CPMMH). We combine CPMMH with a novel blocking strategy and show that it is possible to reduce considerably the number of required particles, and hence reduce the computational requirements for exact Bayesian inference. In our experiments, unless specific models admit bespoke efficient sampling strategies (e.g. Section 5.3 where it was possible to implement an advanced particle filter), CPMMH based algorithms with our novel blocking strategy are one order of magnitude more efficient than standard PMMH. Occasionally we even observed a 40-fold increase in efficiency, as in Section 5.1.

The remainder of this paper is organized as follows. Background literature is discussed in Section 2. Stochastic differential mixed-effects models and the inference task are introduced in Section 3. Our proposed approach to inference is described in Section 4. Applications are considered in Section 5, including a simulation study considering an Ornstein-Uhlenbeck SDEMEM, a tumor-growth model and finally a challenging neuronal data case-study. A discussion is in Section 6. Julia and R codes can be found at https://github.com/SamuelWiqvist/efficient_SDEMEM.

2 Background literature

Here we rapidly review key papers on inference for SDEMEMs, and refer the reader to https://umbertopicchini.github.io/sdemem/ for a comprehensive list of publications. Early attempts at inference for SDEMEMs use methodology borrowed from standard (deterministic) nonlinear mixed-effects literature such as FOCE (first order conditional estimation) combined with the extended Kalman filter, as in [13]. This approach can only deal with SDEMEMs having a constant diffusion coefficient, see instead [14] for an extension to state-dependent diffusion coefficients. The resulting inference in [13] is approximate maximum likelihood estimation, and no uncertainty quantification is given. Moreover, only Gaussian random effects are allowed and measurement error is also assumed Gaussian. Other maximum likelihood approaches are in [15] and [16], where a closed-form series expansion for the unknown transition density is found using the method in [17], however the methodology can only be applied to reducible multivariate diffusions without measurement error. [18] discuss inference for SDEMEMs in a Bayesian framework. They implement a Gibbs sampler when the SDE (for each subject) has an explicit solution, and consider Gaussian random effects and Gaussian measurement error. When no explicit solution exists, they approximate the diffusion process using the Euler-Maruyama approximation. The approach of [19] is of particular interest, since it is the first attempt to employ particle filters for inference in SDEMEMs: they construct an exact maximum likelihood strategy based on stochastic approximation EM (SAEM), where latent trajectories are “proposed” via particle Markov chain Monte Carlo. The major problem with using SAEM is the need for sufficient summary statistics for the “complete likelihood”, which makes the methodology essentially impractical for arbitrarily complex models. [20] also use SAEM, but they avoid the need for the (usually unavailable) summary statistics for the complete likelihood, and propose trajectories using the extended Kalman filter instead of particle MCMC. Unlike in [19], the inference in [20] is approximate and measurement error and random effects are required to be Gaussian. [21] analyze multivariate diffusions under the conditions that the random effects are Gaussian distributed and that both fixed parameters and random effects enter linearly in the SDE. [22] work with the Euler-Maruyama approximation and adopt a data augmentation approach to integrate over the uncertainty associated with the latent diffusion process, by employing carefully designed bridge constructs inside a Gibbs sampler. A linear noise approximation (LNA) is also considered. However, the limitations are that the observation equation has to be a linear combination of the latent states and measurement error has to be Gaussian. In addition, producing the bridge construct in the data augmentation approach or the LNA-based likelihood requires some careful analytic derivations. Consequently, neither approach can be regarded as a plug-and-play method (that is, a method that only requires forward simulation and evaluation of the measurement error density). In [11], approximate and exact Bayesian approaches for a tumor growth study were considered: the approximate approach was based on synthetic likelihoods [23, 24], where summary statistics of the data are used for the inference, while exact inference used pseudo-marginal methodology via an auxiliary particle filter, which is suited to target measurements observed with a small error. It was found that using a particle approach to integrate out the random effects was very time consuming. Even though the data set was small (comprising 5-8 subjects to fit, depending on the experimental group, and around 10 observations per subject), the number of particles required to approximate each individual likelihood was in the order of thousands. This is very time consuming when the number of “subjects” (denoted $M$ in the rest of this work) increases.

3 Stochastic differential mixed-effects models

Consider the case where we have $M$ experimental units randomly chosen from a theoretical population. Our goal is to perform inference based on simultaneously fitting all data from the $M$ units. Now assume that the experiment we are analyzing consists in observing a stochastically evolving dynamic process, and that associated with each unit $i$ is a continuous-time $d$ -dimensional Itô process $\{X_{t}^{i},t\geq 0\}$ governed by the SDE

[TABLE]

Here, $\alpha$ is a $d$ -vector of drift functions, the diffusion coefficient $\beta$ is a $d\times d$ positive definite matrix with a square root representation $\sqrt{\beta}$ such that $\sqrt{\beta}\sqrt{\beta}^{T}=\beta$ , $W_{t}^{i}$ is a $d$ -vector of (uncorrelated) standard Brownian motion processes and $D^{i}$ are unit-specific static or time-dependent deterministic input (e.g. covariates, forcing functions), see e.g. [14]. The $p$ -vector parameter $\kappa=(\kappa_{1},\ldots,\kappa_{p})^{T}$ is common to all units whereas the $q$ -vectors $\phi^{i}=(\phi_{1}^{i},\ldots,\phi_{q}^{i})^{T}$ , $i=~{}1,\ldots,M$ , are unit-specific random effects. In the most general random effects scenario we let $\pi(\phi^{i}|\eta)$ denote the joint distribution of $\phi^{i}$ , parameterised by the $r$ -vector $\eta=(\eta_{1},\ldots,\eta_{r})^{T}$ . The model defined by (1) allows for differences between experimental units through different realizations of the Brownian motion paths $W_{t}^{i}$ and the random effects $\phi^{i}$ , accounting for inherent stochasticity within a unit, and variation between experimental units respectively.

We assume that each experimental unit $\{X_{t}^{i},t\geq 0\}$ cannot be observed exactly, but observations $y^{i}=(y_{1}^{i},\ldots,y_{n}^{i})^{T}$ are available. Without loss of generality, we assume units are observed at the same integer time points $\{1,2,...,n\}$ , that is in the following we write $n$ instead of, say, $n_{i}$ for all $i$ . However this is only for convenience of notation, and we could easily accommodate the possibility of different units $i$ having different values $n_{i}$ and that, in turn, units are observed at different sets of times. The observations are assumed conditionally independent (given the latent process) and we link them to the latent process via

[TABLE]

where $Y_{t}^{i}$ is a $d_{o}$ -vector, $\epsilon_{t}$ is a random $d_{o}$ -vector, $d_{o}\leq d$ , $\epsilon_{t}^{i}$ is the measurement noise, $S^{i}$ is (as $D^{i}$ ) a unit-specific deterministic input, and $h(\cdot)$ is a possibly nonlinear function of its arguments. In the applications in Section 5 we have $D^{i}=S^{i}=\emptyset$ , the empty set, for every $i$ , and hence for simplicity of notation we disregard $D^{i}$ and $S^{i}$ in the rest of the paper. However having non-empty sets does not introduce any additional complication to our methodology. Notice, the possibility to have $d_{0}<d$ implies that we may have some coordinate of the $\{X_{t}^{i}\}$ system that is unobserved at some (or all) $t$ . We denote the density linking $Y_{t}^{i}$ and $X_{t}^{i}$ by $\pi(y_{t}^{i}|x_{t}^{i},\xi)$ . An important special case that arises from our flexible observation model is when $h(X_{t}^{i},\epsilon_{t}^{i})=F^{T}X_{t}^{i}+\epsilon_{t}^{i}$ for a constant matrix $F$ and $\epsilon_{t}^{i}|\Sigma\overset{indep}{\sim}N(0,\Sigma)$ , allowing for observation of a linear combination of components of $X_{t}^{i}$ , subject to additive Gaussian noise. Notice that our methodology in Sections 3.1–4.4 can be applied to an arbitrary $h(\cdot)$ , provided this can be evaluated pointwise for any value of its arguments. For example, in Section 5.2 we have that $h(\cdot)$ is the logarithm of the sum of the components of a bivariate $X_{t}^{i}$ .

We refer to the model constituted by the system (1)-(2) as a SDEMEM. This is a state-space model, due to the Markov property of the Itô processes $\{X_{t}^{i},t\geq 0\}$ , and the assumption of conditional independence of observations on latent processes. The model is flexible: equation (1) explains the intrinsic stochasticity in the dynamics (via $\beta$ ) and the variation between-units (via the random effects $\phi^{i}$ ), while (2) explains residual variation (measurement error, via $\xi$ ).

3.1 Bayesian inference

Denote with $x=(x^{1},\ldots,x^{M})^{T}$ the set of unobserved states collected across all $M$ diffusion processes $\{X_{t}^{i}\}$ at the same set of integer times $\{1,2,...,n\}$ as for data $y=(y^{1},\ldots,y^{M})^{T}$ . Then given data $y=(y^{1},\ldots,y^{M})^{T}$ , latent values $x$ , the joint posterior for the common parameters $\kappa$ , fixed/random effects $\phi=(\phi^{1},\ldots,\phi^{M})^{T}$ , hyperparameters $\eta$ and measurement error parameters $\xi$ is

[TABLE]

where $\pi(\kappa,\eta,\xi)$ is the joint prior density ascribed to $\kappa$ , $\eta$ and $\xi$ . These three parameters may be assumed a priori independent, and then we can write $\pi(\kappa,\eta,\xi)=\pi(\kappa)\pi(\eta)\pi(\xi)$ , though this needs not be the case and we can easily assume a priori correlated parameters. In addition we have that

[TABLE]

and

[TABLE]

Note that $\pi(x^{i}_{j}|x^{i}_{j-1},\kappa,\phi^{i})$ will be typically intractable. In this case, we assume that it is possible to generate draws (up to arbitrary accuracy) from $\pi(x^{i}_{j}|x^{i}_{j-1},\kappa,\phi^{i})$ using a suitable numerical approximation. For example, the Euler-Maruyama approximation of (1) is

[TABLE]

and therefore

[TABLE]

where $\Delta W_{t}^{i}\sim N(0,I_{d}\Delta t)$ and the time-step $\Delta t$ , which need not be the inter-observation time, is chosen by the practitioner to balance accuracy and efficiency.

In what follows, we assume that interest lies in the marginal posterior for all parameters, given by $\pi(\kappa,\eta,\xi,\phi|y)=\int\pi(\kappa,\eta,\xi,\phi,x|y)dx$ , where

[TABLE]

This factorization suggests a Gibbs sampler with separate blocks for each parameter vector that sequentially takes draws from the full conditionals

$\pi(\phi|\kappa,\eta,\xi,y)\propto\prod_{i=1}^{M}\pi(\phi^{i}|\eta)\pi(y^{i}|\kappa,\xi,\phi^{i})$ , 2. 2.

$\pi(\kappa|\eta,\xi,\phi,y)=\pi(\kappa|\phi,\xi,y)\propto\pi(\kappa)\prod_{i=1}^{M}\pi(y^{i}|\kappa,\xi,\phi^{i})$ , 3. 3.

$\pi(\xi|\kappa,\eta,\phi,y)=\pi(\xi|\kappa,\phi,y)\propto\pi(\xi)\prod_{i=1}^{M}\pi(y^{i}|\kappa,\xi,\phi^{i})$ , 4. 4.

$\pi(\eta|\kappa,\xi,\phi,y)=\pi(\eta|\phi)\propto\pi(\eta)\prod_{i=1}^{M}\pi(\phi^{i}|\eta)$ .

Of course, in practice, the observed individual data likelihood $\pi(y^{i}|\kappa,\xi,\phi^{i})=\int p(y^{i},x^{i}|\kappa,\xi,\phi^{i})dx^{i}$ will be intractable. In what follows, therefore, we consider a Metropolis-within-Gibbs strategy, and in particular introduce auxiliary variables $u$ to allow pseudo-marginal Metropolis-Hastings updates.

4 A pseudo-marginal approach

Consider again the intractable target in (8) and suppose that we can unbiasedly estimate the intractable observed data likelihood $\pi(y|\kappa,\xi,\phi)=\int p(y,x|\kappa,\xi,\phi)dx$ . To this end let

[TABLE]

denote a (non-negative) unbiased estimator of $\pi(y|\kappa,\xi,\phi)$ , where $u=(u^{1},\ldots,u^{M})^{T}$ is the collection of auxiliary (vector) variables used to produce the corresponding estimate, with density $\pi(u)=\prod_{i=1}^{M}g(u^{i})$ . In the context of inference for SDEs, the $u$ may be the collection of pseudo-random standard Gaussian draws, these being necessary to simulate increments of the Brownian motion paths when implementing a numerical scheme such as Euler-Maruyama (Section 4.2), or produce draws from transition densities (in the rare instances when these are known). Notice in fact that the $u$ need not have a specific distribution, though in stochastic simulation we need access to pseudo-random variates that are often uniform or Gaussian distributed [25]. When inference methods use particle filters, pseudo-random variates are also employed in the resampling step, and hence these variates can be included into $u$ .

Now, the pseudo-marginal Metropolis-Hastings (PMMH) scheme targets

[TABLE]

for which it is easily checked that

[TABLE]

Hence, marginalising out $u$ gives the marginal parameter posterior in (8). Directly targeting the high dimensional posterior $\pi(\kappa,\eta,\xi,\phi,u|y)$ with PMMH is likely to give very small acceptance rates. The structure of the SDMEM naturally admits a Gibbs sampling strategy. We outline our novel Gibbs samplers in the next section.

4.1 Gibbs sampling and blocking strategies

The form of (10) immediately suggests a Gibbs sampler that sequentially takes draws from the full conditionals. However, we can design two types of Gibbs samplers. Our first, novel strategy is denoted “naive Gibbs”, where the $u^{i}$ are updated with both the subject specific and common parameters.

Naive Gibbs:

$\pi(\phi^{i},u^{i}|\kappa,\eta,\xi,y^{i})\propto\pi(\phi^{i}|\eta)\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})g(u^{i})$ , $i=1,\ldots,M$ , 2. 2.

$\pi(\kappa,u|\eta,\xi,\phi,y,u)=\pi(\kappa,u|\phi,\xi,y)\propto\pi(\kappa)\prod_{i=1}^{M}\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})g(u^{i})$ , 3. 3.

$\pi(\xi,u|\kappa,\eta,\phi,y,u)=\pi(\xi,u|\kappa,\phi,y)\propto\pi(\xi)\prod_{i=1}^{M}\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})g(u^{i})$ , 4. 4.

$\pi(\eta|\kappa,\xi,\phi,y,u)=\pi(\eta|\phi)\propto\pi(\eta)\prod_{i=1}^{M}\pi(\phi^{i}|\eta)$ .

Note that step 1 consists of a set of draws of $M$ conditionally independent random variables since

[TABLE]

Hence, step 1 gives a sample from $\pi(\phi,u|\kappa,\eta,\xi,y)$ . Draws from the full conditionals in 1-3 can be obtained by using Metropolis-Hastings within Gibbs. Taking the $[\phi^{i},u^{i}]$ block as an example, we use a proposal density of the form $q(\phi^{i*}|\phi^{i})g(u^{i*})$ and accept a move from $[\phi^{i},u^{i}]$ to $[\phi^{i*},u^{i*}]$ with probability

[TABLE]

Effectively, samples from the full conditionals in 1-3 are obtained via draws from pseudo-marginal MH kernels.

The above strategy is somewhat naive, since the auxiliary variables $u$ need only be updated once per Gibbs iteration, instead in steps 1 to 3 of the naive Gibbs procedure vectors $u^{i}$ are simulated anew in each of the three steps (notice $g(u^{i})$ appears in each of the first three steps). We therefore propose to update the blocks $[\phi^{i},u^{i}]$ , $i=1,\ldots,M$ in step 1 only, and condition on the most recent value of $u$ in the remaining steps. We call this second, novel strategy “blocked Gibbs”.

Blocked Gibbs:

$\pi(\phi^{i},u^{i}|\kappa,\eta,\xi,y^{i})\propto\pi(\phi^{i}|\eta)\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})g(u^{i})$ , $i=1,\ldots,M$ , 2. 2.

$\pi(\kappa|\eta,\xi,\phi,y,u)=\pi(\kappa|\phi,\xi,y,u)\propto\pi(\kappa)\prod_{i=1}^{M}\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})$ , 3. 3.

$\pi(\xi|\kappa,\eta,\phi,y,u)=\pi(\xi|\kappa,\phi,y,u)\propto\pi(\xi)\prod_{i=1}^{M}\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})$ , 4. 4.

$\pi(\eta|\kappa,\xi,\phi,y,u)=\pi(\eta|\phi)\propto\pi(\eta)\prod_{i=1}^{M}\pi(\phi^{i}|\eta)$ .

The aim of blocking in this way is to reduce the variance of the acceptance probability associated with steps 2 and 3, which involve the product of $M$ estimates as opposed to a single estimate in each constituent part of step 1. Also, notice $g(u^{i})$ appears only in the first step. The effect of blocking in this way is explored empirically in Section 5.

4.2 Estimating the likelihood

It remains that we can generate non-negative unbiased estimates $\hat{\pi}_{u}(y|\kappa,\xi,\phi)$ . This can be achieved by running a sequential Monte Carlo procedure, also known as particle filter. The simplest approach is to use the bootstrap particle filter [26, 27] (see also [28]) that, for a single experimental unit, recursively draws from the filtering distribution $\pi(x_{t}^{i}|y_{1:t}^{i},\kappa,\xi,\phi^{i})$ for each $t=1,\ldots,n$ . Here, $y_{1:t}^{i}$ denotes the observations of experiment $i$ for time-steps $1,\dots,t$ . Essentially, a sequence of importance sampling and resampling steps are used to propagate a weighted sample $\{(x_{t,k}^{i},w(u_{t,k}^{i})),k=1,\ldots,N_{i}\}$ from the filtering distribution, where $N_{i}$ is the number of particles for unit $i$ . Note that we let the weight depend explicitly on the $t$ -th component of the auxiliary variable $u^{i}=(u_{1}^{i},\ldots,u_{n}^{i})$ , associated with experimental unit $i$ . At time $t$ , the particle filter uses the approximation

[TABLE]

A simple importance sampling/resampling strategy follows, where particles are resampled (with replacement) in proportion to their weights, propagated via $x_{t,k}^{i}=f_{t}(u_{t,k}^{i})\sim\pi(\cdot|x_{t-1,k}^{i},\kappa,\phi^{i})$ and reweighted by $p(y_{t}^{i}|x_{t,k}^{i},\xi)$ . Here, $f_{t}(\cdot)$ is a deterministic function of $u_{t,k}^{i}$ (as well as the parameters and previous latent state, suppressed for simplicity) that gives an explicit connection between the particles and auxiliary variables. An example of $f_{t}(\cdot)$ is to take the Euler-Maruyama approximation

[TABLE]

where $u_{t,k}^{i}\sim N(0,I_{d})$ and $\Delta t$ is a suitably chosen time-step. In practice, unless $\Delta t$ is sufficiently small to allow an accurate Euler-Maruyama approximation, $f_{t}(u_{t,k}^{i})$ will describe recursive application of the numerical approximation.

Algorithm 1 provides a complete description of the bootstrap particle filter when applied to a single experimental unit. However notice the addition of a non-standard and optional sorting step 2b’, which turns useful when implementing a correlated pseudo-marginal approach, as described in Section 4.3. For the resampling step we follow [29] among others and use systematic resampling (see e.g. [30]), which only requires simulating a single uniform random variable at each time point. It is straightforward to augment the auxiliary variable $u^{i}$ to include the random variables used in the resampling step. As a by-product of the particle filter, the observed data likelihood $\pi(y^{i}|\kappa,\xi,\phi^{i})$ can be estimated via the quantity

[TABLE]

Moreover, the corresponding estimator can be shown to be unbiased [31, 32].

The full Gibbs sampler for generating draws from the joint posterior (10) is given by Algorithm 2. For ease of exposition, we have blocked the updates for $\kappa$ and $\xi$ , but note that the use of separate updates for these parameters is straightforward. The precise implementation of step 4 of the Gibbs sampler is likely to be example specific, and we anticipate that a direct draw of $\eta^{(j)}\sim\pi(\cdot|\phi^{(j)})$ will often be possible. For example when the components of $\phi$ are assumed to be normally distributed and $\eta$ consists of the corresponding means and precisions, for which a semi-conjugate prior specification is possible, see Section 5.1.

Executing Algorithm 2 requires $n\sum_{i=1}^{M}N_{i}$ draws from the transition density governing the SDE in (1) per iteration. In scenarios where the transition density is intractable, draws of a suitable numerical approximation are required. For example, we may use the Euler-Maruyama discretisation with time step $\Delta t=1/m$ , where $m\geq 1$ is chosen to limit the associated discretisation bias (and typically $m\gg 1$ ). In this case, order $mn\sum_{i=1}^{M}N_{i}$ draws of (7) are required. As discussed by [8], the number of particles per experimental unit, $N_{i}$ , should be scaled in proportion to the number of data points $n$ . Consequently, the use of PMMH kernels is likely to be computationally prohibitive in practice. We therefore consider the adaptation of a recently proposed correlated PMMH method for our problem.

4.3 A correlated pseudo-marginal approach

Consider again the task of sampling the full conditional $\pi(\phi^{i},u^{i}|\kappa,\eta,\xi,y^{i})$ associated with the $i$ th experimental unit. In steps 2(a–c) of Algorithm 2, a (pseudo-marginal) Metropolis-Hastings step is used whereby the auxiliary variables $u^{i}$ are proposed from the associated pdf $g(\cdot)$ (notice we could introduce a subject-specific $g_{i}(\cdot)$ , but we refrain from doing so in the interest of a lighter notation). As discussed by [29] (see also [33]), the proposal kernel need not be restricted to the use of $g(u^{i})$ . The correlated PMMH (CPMMH) scheme generalises the PMMH scheme by generating a new $u^{i*}$ from $K(u^{i*}|u^{i})$ where $K(\cdot|\cdot)$ satisfies the detailed balance equation

[TABLE]

It is then straightforward to show that a MH scheme with proposal kernel $q(\phi^{i*}|\phi^{i})K(u^{i*}|u^{i})$ and acceptance probability (13) satisfies detailed balance with respect to the target $\pi(\phi^{i},u^{i}|\kappa,\eta,\xi,y^{i})$ .

We take $g(u^{i})$ as a standard Gaussian density and $K(u^{i*}|u^{i})$ as the kernel associated with a Crank–Nicolson proposal [29]. Hence

[TABLE]

where $I_{d}$ is the identity matrix whose dimension $d$ is determined by the number of elements in $u^{i}$ . The parameter $\rho$ is chosen to be close to 1, to induce strong positive correlation between $\hat{\pi}_{u^{i}}(y^{i}|\kappa,\Sigma,\phi^{i})$ and $\hat{\pi}_{u^{i*}}(y^{i}|\kappa,\Sigma,\phi^{i*})$ , thus reducing the variance of the acceptance probability in (13), which is beneficial because it reduces the chance of accepting an overestimation of the likelihood function. Taking $\rho=0$ gives the special case that $K(u^{i*}|u^{i})=g(u^{i*})$ , which corresponds to the standard PMMH. Iteration $j$ of step 2 of Algorithm 2 then becomes

For $i=1,\ldots,M$ :

(a)

Propose $\phi^{i*}\sim q(\cdot|\phi^{i,(j-1)})$ . Draw $\omega\sim\textrm{N}(0,I_{d})$ and put $u^{i*}=\rho u^{i,(j-1)}+\sqrt{1-\rho^{2}}\omega$ .

(b)

Compute $\hat{\pi}_{u^{i*}}(y^{i}|\kappa^{(j-1)},\xi^{(j-1)},\phi^{i*})$ by running Algorithm 1 with $u^{i*}$ , $\phi^{i*}$ , $\kappa^{(j-1)}$ , $\xi^{(j-1)}$ and $y^{i}$ .

(c)

With probability given by (13) put $\phi^{i,(j)}=\phi^{i*}$ and $u^{i,(j)}=u^{i*}$ . Otherwise, store the current values $\phi^{i,(j)}=\phi^{i,(j-1)}$ and $u^{i,(j)}=u^{i,(j-1)}$ .

Care must be taken here when executing Algorithm 1 in Step 2(b). Upon changing $\phi^{i}$ and $u^{i}$ , the effect of the resampling step is likely to prune out different particles, thus breaking the correlation between successive estimates of observed data likelihood. Sorting the particles before resampling can alleviate this problem [29]. We follow [34] (see also [35]) by using a simple Euclidean sorting procedure which, for the case of a 1-dimensional latent state (e.g. when $\dim(X_{t}^{i})=1$ for every $t$ ) implies, prior to resampling the particles, to sort the particles from the smallest to the largest. This is step 2b’ in algorithm 1, denoted “optional” as it only applies to CPMMH, not PMMH.

4.4 Tuning the number of particles for likelihood approximation

It remains that we can choose the number of particles $N_{i}$ to be used to obtain estimates of the observed data likelihood contributions $\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})$ . Note that we allow a different number of particles per experimental unit to accommodate differing lengths of the $y^{i}$ and potential model misspecification at the level of an individual unit. In the case of PMMH, a simple strategy is to fix $\phi^{i}$ , $\kappa$ and $\xi$ at some central posterior value (obtained from a pilot run), and choose $N_{i}$ so that the variance of the log-likelihood (denoted $\sigma^{2}_{N_{i}}$ ) is around 2 [36, 37]. When using a CPMMH kernel, we follow [38, 34] by choosing $N_{i}$ so that $\sigma^{2}_{N_{i}}=2.16^{2}/(1-\rho_{l}^{2})$ where $\rho_{l}$ is the estimated correlation between $\log\hat{\pi}_{u^{i}}(y^{i}|\kappa,\xi,\phi^{i})$ and $\log\hat{\pi}_{u^{i*}}(y^{i}|\kappa,\xi,\phi^{i})$ . Hence, an initial pilot run (with the number of particles set at some conservative value) is required to determine plausible values of the parameters. This pilot run can also be used to give estimates of $var(\phi^{i}|y^{i})$ , $i=1,\ldots,M$ , each of which can subsequently be used as the innovation variance in a Gaussian random walk proposal for $\phi^{i}$ .

4.5 Tuning the proposal distributions

The block structure of the Gibbs sampler (Algorithm 2) requires two proposal densities: $\phi^{i*}\sim q(\cdot|\phi^{i,(j-1)})$ and $(\kappa^{*},\xi^{*})\sim q(\cdot|\kappa^{(j-1)},\xi^{(j-1)})$ that have to be chosen to achieve an algorithm that efficiently explores the posterior parameter space.

In Sections 5.1 and 5.3 we employ the generalized Adaptive Metropolis (AM) algorithm [39] to tune the two proposal distributions. Regarding the generation of proposals $\phi^{i*}$ , in the first step of the blocked Gibbs scheme we tune subject-specific proposal distributions, separately for each $\phi^{i*}$ . In addition to these $M$ proposal distributions we also tune a proposal distribution for $(\kappa^{*},\xi^{*})$ . Thus, we automatically tune overall $M+1$ proposal distributions via the generalized AM algorithm. Additionally, in Sections 5.1 and 5.3 we found that the use of different proposal distributions for each $\phi^{i*}$ was beneficial since random effects for the different subjects varied around very different values.

5 Applications

5.1 Ornstein-Uhlenbeck SDEMEM

We consider the following Ornstein-Uhlenbeck (OU) SDEMEM

[TABLE]

Here $\theta_{2}^{i}\in\mathbb{R}$ is the stationary mean for the $\{X_{t}^{i}\}$ process, $\theta_{1}^{i}>0$ is a growth rate (expressing how rapidly the system reacts to perturbations) and $\theta_{3}^{i}$ is the diffusion coefficient. The OU process is a standard toy-model in that it is completely tractable, that is the associated SDE has a known (Gaussian) transition density, e.g. [5]. This fact, coupled with the assumption that the $Y_{t}^{i}|X_{t}^{i}$ are conditionally Gaussian and linear in the latent states, implies that we can apply the Kalman filter to evaluate the likelihood function exactly. Therefore, exact inference is possible for the OU SDEMEM (both maximum likelihood and Bayesian). For all units $i$ we simulate $n=200$ observations, with constant observational time-step $\Delta t$ . In our setup, all random effects $(\theta_{1}^{i},\theta_{2}^{i},\theta_{3}^{i})$ are assumed strictly positive, and therefore we work with their log-transformed version and set $\phi^{i}=(\log\theta^{i}_{1},\log\theta^{i}_{2},\log\theta^{i}_{3})$ , where

[TABLE]

and $\eta=(\mu_{1},\mu_{2},\mu_{3},\tau_{1},\tau_{2},\tau_{3})$ , with $\tau_{j}$ the precision of $\phi_{j}^{i}$ . The SDEMEM (18) has no parameters $\kappa$ that are shared among subjects, and the full set of parameters that we want to infer is $(\mu_{1},\mu_{2},\mu_{3},\tau_{1},\tau_{2},\tau_{3},\sigma_{\epsilon})$ .

As already mentioned, we can compute the likelihood $\pi(y|\phi,\sigma_{\epsilon})=\prod\limits_{i=1}^{M}\pi(y^{i}|\phi^{i},\sigma_{\epsilon})$ exactly, using a Kalman filter (see [40] and [9] for a description pertaining SDEMEMs). The filter can then be used in Algorithm 2, that is we avoid using the particle filter (Algorithm 1) and replace it with the Kalman filter in Algorithm 2. Results from Algorithm 2 when using this approach are denoted with “Kalman”. The transition density for the latent state is known and therefore we do not need to use an Euler-Maruyama discretization when propagating the states forward in the particle filter. Instead we propagate the particles using the simulation scheme induced by the exact transition density:

[TABLE]

where $u^{i}_{t}\sim N(0,1)$ independently for all $t$ and all $i$ . Clearly, the $u_{t}^{i}$ appearing in (19) are among the variates that we will correlate, when implementing CPMMH, in addition to the variates produced in the resampling steps.

We compare “Kalman” to four further methods: “naive PMMH”, where we employ Algorithm 2 with the naive Gibbs scheme (see Section 4.1), “PMMH”, which is Algorithm 2, “CPMMH-099”, which is Algorithm 2 with a Crank-Nicolson proposal for the $u^{i}$ using a correlation of $\rho=0.99$ , and “CPMMH-0999” where we use a correlation of $\rho=0.999$ . The number of particle used for each method was selected using the methods described in Section 4.4. All five methods return exact Bayesian inference, and while this is obvious for “Kalman”, we remind the reader that this holds also for the other four approaches as these are instances of the pseudo-marginal approach. Therefore, special interest is in efficiency comparisons between the last four algorithms, “Kalman” being the obvious gold-standard.

We simulated data from the model in (18) with the following settings (data are in Figure 1): $M=40$ experimental units, $n=200$ observations for each unit using a time step $\Delta t=0.05$ , $\sigma_{\epsilon}=0.3$ , and $\eta=(\mu_{1},\mu_{2},\mu_{3},\tau_{1},\tau_{2},\tau_{3})=(-0.7,2.3,-0.9,4,10,4)$ . The prior for the observational noise standard deviation $\sigma_{\epsilon}$ was set to a Gamma distribution $Ga(1,0.4)$ , and the priors for the $\eta$ parameters were set to

[TABLE]

where,

[TABLE]

The priors in (20) are semi-conjugate and we can therefore use a tractable Gibbs step to sample $\eta$ in step 4 of Algorithm 2. An extended introduction to the semi-conjugate prior, including the tractable posterior can be found in [41].

We ran all four methods for $60$ k iterations, considering the first $10$ k iterations to be the burn-in period. We set the starting value for $\sigma_{\epsilon}$ at $\sigma_{\epsilon_{0}}=0.2$ , which is far from its ground truth value. The starting values for the random effects $\phi^{i}_{j}$ were set to their prior means. The proposal distributions were adaptively tuned using the generalized AM algorithm and the particle filters were implemented on a single-core computer, thus no parallelization was utilized. We used the same number of particles $N_{i}\equiv N$ for all units. Results are in Table 1 and Figures 2-3. As a reference for the efficiency of the considered samplers, we take the minimum ESS per minute (mESS/m in Table 1) as measured on PMMH-naive as “base/default” value and set it to 1 in the rightmost column of Table 1. The minimum ESS per minute for the other samplers are relative to the PMMH-naive value. The mESS value is computed over all parameter chains (including individual random effects), i.e. the chains for $\phi$ , $\sigma_{\epsilon}$ and $\eta$ . From Table 1 we conclude that CPMMH is about 20 to 40 times more efficient than PMMH in terms of mESS/m, depending on which correlation level we use. Furthermore, “Kalman” is about $5140$ times more efficient than PMMH. However, the latter comparison is not very interesting since the Kalman filter can be applied only to a very restricted class of models. The marginal posteriors in Figure 2–3 show that the several methods generate very similar posterior inference, which is reassuring. We left out the inference results from CPMMH-0999 for reasons of clarity. However we observed that with $N=50$ CPMMH-0999 produces a slightly biased inference for $\sigma_{\epsilon}$ , due to failing to adequately mix over the auxiliary variable $u$ , while inference for the remaining parameters is similar to the other considered methods. We verified (results not shown) that using $N=100$ is enough to repair this problem. From Figure 2–3 we can conclude that all parameters, with the possible exclusion of $\tau_{2}$ , are well inferred. Regarding $\tau_{2}$ , this is the precision for $\theta_{2}^{i}$ , the latter representing the stationary mean for a OU model. Clearly, by looking at Figure 1, the occasional outlier in the upper part of the Figure may contribute to underestimating the true precision of the stationary mean. To check if CPMMH indeed is necessary, we tried to run PMMH with 100 particles (i.e., the same number of particles as for CPMMH-099). The inference results produced with PMMH with 100 particles gave considerable mismatch (in terms of posterior output) for both the $\eta$ parameters and $\sigma_{\epsilon}$ relative to that obtained from CPMMH-099, resulting from the extremely poor mixing of the chain.

In summary, CPMMH is able to return reliable inference with a much smaller number of particles than PMMH, while resulting in a procedure that is about 20 to 40 times more efficient than PMMH (the 40-times figure is valid if we are ready to accept a small bias in $\sigma_{\epsilon}$ ). Again, for most models exact inference based on a closed-form expression for the likelihood function is unavailable, therefore being able to obtain accurate inference using a computationally cheaper version of PMMH is very appealing.

Notice that while for this simple case study PMMH-naive has the same mESS than PMMH, this is not the case for the case study in Section 5.2, where using the blocked-Gibbs sampler produces a much larger mESS value compared to naive-Gibbs.

5.1.1 Investigating the choice of number of particles

A crucial problem when running methods based on particle filters is the selection of the number of particles $N$ . In this section we investigate this problem by running CPMMH-099 and CPMMH-0999 with $N=[5,10,20,50,100]$ particles using 25 different (simulated) data sets. We also ran the Kalman algorithm using the 25 different data sets for comparison purposes. In this analysis, we are only interested in investigating the quality and computational efficiency of the inference. Hence, we initialised all algorithms at the ground truth parameter values and ran each algorithm for 60k iterations, and discarding the first 10k iterations as burnin period. We first estimated the Wasserstein distance, between the marginal posteriors for $\sigma_{\epsilon}$ and $\eta$ from the CPMMH algorithms and the corresponding Kalman-based marginal posteriors. This distance was computed via the POT package [42] (we do not compute the Wasserstein distance for the marginal posterior of the random effects $\phi^{i}$ , since this is not of central interest for us). All Wasserstein distances are based on the last 5k samples of the corresponding chains. To obtain a performance measure that takes into account both the quality of the inference and the computational effort, we multiply the Wasserstein distances by the runtimes (in minutes) of the CPMMH algorithms, and obtain the performance measure Wasserstein distance $\times$ runtime (m); see Figure 4 and 5. Smaller values of this measure are to be preferred as they indicate high computational efficiency and/or accurate inference. The reason for considering this performance measure is to take the quality of the inference into account, since for $N<20$ we noticed that it is possible to obtain chains that do not indicate adequate convergence within a reasonable time-frame.

We can conclude that, on average, results for different correlation levels are similar. However, for $\sigma_{\epsilon}$ we obtain a better performance when using more particles (lower Wasserstein distance $\times$ runtime (m) value), this resulting from inaccurate inference for $\sigma_{\epsilon}$ when using too few ( $N<50$ ) particles, leading to a large Wasserstein distance. However, this is not the case for $\eta$ since Figure 5 shows that the performance is better with fewer particles, a result that we obtain since the inference for $\eta$ is good even when using few particles (though not reported, in our analyses we observed that the Wasserstein distances for $\eta$ are similar across all attempted values of $N$ ). Thus, if we want to infer the measurement noise parameter $\sigma_{\epsilon}$ accurately, in this case we will have to use $N\geq 50$ particles, while the inference for $\eta$ is satisfactory, even with fewer particles.

Another issue that we analyse is the variability of mESS for the different data sets, based on 50k iterations of CPMMH. To investigate this we computed the 25th and 75th percentiles of mESS for CPMMH-099 with $N=100$ and CPMMH-0999 with $N=50$ based on the inference results on all unknown parameters from 25 simulated data sets. We obtain that the 25th and 75th percentiles of mESS for CPMMH-099 ( $N=100$ ) are $[227,240]$ , and for CPMMH-0999 ( $N=50$ ) are $[227,252]$ . Given that the several mESS are computed on different datasets, some degree of variation in the measure is expected and we conclude that the observed mESS variability is fairly small.

5.2 Tumor growth SDEMEM

We consider a stochastic differential mixed effects model that has been used to describe the tumor volume dynamics in mice receiving a treatment. Here we study a simplified version of the model in [11], and is given by

[TABLE]

for experimental units $i=1,\ldots,M$ . Here, $W_{1,t}$ and $W_{2,t}$ are uncorrelated Brownian motion processes, $X_{1,t}^{i}$ and $X_{2,t}^{i}$ are respectively the volume of surviving tumor cells and volume of cells killed by a treatment for mouse $i$ . Let $V_{t}^{i}=X_{1,t}^{i}+X_{2,t}^{i}$ denote the total tumor volume at time $t$ in mouse $i$ . The observation model is given by

[TABLE]

Let $\phi^{i}=(\log\beta^{i},\log\gamma^{i},\log\delta^{i},\log\psi^{i})$ . We complete the SDEMEM specification via the assumption that

[TABLE]

so that $\eta=(\mu_{1},\ldots,\mu_{4},\tau_{1},\ldots,\tau_{4})$ .

We recognise that $X_{1,t}^{i}$ and $X_{2,t}^{i}$ are geometric Brownian motion processes and (21) can be solved analytically to give

[TABLE]

where $logN(\cdot,\cdot)$ denotes the log-Normal distribution. Despite the availability of a closed form solution to the underlying SDE model, the observed data likelihood is intractable, due to the nonlinear form of (22) as a function of $\log(X_{1,t}^{i}+X_{2,t}^{i})$ . Nevertheless, a tractable approximation can be found, by linearising $\log V_{t}^{i}$ . The resulting linear noise approximation (LNA) is derived in B, and in what follows, we compare inference under the gold standard PMMH to that obtained under the LNA.

We mimicked the real data application in [11] by generating $21$ observations at integer times for $M=10$ replicates. We took

[TABLE]

and sampled $\phi^{i}_{j}|\eta$ using (23). The latent SDE process was then generated using (24) with an initial condition of $x_{0}^{i}=(75,75)^{T}$ (assumed known for all units), and each observation was corrupted according to (22) with $\sigma^{2}_{e}=0.2$ . The resulting data traces are consistent with the observations on total tumor volume of those subjects receiving chemo therapy in [11] and can be seen in Figure 6. We adopted semi conjugate, independent $\textrm{N}(-2,1)$ and $Ga(2,0.2)$ priors for the $\mu_{j}$ and $\tau_{j}$ respectively. We took $\log\sigma_{e}\sim\textrm{N}(0,1)$ to complete the prior specification. Given the use of synthetic data of equal length for each experimental unit, we pragmatically took the number of particles as $N_{i}=N$ , $i=1,\ldots,10$ . Our choice of $N$ was guided by the tuning advice of Section 4.4. For example, with CPMMH we obtain typical $\rho_{L}$ values of around 0.75, when parameter values are fixed at an estimate of the posterior mean. This gives $\sigma^{2}_{N}=10.6$ which is achieved with $N=7$ particles. To avoid potentially sticky behaviour of the chain in the posterior tails, we choose the conservative value $N=10$ . We compare four approaches: naive PMMH (where the $u^{i}$ are updated with both the subject specific and common parameters), PMMH (where the $u^{i}$ are only updated with the subject specific parameters – Algorithm 2), CPMMH (Algorithm 2 with a Crank-Nicolson proposal on the $u^{i}$ ) and the LNA based approach. We ran each scheme for $500k$ iterations. The results are summarised in Table 2 and Figure 7.

Figure 7 shows marginal posterior densities of the components of $\eta$ . We see that inferences for these parameters are consistent with the true values that generated the data (with similar results obtained for the other parameters) and that inference via CPMMH is consistent with that from the gold-standard PMMH. Similar results are obtained for $\sigma_{\epsilon}$ (not shown for brevity). At the same time, from Table 2 we note that CPMMH with $\rho=0.999$ is about 11 times more efficient than the naive PMMH and almost 3 times more efficient than PMMH with additional blocking. Finally, the LNA-based approach provides an accurate alternative to PMMH, except for $\tau_{4}$ . However, everything considered, CPMMH is to be preferred here as its computational efficiency is comparable to LNA, but unlike the latter, CPMMH provides accurate inference for all parameters, and unlike LNA the CPMMH approach is plug-and-play.

5.2.1 Use of the Euler-Maruyama approximation

We anticipate that for many applications of interest, an analytic solution of the underlying SDE will not be available. It is common place to use a numerical approximation in place of an intractable analytic solution. The simplest such approximation is the Euler-Maruyama (E-M) approximation. In this section, we investigate the effect of the E-M on the performance of PMMH and CPMMH for the tumor growth model.

The Euler-Maruyama approximation of (21) is

[TABLE]

where, for example, $\Delta X_{1,t}^{i}=X_{1,t+\Delta t}^{i}-X_{1,t}^{i}$ and $\Delta W_{1,t}^{i}\sim N(0,\Delta t)$ , with other terms defined similarly. To allow arbitrary accuracy of E-M, the inter-observation time length $\Delta t$ is replaced by a stepsize $\Delta t=1/L$ for the numerical integration, for integer $L\geq 1$ . We find that using $L=5$ (giving 4 intermediate times between observation instants) allows sufficient accuracy (compared to the analytic solution) to permit use of the same tuning choices when re-running PMMH (including the naive scheme) and CPMMH. Our findings are summarised by Table 3.

Unsurprisingly, inspection of Table 3 reveals that relative performance between the three computing pseudo-marginal schemes is similar to that obtained when using the analytic solution; CPMMH provides almost an order of magnitude increase in terms of mESS/m over a naive PMMH approach. We note that use of the Euler-Maruyama approximation requires computation and storage of an additional $1/\Delta t$ innovations per SDE component, inter-observation interval, particle and subject, thus accounting for the increase in CPU time compared to when using the analytic solution. Nevertheless, we find that our proposed approach is able to accommodate an intractable SDE scenario and provides a worthwhile increase in performance over competing approaches.

5.2.2 Comparison with ODEMEM

To highlight the potential issues that arise by ignoring inherent stochasticity, we consider inference for an ordinary differential equation mixed effects model (ODEMEM) of tumor growth. We take the SDEMEM in (21) and set $\gamma^{i}=\psi^{i}=0$ to give

[TABLE]

for $i=1,\ldots,M$ . The observation model and random effects distributions remain unchanged from (22) and (23) upon omitting $\log\gamma^{i}$ and $\log\psi^{i}$ from $\phi^{i}$ . The ODE system in (25) can be solved to give

[TABLE]

The likelihood associated with each experimental unit is then obtained simply as

[TABLE]

Fitting the ODEMEM to the synthetic data set from Section 5.2 is straightforward, via a Metropolis-within-Gibbs scheme. Figures 8 and 9 summarise our findings. Unsurprisingly, since the ODEMEM is unable to account for intrinsic stochasticity, the observation standard deviation is massively over-estimated. Figure 8 shows little agreement between the marginal posteriors under the ODEMEM and SDEMEM for this parameter. In terms of model fit, both the observation ( $Y_{t}^{1}$ ) and latent process ( $X_{t}^{1}=\log V_{t}^{1}$ ) predictive distributions for unit 1 are over concentrated for the ODEMEM. Similar results (not shown) are obtained for the other experimental units. Notably, from Figure 9, around half of the actual simulated $X_{t}$ values lie outside of the 95% credible interval under the ODEMEM.

5.3 Neuronal data

Here we consider a much more challenging problem: modelling a large number of observations pertaining neuronal data. In particular, we are interested in modelling the neuronal membrane potential across inter-spike intervals (ISIs). The problem of modelling the membrane potential from ISIs measurements using SDEs has already been considered numerous times, also using SDEMEMs, see [43]. In fact here we analyze the same data considered in [44] and [43], or actually a subset thereof, due to computational constraints. The “leaky integrate-and-fire” appears to be one of the most common models, in both artificial neural network applications and descriptions of biological systems. Deterministic and stochastic implementations of the model are possible. In the stochastic version, under specific assumptions [45], it coincides with the Ornstein-Uhlenbeck stochastic process and has been extensively investigated in the neuronal context, for instance in [46]. Consider Figure 10 as an illustrative example, reporting values of neuronal membrane depolarization studied in [47]. Inter-spike-intervals are the observations considered between “firing” times of the neuron, the latter being represented by the spikes appearing in Figure 10 (notice these are not the data we analysed. This figure is only used for illustration). Data corresponding to the near-deterministic spikes are removed, and what is left constitutes data from several ISIs. As in [43], we consider data from different ISIs as independent. Hence, $M$ is the number of considered ISIs. These are 312 in total, however, because of computational limitations, we will only analyze a subset of 100 ISIs, hence our results are based on $M=100$ and a total of 162,610 observations. A challenge is posed by the fact that some ISIs are much longer than others (in our case they vary between 600 and 2,500 observations), meaning that longer ISIs could typically require a larger $N$ to avoid particle depletion, but using the same large $N$ to approximate all $M$ likelihood terms would be a waste of computational resources. This is why CPMMH comes particularly useful, as it allows to keep a small $N$ across all units while still avoiding sticky behaviour in the MCMC chains. Data from the 100 ISIs are plotted on a common time-scale in Figure 11 (after some translation to let each ISI start approximately at zero value at time zero).

These consist of membrane potentials measured every 0.15 msec intracellularly from the auditory system of a guinea pig (for details on data acquisition and processing, see [48]).

Outside the mixed-effects context, if we denote the neuronal input with $\nu$ , and if the neuron is supposed to operate in a stationary state during some time of interest, then $\nu$ would be assumed constant during this period. [43] generalize by assuming that in addition to $\nu$ there is a random component changing from one ISI to the next, which could be caused by the naturally occurring variations of environment signaling, by experimental irregularities or by other sources of noise not included in the model. This fact can then be modeled by assuming that each ISI has its own input $\nu^{i}$ , and [43] specifically assume that the $\nu^{i}$ are iid Gaussian distributed with mean $\nu$ . An extension of the model in [43] is the following state-space type SDEMEM

[TABLE]

where the diffusion process $\{X_{t}^{i};t\geq 0\}$ models the membrane potential [mV] in the $i$ th ISI, with input $\nu^{i}$ $[\text{mV/msec}]$ . The spontaneous voltage decay (in the absence of input) for the $i$ th ISI is $(\lambda^{i})^{-1}$ $[\text{msec}]$ , which means that the stationary mean for $\{X_{t}^{i}\}$ is $\nu^{i}/\lambda^{i}$ , see e.g. [46] for details. The diffusion coefficients $\sigma^{i}$ have unit [mV/ $\sqrt{\mathrm{msec}}$ ]. Clearly, we assume that we are unable to observe $\{X_{t}^{i}\}$ directly, and instead can only observe a noisy realization from $\{Y_{t};t\geq 0\}$ . Differences with the SDEMEM in [43] are that: (i) their observations were assumed unaffected by measurement noise, i.e. observations were directly available from $\{X_{t}^{i};t\geq 0\}$ , $i=1,...,M$ , which is a convenient assumption easing calculations towards obtaining exact maximum likelihood estimation, but that it is generally possible to argue against; (ii) in [43] the only random effect was $\nu^{i}$ , and remaining parameters were fixed-effects, while in the present case we have random effects $\lambda^{i}$ and $\sigma^{i}$ in addition to $\nu^{i}$ . Of course here we also need to estimate $\sigma_{\epsilon}$ , which was not done in [43] since no measurement error was assumed.

As in Section 5.1 the random effects are constrained to be positive and we therefore define $\phi^{i}=(\phi^{i}_{1},\phi^{i}_{2},\phi^{i}_{3})=(\log\lambda^{i},\log\nu^{i},\log\sigma^{i})$ , where

[TABLE]

and $\eta=(\mu_{1},\mu_{2},\mu_{3},\tau_{1},\tau_{2},\tau_{3})$ , with $\tau_{j}$ the precision of $\phi_{j}^{i}$ . Since we here have a similar setting as in Section 5.1, we employ the same semi-conjugate priors with hyperparameters

[TABLE]

The considered data are measured with techniques ensuring high precision, and we assume the following prior $\log\sigma_{\epsilon}\sim\textrm{N}(-1,1)$ . Because of the small measurement noise, we expect that a bootstrap filter will perform poorly, leading to a very noisy approximation of the likelihood $\pi(y|\phi,\sigma_{\epsilon})=\prod\limits_{i=1}^{M}\pi(y^{i}|\phi^{i},\sigma_{\epsilon})$ . To be able to obtain a good approximation of the likelihood, we instead use the bridge particle filter found in [49], since, as explained below, the bootstrap filter is statistically inadequate for this experiment (moreover, it is also computationally inadequate, since it would require a too large number of particles, which was impossible to handle with the limited memory of our computer). In A, we derive the bridge filter for the model in (28), and we also compare the forward propagation of the particles that we obtain using the bootstrap filter and the bridge filter. In A.2 we see that the likelihood approximation obtained from the bootstrap filter is very inaccurate, which is due to its inability to handle measurements with small observational noise. Consequently, the number of particles required to give likelihood estimates with low variance is computationally prohibitive. Therefore, for this example, we only report results based on the bridge filter (which is not a plug-and-play method).

We use the following four algorithms already defined in Section 5.1: Kalman, which obviously here is the gold-standard method; PMMH, using the bridge filter with $N=1$ particle; CPMMH-0999 using the bridge filter also with 1 particle, and CPMMH-09 using the bridge filter with 1 particle. We find that, due to propagating particles conditional on the next observation, using a single particle was enough to give likelihood estimates with low variance. We ran all algorithms for 100k iterations, considering the first 20k iterations as burn-in. The starting value for $\sigma_{\epsilon}$ was set far away from the posterior mean that we obtained from a pilot run of the Kalman algorithm, and the starting values for the random effects $\phi^{i}_{j}$ were set to their prior means. For all algorithms, the proposal distributions were tuned adaptively using the generalized AM algorithm as described in Section 4.5. We ran the algorithms on a single-core computer so no parallelization was utilized. Posterior marginals in Figures 12-13 show that inference results for all algorithms are very similar, except for CPMMH-0999, for which posterior samples of $\sigma_{\epsilon}$ are inconsistent with the output from the other competing schemes. We note that the case of $N=1$ can be seen to correspond to a joint update of the parameters and latent process $x$ . Inducing strong positive correlation between successive values of $u$ therefore results in extremely slow mixing over the latent process and in turn, the parameters. This is particularly evident for $\sigma_{\epsilon}$ , whose update requires calculation of likelihood estimates over all experimental units. Reducing $\rho$ to 0.9 appears to alleviate this problem. Runtimes and ESS values are in Table 4. As expected, Kalman is the most efficient algorithm, being 19 times more efficient than PMMH is terms of ESS/min. However, here PMMH and CPMMH have the same efficiency in terms of ESS/min. Thus, CPMMH does not seem to produce any efficiency improvement for this case study. This is due to the efficiency of the bridge filter in guiding state proposals towards the next observation, and therefore allowing us to run PMMH with very few particles, thus making the potential improvement brought by CPMMH essentially null.

We compare our results with those in [43]. Since we have assumed that the random effects $\phi^{i}=(\phi^{i}_{1},\phi^{i}_{2},\phi^{i}_{3})=(\log\lambda^{i},\log\nu^{i},\log\sigma^{i})$ are Gaussian, then the $(\lambda^{i},\nu^{i},\sigma^{i})$ are log-Normal distributed with means $(\lambda,\nu,\sigma)$ and standard deviations $(\sigma_{\lambda},\sigma_{\nu},\sigma_{\sigma})$ respectively. By plugging the posterior means for $(\log\lambda^{i},\log\nu^{i},\log\sigma^{i})$ as returned by “Kalman” into the formulas for the mean and standard deviation of a lognormal distribution, we obtain that $\lambda=0.036$ $(\sigma_{\lambda}=0.009)$ [1/msec], $\nu=0.406$ $(\sigma_{\nu}=0.105)$ [mV/msec], and $\sigma=0.433$ , $(\sigma_{\sigma}=0.072)$ . In [43] we used a maximum likelihood approach, which is a fast enough procedure for Markovian data (there we did not assume a state-space model) that allowed us to obtain point estimates using all 312 ISIs (instead of 100 ISIs as in this case), but still slow enough to not permit bootstrapped confidence intervals to be obtained. Therefore, there we reported intervals based on asymptotic normality. There we had point estimates $\hat{\nu}=0.494$ and $\hat{\sigma}_{\nu}=0.072$ , which are similar to our Bayesian estimation. It makes sense that the inferences are not very different, as in the end our estimation of $\sigma_{\epsilon}$ is very small, meaning that we could assume nearly Markovian data. However here we have also inferences for random effects $\lambda^{i}$ and $\sigma^{i}$ , whereas in [43] these were assumed fixed (unknown) effects with maximum likelihood estimates $\hat{\lambda}=0.047$ [1/msec] (it can be obtained from Table 1 in [43] via $1/0.021=47.62$ [ $1/\mathrm{sec}$ ]) and $\hat{\sigma}=0.427$ [mV/ $\sqrt{\mathrm{msec}}$ ] (it can be obtained from Table 1 in [43] by converting 0.0135 [V/ $\sqrt{\mathrm{sec}}$ ] into [mV/ $\sqrt{\mathrm{msec}}$ ]). We appreciate how close our posterior means based on 100 ISIs are to the maximum likelihood estimates using 312 ISIs.

6 Discussion

We have constructed an efficient and general inference methodology for the parameters of stochastic differential equation mixed-effects models (SDEMEMs). While SDEMEMs are a flexible class of models for “population estimation”, their use has been limited by technical difficulties that make the execution of inference algorithms (both classic and Bayesian) computationally intensive. Our work proposed strategies to both (i) produce Bayesian inference for very general SDEMEMs, without the limitations of previous methods; (ii) alleviate the computational requirements induced by the generality of our methods. The SDEMEMs we considered are general in the sense that the underlying SDEs can be nonlinear in the states and in the parameters; the random parameters can have any distribution (not restricted to the Gaussian family); the observations equation does not have to be a linear combination of the latent states. We produced a Metropolis-within-Gibbs algorithm (hereafter Gibbs sampler, Algorithm 2) with carefully constructed blocking strategies, where the technically difficult approximation to the unavailable likelihood function is efficiently handled via correlated particle filters. The use of correlated particle filters brings in the well-known benefit of requiring fewer particles compared to the particle marginal Metropolis-Hastings (PMMH) algorithm. In our experiments, the novel blocked-Gibbs sampler embedding a correlated PMMH (CPMMH) shows that it is possible to considerably reduce the number of required particles while still obtaining a value of the effective sample size (ESS) that is comparable to using standard PMMH in the Gibbs sampler. This means that the Gibbs sampler with embedded CPMMH is computationally efficient and on two out of three examples of increasing complexity we found that our algorithm is much more efficient than a similar algorithm using the standard PMMH, sometimes even 40 times more efficient. Some care must be taken when choosing $\rho$ , which governs the level of correlation between successive likelihood estimates. Taking $\rho\approx 1$ can result in the sampler failing to adequately mix over the auxiliary variables. We found that this problem was exacerbated when using relatively few particles (such as $N=1$ ), but can be overcome by reducing $\rho$ . The fact that our approach is an instance of the pseudo-marginal methodology of [7] implies that we produce exact (simulation-based) Bayesian inference for the parameters of our SDEMEMs, regardless the number of particles used. We mostly focus on producing “plug-and-play” methodology (but see below for exceptions), meaning that no preliminary analytic calculations should be required to run our methods, and forward simulation from the SDEs simulator should be enough. Instead, what is necessary to set is the number of particles $N$ and, when correlated particles filters are used (CPMMH), the correlation parameter $\rho$ (however this one is easily set within the interval $[0.90,0.999]$ ). Finally, the usual settings for the MCMC proposal distribution should be decided (covariance matrix of the proposal function $q(\cdot)$ ). However, for the neuronal data example we had to employ a bridge filter, since the observational noise is very low for this case study, causing the bootstrap filter to perform poorly. The bridge filter is not plug-and-play (as discussed below), however in this paper we have decided to include a non-plug-and-play method to show how to analyze complex case studies with existing state-of-art sequential Monte Carlo filters. When considering a plug-and-play approach, our proposed methodology relies on the use of the bootstrap particle filter, within which particles are propagated according to the SDE solution or an approximation thereof. We note that in scenarios where the observations are particularly informative (e.g. the neuronal data case study in Section 5.3), it may be beneficial to propagate particles conditional on the observations, by using a carefully chosen bridge construct. We refer the reader to [35] for details on the use of such constructs within a CPMMH scheme for SDEs. However, notice that in order to use the constructs in [35] the conditional distribution of observations (i.e. (2) in our context) must be Gaussian. This is the underlying assumption that is exploited in [12] to enable the use of bridge constructs in inference for SDEMEMs. In [12] they also use methods based on correlated particle filters, in a work which has been proposed independently and concurrently to ours (July 25 2019 on arXiv). See for example their “component-wise pseudo-marginal” (CWPM) method, which is similar to the naive Gibbs strategy we also propose, and they found that CWPM was the best strategy among a battery of explored methods. In order to correlate the particles, [12] advocate the use of the blockwise pseudo-marginal strategy of [50]: this way, at each iteration of a CPMMH algorithm they randomly pick a unit in the set $\{1,..,M\}$ , and only for that unit they update the corresponding auxiliary variates, whereas for the remaining $M-1$ units they reuse the same auxiliary variates $u^{i}$ as employed in the last accepted likelihood approximation. This approach implies an estimated correlation between log-likelihoods of around $1-1/M$ , which also implies that the correlation level is completely guided by the number of units. This means that for a small $M$ (e.g. $M=5$ or 10, implying a correlation of 0.80 and 0.90 respectively) a blockwise pseudo-marginal strategy might not be as effective as it could be. On the other hand, assuming a very efficient and scalable implementation allowing measurements from $M=10,000$ units, the blockwise pseudo-marginal approach would produce highly correlated particles, which can sometimes be detrimental by not allowing enough variety in the auxiliary variates, and ultimately producing long-term correlations in the parameter chains, as we have documented in Section 5.3 when using a low number of particles $N$ . We therefore think it is advantageous to use a method that allows the statistician to decide on the amount of injected correlation: even though this means having one more parameter to set ( $\rho$ in our treatment), we find this decision to be rather straightforward, as mentioned above.

We hope this work can push forward the use of SDEMEMs in applied research, as even though inference methods for SDEMEMs have been available from around 2005, the limitation of theoretical or computational possibilities have implied that only specific SDEMEMs could be efficiently handled, while other SDEMEMs needed ad-hoc solutions or computationally very intensive algorithms. We believe our work is promising as a showcase of the possibility to employ very general SDEMEMs for practical applications.

Acknowledgments

SW was supported by the Swedish Research Council (Vetenskapsrådet 2013-05167). UP was supported by the Swedish Research Council (Vetenskapsrådet 2019-03924). We thank the staff at the Center for Scientific and Technical Computing at Lund University (LUNARC) for help in setting up the computer environment used for the computations in Section 5.1 and 5.3. We thank J. F. He for making the neuronal data available. We thank the editor and three anonymous reviewers for useful and insightful comments on this paper.

Appendix A Bridge particle filter

A.1 Deriving the bridge filter

This section is not strictly pertaining mixed-effects modelling, hence we disregard the subject’s index. We consider the bridge particle filter proposed in [49], with the exception that there an SDE was numerically solved using the Euler-Maruyama scheme. Here we provide the bridge particle filter for the special case where the exact (Gaussian) transition density is available, as considered for case studies in Sections 5.1 and 5.3. Since we do not require numerical discretization, in terms of the notation established in [49] we have that $m=1$ and $j=0$ . Furthermore, we let $\Delta{\text{obs}}$ denote the step-length for the observational times grid. Thus we have that $\Delta t=\Delta{\text{obs}}$ and $\Delta{j=0}=\Delta{\text{obs}}$ .

Here the bridge filter is derived for the example in section 5.3. The analytical transition density for the $X_{t}$ process in (5.3) is

[TABLE]

The joint density for $X_{t+\Delta t}$ and $Y_{t+\Delta t}$ , conditional on $X_{t}$ , is

[TABLE]

where $\alpha_{0}=x_{t}e^{-\lambda\Delta t}+\frac{\nu}{\lambda}(1-e^{-\lambda\Delta t})$ , and $\beta_{0}=\frac{\sigma^{2}}{2\lambda}(1-e^{-2\lambda\Delta t})$ . The conditional distribution used as proposal distribution in the bridge filter is

[TABLE]

where $\mu=\alpha_{0}+\beta_{0}(\beta_{0}+\sigma^{2}_{\epsilon})^{-1}(y_{t+\Delta t}-\alpha_{0})$ , $\Sigma=\beta_{0}(1-[\beta_{0}+\sigma^{2}_{\epsilon}]^{-1}\beta_{0})$ .

Equation (29) can be used to propagate particles forward, which is a much more efficient approach than in the bootstrap filter case, where the sampler is miopic to the next observation, while (29) is able to look-ahead towards the next observation $y_{t+\Delta t}$ . Thus, the bridge filter is similar in structure to Algorithm 1 with the difference that here the particles propagation step consists in sampling from (29), and the weights are given by

[TABLE]

A.2 Comparing the bootstrap filter and the bridge particle filter

To compare the performance of the bootstrap and the bridge filter, we run both filters with the same number of particles (500 particles for each subject) using the 100 ISIs neuronal data from Section 5.3. Parameters are set at the posterior means obtained from the Kalman algorithm. The comparison is interesting since it illustrates the well known issue of running particle filters when the observational error is small (here we have that $\sigma_{\epsilon}\approx 0.001$ ), and hence it is expected that the bootstrap filter will produce sub-optimal results. This is due to its inability to “target” the next observation, thus producing very small weights due to the small $\sigma_{\epsilon}$ . In Figure 14, we compare the forward propagation of the particles for one ISI chosen at random. It is evident that the bridge filter follows the data more closely. Furthermore, we run each filter independently for 100 times and compare the averages of the log-likelihood values, the standard deviation of the 100 log-likelihood estimations, and the runtimes, see Table 5. We can easily notice the superiority of the bridge filter returning an averaged log-likelihood value very close to the one provided by the Kalman filter. In particular, notice how the log-likelihood estimation is very unreliable (due to the small observation error).

We now compare the inference results for CPMMH when using the bridge filter and the bootstrap filter. We ran four algorithms: Kalman, PMMH with N = 1 particles using the bridge filter, CPMMH-09 with N = 1 particles using the bridge filter, CPMMH-099 with N = 100 particles using the bootstrap filter. We ran, Kalman, PMMH, and CPMMH-09 for 100k iterations, and ran CPMMH-099 for only 35k iterations, as this case is computationally more intensive. In Figure 15 we see that when using the bootstrap filter driven inference scheme, the $\sigma_{\epsilon}$ chain fails to adequately explore regions of high posterior density. We emphasise that this is due to using too few particles ( $N=100$ ). It is clear from Table 5 that the number of particles required to match the efficiency of the bridge filter is computationally infeasible. Marginal posteriors for the remaining parameters (not shown) are however similar for all algorithms. The reason why the population parameters $\eta$ appear to be unaffected by these issues, unlike $\sigma_{\epsilon}$ , is that step 4 of the Gibbs algorithms in section 4.1 (both versions, naive and blocked one) does not depend on the approximated likelihood, whereas step 2 (which samples $\sigma_{\epsilon}$ ) does depend on it.

Appendix B Tumor growth – Linear noise approximation

The linear noise approximation (LNA) can be derived in a number of more or less formal ways. We present a brief informal derivation here and refer the reader to [51] and the references therein for further details. We remark that the LNA is not a necessary feature of our general plug-and-play methodology outlined in Section 4 and Algorithm 2.

B.1 Setup

Consider the tumor growth model in (21), (22) and (23) and a single experimental unit so that the superscript $i$ can be dropped from the notation. To obtain a tractable observed data likelihood, we construct the linear noise approximation of $\log V_{t}=\log(X_{1,t}+X_{2,t})$ .

Let $Z_{t}=(Z_{1,t},Z_{2,t},Z_{3,t})^{T}=(\log V_{t},\log X_{1,t},\log X_{2,t})^{T}$ . The SDE satisfied by $Z_{t}$ can be found using the Itô formula, for which we obtain

[TABLE]

where

[TABLE]

We apply the linear noise approximation (LNA) by partitioning $Z_{t}$ as $Z_{t}=m_{t}+R_{t}$ where $m_{t}$ is a deterministic process satisfying

[TABLE]

and $\{R_{t},t\geq 0\}$ is a residual stochastic process satisfying

[TABLE]

By Taylor expanding $\alpha$ and $\beta$ about the deterministic process $m_{t}$ and retaining the first two terms in the expansion of $\alpha$ , and the first term in the expansion of $\beta$ , we obtain an approximate residual stochastic process $\{\tilde{R}_{t},t\geq 0\}$ satisfying

[TABLE]

where $J_{t}$ is the Jacobian matrix with $(i,j)$ th element $(J_{t})_{i,j}=\partial\alpha_{i}(m_{t},\phi)/\partial m_{j,t}$ . Assuming initial values $m_{0}=z_{0}$ and $\tilde{R}_{0}=0$ , the approximating distribution of $Z_{t}$ is given by

[TABLE]

where $m_{t}$ satisfies (30) and, after several calculations which we omit for brevity, $H_{t}$ is the solution to

[TABLE]

B.2 Inference

Note that the observation model in (22) can be written as

[TABLE]

where $P$ is a $3\times 1$ ‘observation vector’ with first entry 1 and zeroes elsewhere. The linearity of (31) and (33) yields a tractable approximation to the marginal likelihood $\pi(y|\phi,\sigma_{e})$ , which we denote by $\pi_{\textrm{LNA}}(y|\phi,\sigma_{e})$ . The approximate marginal likelihood $\pi_{\textrm{LNA}}(y|\phi,\sigma_{e})$ can be factorised as

[TABLE]

where $y_{1:i-1}=(y_{1},\ldots,y_{i-1})^{T}$ . Suppose that $Z_{1}\sim\textrm{N}(a,C)$ a priori, for some constants $a$ and $C$ . The marginal likelihood under the LNA, $\pi_{\textrm{LNA}}(y_{1:n}|\phi,\sigma_{e}):=\pi_{\textrm{LNA}}(y|\phi,\sigma_{e})$ can be obtained via a forward filter, which is given in Algorithm 3.

Inference for the SDEMEM defined by (21), (22) and (23) may be performed via a Gibbs sampler that draws from the following full conditionals

$\pi_{\mathrm{LNA}}(\phi|\eta,\sigma_{e},y)\propto\prod_{i=1}^{M}\pi(\phi^{i}|\eta)\pi_{\textrm{LNA}}(y^{i}|\sigma_{e},\phi^{i})$ , 2. 2.

$\pi_{\mathrm{LNA}}(\sigma_{e}|\eta,\phi,y)\propto\pi(\sigma_{e})\prod_{i=1}^{M}\pi_{\textrm{LNA}}(y^{i}|\sigma_{e},\phi^{i})$ , 3. 3.

$\pi(\eta|\sigma_{e},\phi,y)\propto\pi(\eta)\prod_{i=1}^{M}\pi(\phi^{i}|\eta)$ .

Bibliography51

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] J. M. Steele, Stochastic calculus and financial applications, Vol. 45, Springer Science & Business Media, 2012.
2[2] D. J. Wilkinson, Stochastic Modelling for Systems Biology, 3rd Edition, Chapman & Hall/CRC Press, Boca Raton, Florida, 2018.
3[3] M. Lavielle, Mixed effects models for the population approach: models, tasks, methods and tools, Chapman and Hall/CRC, 2014.
4[4] P. E. Kloeden, E. Platen, Numerical solution of stochastic differential equations, Springer, 1992.
5[5] C. Fuchs, Inference for diffusion processes with applications in Life Sciences, Springer, 2013.
6[6] H. Sørensen, Parametric inference for diffusion processes observed at discrete points in time, International Statistical Review 72 (3) (2004) 337–354.
7[7] C. Andrieu, G. O. Roberts, The pseudo-marginal approach for efficient computation, Annals of Statistics 37 (2009) 697–725.
8[8] C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods (with discussion), J. R. Statist. Soc. B 72 (3) (2010) 1–269.