Particle Methods for Stochastic Differential Equation Mixed Effects   Models

Imke Botha; Robert Kohn; Christopher Drovandi

arXiv:1907.11017·stat.CO·September 30, 2019

Particle Methods for Stochastic Differential Equation Mixed Effects Models

Imke Botha, Robert Kohn, Christopher Drovandi

PDF

1 Repo

TL;DR

This paper develops advanced particle MCMC methods tailored for stochastic differential equation mixed effects models, addressing intractable likelihoods and improving inference efficiency in complex biological data.

Contribution

It introduces three novel extensions to particle MCMC that exploit SDEMEM-specific features and correlated pseudo-marginal techniques for more efficient inference.

Findings

01

Enhanced inference efficiency demonstrated on real and simulated data

02

Extensions outperform naive approaches in computational speed

03

Applicable to complex biological models with intractable likelihoods

Abstract

Parameter inference for stochastic differential equation mixed effects models (SDEMEMs) is a challenging problem. Analytical solutions for these models are rarely available, which means that the likelihood is also intractable. In this case, exact inference is possible using the pseudo-marginal method, where the intractable likelihood is replaced by its nonnegative unbiased estimate. A useful application of this idea is particle MCMC, which uses a particle filter estimate of the likelihood. While the exact posterior is targeted by these methods, a naive implementation for SDEMEMs can be highly inefficient. We develop three extensions to the naive approach which exploits specific aspects of SDEMEMs and other advances such as correlated pseudo-marginal methods. We compare these methods on real and simulated data from a tumour xenography study on mice.

Tables12

Table 1. Table 1: Log-likelihood results for the IAPM method on the real dataset. The highlighted rows show the combinations which gave the best computation time.

		Prior			L-ODE			Lap-MDB
PF	Cor.	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)
EMD	No	200	0.99	0.21	60	1.04	0.12	28	0.97	0.11
	Yes	90	1.00	0.11	30	1.02	0.10	19	0.99	0.09
MDB	No	180	1.02	0.36	35	0.87	0.11	4	1.02	0.05
	Yes	65	0.93	0.12	16	0.99	0.10	3	0.94	0.04
RB	No	180	1.02	0.38	35	0.85	0.11	4	1.02	0.05
	Yes	65	0.98	0.13	16	0.98	0.11	3	0.95	0.04

Table 2. Table 2: Log-likelihood results for the IAPM method on the sim(10,) datasets. The highlighted rows show the combinations which give the best computation time. The number of observations for each dataset (from left to right): 200, 390, 4,570.

			sim(10, 24)			sim(10, 12)			sim(10, 1)
IS	PF	Cor.	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)
Prior	EMD	No	250	0.97	1.69	370	1.04	6.29	530	0.96	134.1
		Yes	115	0.99	0.54	130	1.00	1.13	335	0.99	52.58
	MDB	No	220	0.95	3.06	220	1.03	5.84	570	0.97	373.3
		Yes	95	0.91	0.86	100	0.93	1.63	250	1.03	80.02
	RB	No	220	0.96	3.07	220	1.03	6.14	570	0.98	385.9
		Yes	95	0.93	0.87	100	0.95	1.79	250	1.01	87.62
L-ODE	EMD	No	220	1.04	1.31	950	1.02	35.9	-	-	-
		Yes	60	0.99	0.32	120	0.98	0.99	370	1.00	62.71
	MDB	No	145	1.03	1.57	800	1.02	55.75	-	-	-
		Yes	20	1.00	0.22	50	1.04	0.78	310	0.98	118.0
	RB	No	145	1.03	1.62	800	1.02	60.45	-	-	-
		Yes	20	1.00	0.21	50	1.04	0.81	310	0.95	126.7
Lap-MDB	EMD	No	40	1.00	0.20	55	0.96	0.45	150	1.03	14.22
		Yes	45	1.00	0.25	75	1.02	0.57	290	0.96	38.61
	MDB	No	8	1.01	0.12	16	0.99	0.23	120	0.99	26.22
		Yes	4	0.88	0.10	10	0.98	0.21	190	1.00	52.18
	RB	No	8	1.01	0.12	16	0.97	0.25	120	0.97	29.70
		Yes	4	0.90	0.11	10	0.97	0.25	190	0.98	53.87

Table 3. Table 3: Log-likelihood results for the IAPM method on the sim(100,) datasets. The highlighted rows show the combinations which give the best computation time. The number of observations for each dataset (from left to right): 2,000, 3,900, 45,700.

			sim(100, 24)			sim(100, 12)			sim(100, 1)
IS	PF	Cor.	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)
Prior	EMD	No	500	1.02	52.78	500	1.01	105.7	-	-	-
		Yes	95	1.00	3.28	110	0.96	8.3985	300	1.01	419.8
	MDB	No	300	1.03	49.71	390	1.01	154.0	-	-	-
		Yes	45	0.97	2.88	45	1.00	6.16	200	1.03	556.8
	RB	No	320	0.97	60.60	390	1.00	174.0	-	-	-
		Yes	45	0.96	3.07	45	0.95	6.05	200	0.98	584.8
L-ODE	EMD	No	-	-	-	-	-	-	-	-	-
		Yes	155	1.00	7.51	370	1.03	76.64	-	-	-
	MDB	No	-	-	-	-	-	-	-	-	-
		Yes	100	0.97	8.13	230	1.00	57.27	-	-	-
	RB	No	-	-	-	-	-	-	-	-	-
		Yes	100	1.00	8.75	230	1.02	65.51	-	-	-
Lap-MDB	EMD	No	130	1.00	5.55	140	1.03	11.70	-	-	-
		Yes	65	1.04	2.42	80	1.00	5.26	300	0.97	417.7
	MDB	No	30	0.96	2.12	50	0.96	7.48	-	-	-
		Yes	4	0.91	0.63	10	0.98	1.53	200	0.99	532.1
	RB	No	30	0.99	2.53	50	0.98	7.92	-	-	-
		Yes	4	0.94	0.68	10	0.99	1.71	200	0.95	587.6

Table 4. Table 4: Log-likelihood results for the IAPM method on the sim(1000,) datasets. The highlighted rows show the combinations which give the best computation time. The number of observations for each dataset (from left to right): 20,000, 39,000, 457,000.

			sim(1000, 24)			sim(1000, 12)			sim(1000, 1)
IS	PF	Cor.	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)	$L, N$	$σ_{Δ}$	time (s)
Prior	EMD	No	-	-	-	-	-	-	-	-	-
		Yes	-	-	-	-	-	-	-	-	-
	MDB	No	-	-	-	-	-	-	-	-	-
		Yes	-	-	-	-	-	-	-	-	-
	RB	No	-	-	-	-	-	-	-	-	-
		Yes	-	-	-	-	-	-	-	-	-
L-ODE	EMD	No	-	-	-	-	-	-	-	-	-
		Yes	-	-	-	-	-	-	-	-	-
	MDB	No	-	-	-	-	-	-	-	-	-
		Yes	-	-	-	-	-	-	-	-	-
	RB	No	-	-	-	-	-	-	-	-	-
		Yes	-	-	-	-	-	-	-	-	-
Lap-MDB	EMD	No	350	1.04	285.5	400	1.05	701.2	-	-	-
		Yes	65	0.97	24.95	80	1.03	50.76	-	-	-
	MDB	No	90	1.04	71.46	145	1.02	292.2	-	-	-
		Yes	4	0.92	6.50	10	1.01	15.65	-	-	-
	RB	No	90	1.05	77.83	145	0.98	315.3	-	-	-
		Yes	4	0.94	7.20	10	1.00	16.63	-	-	-

Table 5. Table 5: Log-likelihood results for the CWPM method on the real data. The highlighted rows show the combinations which give the best time.

PF	Cor.	$L, N$	$σ_{Δ}$	time (s)
EMD	No	200	1.02	0.0044
	Yes	60	0.98	0.0030
MDB	No	1	0.34	0.0023
	Yes	1	0.16	0.0023
RB	No	1	0.33	0.0024
	Yes	1	0.17	0.0024

Table 6. Table 6: Log-likelihood results for the CWPM method on the sim(10,) datasets. The highlighted rows show the combinations which give the best time.

		sim(10, 24)			sim(10, 12)			sim(10, 1)
PF	Cor.	$N$	$σ_{Δ}$	time (s)	$N$	$σ_{Δ}$	time (s)	$N$	$σ_{Δ}$	time (s)
EMD	No	450	1.02	0.0578	700	1.02	0.0915	3100	1.03	1.8672
	Yes	65	1.00	0.0559	85	1.02	0.0583	300	0.95	0.2773
MDB	No	30	1.00	0.0490	110	0.96	0.0728	2100	1.02	3.1266
	Yes	3	0.98	0.0470	10	1.04	0.0554	215	1.00	0.4922
RB	No	30	1.00	0.0503	110	0.96	0.0743	2100	1.02	3.3687
	Yes	3	0.97	0.0473	10	1.02	0.0578	210	0.99	0.54

Table 7. Table 7: Log-likelihood results for the CWPM method on the sim(100,) datasets. The highlighted rows show the combinations which give the best time.

		sim(100, 24)			sim(100, 12)			sim(100, 1)
PF	Cor.	$N$	$σ_{Δ}$	time (s)	$N$	$σ_{Δ}$	time (s)	$N$	$σ_{Δ}$	time (s)
EMD	No	6500	1.04	1.1254	9000	0.99	2.9294	-	-	-
	Yes	120	1.03	0.1107	120	1.05	0.1536	360	1.04	2.783
MDB	No	350	0.97	0.2385	1200	1.00	1.0330	-	-	-
	Yes	3	1.05	0.1038	11	1.03	0.1492	240	0.99	3.544
RB	No	350	0.99	0.2527	1200	0.99	1.13	-	-	-
	Yes	3	1.05	0.1035	11	1.01	0.1523	240	0.99	4.019

Table 8. Table 8: Log-likelihood results for the CWPM method on the sim(1000,) datasets. The highlighted rows show the combination which give the best time.

		sim(1000, 24)			sim(1000, 12)			sim(1000, 1)
PF	Cor.	$N$	$σ_{Δ}$	time (s)	$N$	$σ_{Δ}$	time (s)	$N$	$σ_{Δ}$	time (s)
EMD	No	-	-	-	-	-	-	-	-	-
	Yes	90	1.03	0.4702	110	1.03	0.9162	-	-	-
MDB	No	3500	0.98	12.76	11000	1.04	78.72	-	-	-
	Yes	3	1.05	0.4504	12	0.99	0.8920	-	-	-
RB	No	3500	1.01	13.91	11000	1.02	86.91	-	-	-
	Yes	3	1.04	0.4906	12	0.96	0.9433	-	-	-

Table 9. Table 9: MCMC results for all methods on all datasets. Results are calculated from chains of length 100,000. Dashed lines indicate that the method was not computationally feasible on that particular dataset.

Data	Method	MultiESS	time (min)	MultiESS/time
Real	Naive	802	669	1.20
	IAPM	733	77	9.48
	CWPM	1431	6	242.90
	MPM	1719	41	41.82
sim(10, 24)	Naive	2439	4802	0.51
	IAPM	1289	142	9.11
	CWPM	3064	180	17.00
	MPM	3371	448	7.53
sim(10, 12)	Naive	-	-	-
	IAPM	1504	351	4.29
	CWPM	3028	211	14.38
	MPM	4503	479	9.39
sim(10, 1)	Naive	-	-	-
	IAPM	-	-	-
	CWPM	3197	2127	1.50
	MPM	5607	4786	1.17
sim(100, 24)	Naive	-	-	-
	IAPM	1174	971	1.21
	CWPM	3012	430	7.01
	MPM	3663	1088	3.37
sim(100, 12)	Naive	-	-	-
	IAPM	1181	2849	0.41
	CWPM	2485	706	3.52
	MPM	3541	1634	2.17
sim(1000, 24)	Naive	-	-	-
	IAPM	-	-	-
	CWPM	2742	1609	1.70
	MPM	3402	4644	0.73
sim(1000, 12)	Naive	-	-	-
	IAPM	-	-	-
	CWPM	1875	3158	0.59
	MPM	-	-	-

Table 10. Table 10: MultiESS breakdown for each parameter block. The 𝜽 𝜽 \bm{\theta} column shows the multiESS for all parameters.

Data	Method	$𝜽$	$(γ, σ, ρ)$	$(μ_{X 0}, σ_{X 0})$	$(μ_{β}, σ_{β})$
Real	CWPM	1431.4	515.3	4520.5	1279.3
	MPM	1718.9	716.4	4313.5	1588.2
sim(10, 24)	CWPM	3063.8	1197.6	8576.3	3452.5
	MPM	3370.7	1821.9	9114.9	2530.0
sim(10, 12)	CWPM	3027.8	1397.9	8710.3	2931.9
	MPM	4503.1	3347.2	8913.9	3113.9
sim(10, 1)	CWPM	3197.3	1921.2	6959	2955.3
	MPM	5606.9	7666.7	7502	2416.9
sim(100, 24)	CWPM	3011.5	1026.0	8777.7	3199.0
	MPM	3663.1	1451.8	10037.0	3302.3
sim(100, 12)	CWPM	2484.9	653.0	10660.0	3160.1
	MPM	3540.7	1566.7	9439.2	2793.2
sim(1000, 24)	CWPM	2741.6	648.7	7014.0	3649.6
	MPM	3402.4	1142.8	9989.3	3266.7
sim(1000, 12)	CWPM	1875.1	340.93	9180.6	2657.2
	MPM	-	-	-	-

Table 11. Table 11: Acceptance rates for the IAPM method based on a chain of length 100,000.

Data	$A R (𝜽)$
Real	0.05
sim(10, 24)	0.09
sim(10, 12)	0.11
sim(10, 1)	-
sim(100, 24)	0.10
sim(100, 12)	0.11
sim(1000, 24)	-
sim(1000, 12)	-

Table 12. Table 12: Acceptance rates for the CWPM and MPM methods based on a chain of length 100,000.

Data	Method	$A R (γ, σ, ρ)$	$A R (μ_{X 0}, σ_{X 0}, μ_{β}, σ_{β})$
Real	CWPM	0.08	0.63
	MPM	0.21	0.63
sim(10, 24)	CWPM	0.12	0.60
	MPM	0.16	0.61
sim(10, 12)	CWPM	0.12	0.56
	MPM	0.18	0.57
sim(10, 1)	CWPM	0.15	0.57
	MPM	0.27	0.58
sim(100, 24)	CWPM	0.13	0.63
	MPM	0.20	0.63
sim(100, 12)	CWPM	0.06	0.61
	MPM	0.11	0.61
sim(1000, 24)	CWPM	0.08	0.66
	MPM	0.15	0.66
sim(1000, 12)	CWPM	0.01	0.64
	MPM	-	-

Equations140

P (x_{t}, y_{t} ∣ x_{0 : t - 1}, y_{0 : t - 1}, θ)

P (x_{t}, y_{t} ∣ x_{0 : t - 1}, y_{0 : t - 1}, θ)

= P (y_{t} ∣ x_{t}, x_{t - 1}, y_{t - 1}, θ) P (x_{t} ∣ x_{t - 1}, y_{t - 1}, θ) .

P (y_{t} ∣ x_{t}, x_{t - 1}, y_{t - 1}, θ) = g (y_{t} ∣ x_{t}, θ), P (x_{t} ∣ x_{t - 1}, y_{t - 1}, θ) = f (x_{t} ∣ x_{t - 1}, θ),

P (y_{t} ∣ x_{t}, x_{t - 1}, y_{t - 1}, θ) = g (y_{t} ∣ x_{t}, θ), P (x_{t} ∣ x_{t - 1}, y_{t - 1}, θ) = f (x_{t} ∣ x_{t - 1}, θ),

P (x_{0 : T - 1}, θ ∣ y_{0 : T - 1}) \propto P (y_{0 : T - 1} ∣ x_{0 : T - 1}, θ) P (x_{0 : T - 1} ∣ θ) π (θ),

P (x_{0 : T - 1}, θ ∣ y_{0 : T - 1}) \propto P (y_{0 : T - 1} ∣ x_{0 : T - 1}, θ) P (x_{0 : T - 1} ∣ θ) π (θ),

P (y_{0 : T - 1} ∣ x_{0 : T - 1}, θ) =

P (y_{0 : T - 1} ∣ x_{0 : T - 1}, θ) =

P (x_{0 : T - 1} ∣ θ) = μ (x_{0} ∣ θ)

P (θ ∣ y_{0 : T - 1}) \propto π (θ) P (y_{0 : T - 1} ∣ θ),

P (θ ∣ y_{0 : T - 1}) \propto π (θ) P (y_{0 : T - 1} ∣ θ),

P (y_{0 : T - 1} ∣ θ) = \int_{X} P (y_{0 : T - 1} ∣ x_{0 : T - 1}, θ) P (x_{0 : T - 1} ∣ θ) d x_{0 : T - 1},

P (y_{0 : T - 1} ∣ θ) = \int_{X} P (y_{0 : T - 1} ∣ x_{0 : T - 1}, θ) P (x_{0 : T - 1} ∣ θ) d x_{0 : T - 1},

d X_{t} = μ (X_{t}, ϕ_{X}, t) d t + v (X_{t}, ϕ_{X}, t) d B_{t}, X_{0} = X_{0} (ϕ_{X})

d X_{t} = μ (X_{t}, ϕ_{X}, t) d t + v (X_{t}, ϕ_{X}, t) d B_{t}, X_{0} = X_{0} (ϕ_{X})

d X_{m, t} = μ (X_{m, t}, ϕ_{X}, η_{m}) d t + v (X_{m, t}, ϕ_{X}, η_{m}) d B_{m, t}, X_{m 0} = X_{m 0} (ϕ_{X}, η_{m}) .

d X_{m, t} = μ (X_{m, t}, ϕ_{X}, η_{m}) d t + v (X_{m, t}, ϕ_{X}, η_{m}) d B_{m, t}, X_{m 0} = X_{m 0} (ϕ_{X}, η_{m}) .

y_{m, t} ∣ x_{m, t}, σ^{2} \sim N (y_{m, t}; x_{m, t}, σ^{2}) .

y_{m, t} ∣ x_{m, t}, σ^{2} \sim N (y_{m, t}; x_{m, t}, σ^{2}) .

P (θ, η_{1 : M}, x_{1 : M} ∣ y_{1 : M}) \propto P (θ) m = 1 \prod M P (y_{m} ∣ x_{m}, θ) P (x_{m} ∣ θ, η_{m}) P (η_{m} ∣ θ) .

P (θ, η_{1 : M}, x_{1 : M} ∣ y_{1 : M}) \propto P (θ) m = 1 \prod M P (y_{m} ∣ x_{m}, θ) P (x_{m} ∣ θ, η_{m}) P (η_{m} ∣ θ) .

X_{m, t} = β_{m} d t + γ d B_{t}, X_{m, 0} = x_{0},

X_{m, t} = β_{m} d t + γ d B_{t}, X_{m, 0} = x_{0},

lo g (β_{m}) \sim N (lo g (β_{m}); μ_{β}, σ_{β}^{2}),

f (x_{m, t} ∣ x_{m, t - 1}, β_{m}, γ, x_{0}) = N (x_{m, t}; x_{m, t - 1} + β_{m}, γ^{2}) .

f (x_{m, t} ∣ x_{m, t - 1}, β_{m}, γ, x_{0}) = N (x_{m, t}; x_{m, t - 1} + β_{m}, γ^{2}) .

⎩ ⎨ ⎧ g (y_{m, t} ∣ x_{m, t}, θ) = N (y_{m, t}; x_{m, t}, σ^{2}) f (x_{m, t} ∣ x_{m, t - 1}, η_{m}, θ) = N (x_{m, t}; x_{m, t - 1} + β_{m}, γ^{2}) P (η_{m} ∣ θ) = N (η_{m}; μ_{β}, σ_{β}^{2})

⎩ ⎨ ⎧ g (y_{m, t} ∣ x_{m, t}, θ) = N (y_{m, t}; x_{m, t}, σ^{2}) f (x_{m, t} ∣ x_{m, t - 1}, η_{m}, θ) = N (x_{m, t}; x_{m, t - 1} + β_{m}, γ^{2}) P (η_{m} ∣ θ) = N (η_{m}; μ_{β}, σ_{β}^{2})

d X_{t} = μ (X_{t}, ϕ_{X}, η) d t + σ (X_{t}, ϕ_{X}, η) d B_{t}, X_{0} = X_{0} (ϕ_{X}, η) .

d X_{t} = μ (X_{t}, ϕ_{X}, η) d t + σ (X_{t}, ϕ_{X}, η) d B_{t}, X_{0} = X_{0} (ϕ_{X}, η) .

0 = τ_{0} < τ_{1} < \dots < τ_{k} < τ_{k + 1} < \dots < τ_{D} = J, Δ τ = \frac{J}{D} .

0 = τ_{0} < τ_{1} < \dots < τ_{k} < τ_{k + 1} < \dots < τ_{D} = J, Δ τ = \frac{J}{D} .

μ (X_{τ_{k}}, ϕ_{X}, η)

μ (X_{τ_{k}}, ϕ_{X}, η)

v (X_{τ_{k}}, ϕ_{X}, η)

X_{τ_{k + 1}}

X_{τ_{k + 1}}

Δ B_{τ_{k}}

x_{τ_{k + 1}} ∣ x_{τ_{k}} \sim N (x_{τ_{k + 1}}; x_{τ_{k}} + μ_{k} Δ τ, v_{k} Δ τ) .

x_{τ_{k + 1}} ∣ x_{τ_{k}} \sim N (x_{τ_{k + 1}}; x_{τ_{k}} + μ_{k} Δ τ, v_{k} Δ τ) .

q (x_{τ_{1} : J} ∣ x_{0}, ϕ_{X}, η) \propto k = 0 \prod D - 1 N (x_{τ_{k + 1}}; x_{τ_{k}} + μ_{k} Δ τ, v_{k} Δ τ);

q (x_{τ_{1} : J} ∣ x_{0}, ϕ_{X}, η) \propto k = 0 \prod D - 1 N (x_{τ_{k + 1}}; x_{τ_{k}} + μ_{k} Δ τ, v_{k} Δ τ);

x_{m, τ_{k + 1}} ∣ x_{m, τ_{k}}

x_{m, τ_{k + 1}} ∣ x_{m, τ_{k}}

x_{m, τ_{k + 1}} ∣ x_{m, τ_{k}}

x_{τ_{k + 1}} ∣ x_{τ_{k}}, y_{J} \sim N {x_{τ_{k + 1}}; x_{τ_{k}} + μ_{MDB} (x_{τ_{k}}, y_{J}) Δ τ, Ψ_{MDB} (x_{τ_{k}}) Δ τ}

x_{τ_{k + 1}} ∣ x_{τ_{k}}, y_{J} \sim N {x_{τ_{k + 1}}; x_{τ_{k}} + μ_{MDB} (x_{τ_{k}}, y_{J}) Δ τ, Ψ_{MDB} (x_{τ_{k}}) Δ τ}

μ_{MDB} (x_{τ_{k}}, y_{J})

μ_{MDB} (x_{τ_{k}}, y_{J})

Ψ_{MDB} (x_{τ_{k}})

X_{t}

X_{t}

d ζ_{t}

d R_{t}

X_{t}

X_{t}

d η_{t}

d R_{t}

x_{τ_{k + 1}} ∣ x_{τ_{k}}, y_{J} \sim N (x_{τ_{k + 1}}; x_{τ_{k}} + μ_{RB} (x_{τ_{k}}, y_{J}) Δ τ, Ψ_{RB} (x_{τ_{k}, y_{J}}) Δ τ),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

imkebotha/particle-mcmc-sdemem
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Particle Methods for Stochastic Differential Equation Mixed Effects Models

Imke Botha

Robert Kohn

Christopher Drovandi

Abstract

Parameter inference for stochastic differential equation mixed effects models (SDEMEMs) is a challenging problem. Analytical solutions for these models are rarely available, which means that the likelihood is also intractable. In this case, exact inference is possible using the pseudo-marginal method, where the intractable likelihood is replaced by its nonnegative unbiased estimate. A useful application of this idea is particle MCMC, which uses a particle filter estimate of the likelihood. While the exact posterior is targeted by these methods, a naive implementation for SDEMEMs can be highly inefficient. We develop three extensions to the naive approach which exploits specific aspects of SDEMEMs and other advances such as correlated pseudo-marginal methods. We compare these methods on real and simulated data from a tumour xenography study on mice.

Keywords— Bayesian inference, Hierarchical models, MCMC, Particle Gibbs, Pseudo-marginal, Random effects

1 Introduction

Stochastic differential equations (SDEs) may be defined as ordinary differential equations (ODEs) with one or more stochastic components. SDEs allow for random variations around the mean dynamics specified by the ODE. These models can be used to capture inherent randomness in the system of interest. For repeated measures data, random effects can be used to account for between-subject variability. Assuming measurement error leads to a state-space SDE mixed effects model (SDEMEM).

SDEMEMs are emerging as a useful class of models for biomedical and pharmacokinetic/pharmacodynamic data (Donnet et al.,, 2010; Donnet and Samson, 2013a, ; Leander et al.,, 2015). They have also been applied in psychology (Oravecz et al.,, 2011) and spatio-temporal modelling (Duan et al.,, 2009). Statistical inference for these models is generally difficult however. In most cases, the SDE does not have an explicit or analytical solution (transition density), making the likelihood intractable. Including random effects adds further complexity.

Parameter inference for SDEMEMs has largely focussed on maximum likelihood estimation; e.g. Picchini et al., (2010), Picchini and Ditlevsen, (2011), Delattre et al., (2013) and Donnet and Samson, 2013a ; Donnet and Samson, 2013b . There are few Bayesian inference methods; Donnet et al., (2010) propose a Gibbs sampler coupled with an Euler-Maruyama discretization of the intractable transition density. This approach targets an approximation to the posterior, whose error can be controlled for some models. Whitaker et al., 2017a take a data augmentation approach based on a diffusion bridge, which allows for non-linear dynamics between observed time points. Picchini and Forman, (2019) compare results from a particle MCMC algorithm (Andrieu and Roberts,, 2009; Andrieu et al.,, 2010) and a Bayesian synthetic likelihood approach (Wood,, 2010; Price et al.,, 2018). They apply both methods to an SDE with known solution, and suggest an Euler-Maruyama approximation if the solution is unavailable.

It is unlikely however that any one approach to estimating SDEMEMs will be optimal for all applications. Performance will depend on the complexity of the underlying SDE, the number of parameters, the number of observations for each subject, as well as the complexity of the random effects. It has been our experience that methods that work well on simple examples can often fail badly on more complex ones. This motivates our article to focus on significant extensions to the pseudo-marginal approach of Picchini and Forman, (2019) for SDEMEMs. Pseudo-marginal methods can overcome some limitations of data augmentation approaches because they integrate out the latent states (Stramer and Bognar,, 2011; Gunawan et al., 2018b, ). This is especially useful when there is substantial correlation between the latent variables and the parameters of interest. Our article develops a suite of new and efficient Bayesian methods for SDEMEMs by taking advantage of advances in particle methods that can exploit specific aspects of SDEMEMs. As a by-product, we compare the performance of a collection of pseudo-marginal methods for our models of interest. The results of this comparison will be of interest to the wider computational Bayesian community. We compare these methods on a model adapted from one used by Picchini and Forman, (2019) to model the growth of tumour volumes in mice.

The rest of the paper is organised as follows. Sections 2 and 3 provide the necessary background on state space models, stochastic differential equations, particle filters and particle MCMC methods. Section 4 proposes three potential particle methods for SDEMEMs. Sections 5-7 compares these methods with the approach of Picchini and Forman, (2019) on simulated and real data from a tumor xenography study on mice. Section 8 concludes with a discussion of the results and possible future work. Code for our methods is available at https://github.com/imkebotha/particle-mcmc-sdemem.

2 Stochastic Differential Equation Mixed Effects Models

We denote random variables by capital letters and their realisations by lowercase letters; $\mathbb{N}$ is the set of positive integers. We use $\sim$ to denote both the distribution and density of a random variable, with the meaning made clear through its context.

2.1 State Space Models

State space models (SSMs) consist of two processes: a Markov process $\{X_{t}\}_{t\geq 0}\subset\mathcal{X}^{\mathbb{N}}$ , where $X_{t}$ is usually only partially observed and is often viewed as a latent process, and an observed process $\{Y_{t}\}_{t\geq 0}\subset\mathcal{Y}^{\mathbb{N}}$ . The $\mathcal{X}$ and $\mathcal{Y}$ spaces are usually subsets of Euclidean space $\mathbb{R}$ and are often $\mathbb{R}$ itself. To obtain a SSM, we assume that $\{(x_{t},y_{t});t\geq 0\}$ is Markov with model parameters $\bm{\theta}$ , so that

[TABLE]

We simplify further and assume that

[TABLE]

where $g(y_{t}\mid x_{t},\bm{\theta})$ is the observation density and $f(x_{t}\mid x_{t-1},\bm{\theta})$ the transition density. Since $\bm{\theta}$ is unknown, it is assigned a prior $\pi(\bm{\theta})$ . The unnormalized posterior density of the latent states and model parameters is

[TABLE]

where

[TABLE]

To obtain parameter inference for $\bm{\theta}$ , we need to consider the marginal posterior,

[TABLE]

with likelihood

[TABLE]

especially if high correlation exists between $\bm{\theta}$ and $\bm{x}_{0:T-1}$ . However, this integral is usually intractable. For some models, inference may also be complicated by an intractable transition density, e.g. the SDEs in Section 2.2. While approximate methods can be used in this case, exact inference is still feasible if it is possible to simulate from the transition density (see Section 3.2).

2.2 Stochastic Differential Equation Mixed Effects Models

It is possible to construct a stochastic differential equation (SDE) from an ordinary differential equation by adding noise or replacing one of the terms in the model by a stochastic process. For simplicity, we decribe a one-dimensional SDE, but it is straightforward to extend the methods introduced in Section 4 to higher dimensions. Given an Itô process $\{X_{t}\}_{t\geq 0}$ (Øksendal,, 2013), the general form for a 1-dimensional continuous SDE is

[TABLE]

where $\mu(\cdot)$ is the drift, $\sqrt{v}(\cdot)$ is the diffusion, $\bm{\phi_{X}}$ are the fixed model parameters for the SDE and $\{B_{t}\}_{t\geq 0}$ is a standard Brownian motion process. This model can be extended by allowing some of the parameters to vary between the $m=1,\ldots,M$ individuals. In this case we have $\{X_{m,t}\}_{t\geq 0}$ for $m=1,\ldots,M$ instead of $X_{t}$ . In this more general setting, let $\bm{\phi_{X}}$ be the vector of fixed common parameters of the SDE, and $\bm{\eta}_{m}$ the vector of subject specific parameters (random effects), where $\bm{\eta}_{m}\sim P(\bm{\phi_{\eta}})$ . Then, the SDEMEM is given by,

[TABLE]

The solution to (3) gives the transition density of the states. If an analytical solution for the transition density is unavailable, numerical methods can be used; some of these are discussed in Section 2.3.

Equation (3) leads a state-space model as defined in Section 2.1. Let $y_{m,t}\in\{Y_{m,t}\}$ denote a noisy observation for individual $m,m=1,\ldots,M$ at time $\xi_{m,t},t=0,\ldots,T_{m}-1$ , where $T_{m}$ is the number of observations for individual $m$ . To simplify notation, we assume that observations are taken at the same time points for all individuals, i.e. $\xi_{t},t=0,\ldots,T_{m}-1$ , but this restriction is not required for our methods. We assume that the observations are given by

[TABLE]

Let $\bm{\theta}=(\sigma,\bm{\phi_{X}},\bm{\phi_{\eta}})$ be the vector of all unkown parameters in the model, $\bm{y}_{m}=\bm{y}_{m,0:T_{m}-1}$ and $\bm{x}_{m}=\bm{x}_{m,0:T_{m}-1}$ . The posterior of $\bm{\theta},\bm{\eta}_{1:M}$ can be expressed as

[TABLE]

We will use the following running example throughout the paper to illustrate some of the concepts and methods.

Example: SDEMEM with constant drift and diffusion. Consider the SDEMEM

[TABLE]

with random effects $\eta_{m}=\log(\beta_{m})$ , unknown static model parameters $\bm{\phi_{X}}=\{\gamma,x_{0}\}$ and random effect hyperparameters $\bm{\phi_{\eta}}=\{\mu_{\beta},\sigma_{\beta}\}$ . The exact transition density of this model can be obtained by solving (5),

[TABLE]

If a Gaussian observation density is assumed, the full model is given by

[TABLE]

where $\bm{\theta}=\{\sigma,\gamma,x_{0},\mu_{\beta},\sigma_{\beta}\}$ .

2.3 SDE Simulation

Consider the SDEMEM for a single individual

[TABLE]

If the SDE cannot be solved analytically, then it is necessary to use approximate methods. This section describes two common approaches for approximate simulation of SDEs.

2.3.1 Euler-Maruyama

The Euler-Maruyama discretization (EMD) is the simplest method for simulating approximately from an SDE. Given a process $\{X_{t}\}_{t\geq 0}$ , the time interval $[0,J]$ is split into $D$ subintervals,

[TABLE]

Assuming that the drift and diffusion coefficients are locally constant,

[TABLE]

the EMD simulates over each subinterval as follows

[TABLE]

Since $\Delta B_{\tau_{k}}\sim\mathcal{N}(\Delta B_{\tau_{k}};0,\Delta\tau)$ by definition, the path is simulated through a recursive application of

[TABLE]

Thus, the joint density of this approximation is

[TABLE]

we note that for an SDE with constant drift and diffusion, the EMD gives the exact solution.

Example: SDEMEM with constant drift and diffusion. For the SDEMEM in Equation (5), $\mu_{k}=\beta_{m}$ and $v_{k}=\gamma^{2}$ . The EMD of this model is

[TABLE]

which in this case is the exact transition density.

2.3.2 Diffusion Bridges

Simulating from the (approximate) transition density may not perform well in pseudo-marginal methods if any particular observations are highly informative or there is little observation noise. More efficient estimates of the likelihood of SSMs can be achieved if the proposal for $x_{t}$ can be directed towards $y_{t}$ . One option to do this is to use a diffusion bridge.

The modified diffusion bridge (MDB) of Durham and Gallant, (2002) (see also Golightly and Wilkinson,, 2008) is derived by approximating the joint distribution of $X_{\tau_{k+1}},Y_{J}\mid x_{\tau_{k}}$ using multivariate normal theory, and then conditioning on $Y_{J}=y_{J}$ . The density $X_{\tau_{k+1}},Y_{J}\mid x_{\tau_{k}}$ is obtained from the observation density (4) and the EMD of $X_{\tau_{k+1}}\mid x_{\tau_{k}}$ . See Appendix 1 of Golightly and Wilkinson, (2008) for a more detailed derivation. The MDB gives a bridge proposal of the form

[TABLE]

where $\Delta_{k}=J-\tau_{k}$ ,

[TABLE]

Whitaker et al., 2017b notes that the modified diffusion bridge can perform poorly when the drift coefficient is not approximately constant. To overcome this problem, they propose to partition the SDE into a deterministic process and a residual stochastic process, such that the latter has constant drift. Rewriting the model in terms of these processes gives

[TABLE]

The idea is to choose $\zeta_{t}$ and $f(\cdot)$ such that the drift of (7) is approximately constant. The simplest solution (Whitaker et al., 2017b, ) is to set $\zeta_{t}=\eta_{t}$ and $f(\cdot)=\mu(\cdot)$ as

[TABLE]

noting that $Y_{J}-\eta_{J}=R_{J}+\epsilon_{J}$ . The residual bridge is obtained by constructing the MDB on the residual process $\{R_{t}\}$ . This gives a bridge proposal of the form

[TABLE]

where

[TABLE]

3 Particle MCMC

3.1 Particle Filters

Exact state estimation of SSMs using the Kalman filter is only possible when they are Gaussian or conditionally Gaussian. In the case of non-linear, non-Gaussian SSMs, a particle filter can be used for simulation consistent estimation (Gordon et al.,, 1993; Carpenter et al.,, 1999; Doucet et al.,, 2000; Del Moral et al.,, 2006; Doucet and Johansen,, 2009).

Particle filters are used to traverse through a sequence of intermediary distributions towards some target distribution. We describe the generic particle filter of Doucet and Johansen, (2009) (see Algorithm 1), with filtering distribution of the form

[TABLE]

A combination of move, reweight and resample steps are used to transition through this sequence. The move step generates values for $x_{t}$ from some proposal distribution $q(x_{t}\mid y_{t},x_{t-1},\bm{\theta})$ . Once moved, the $N$ particles are re-weighted according to,

[TABLE]

Particles are then resampled with probability $\bm{W}_{t}^{1:N}$ for the next iteration. This is done to avoid particle impoverishment, where most of the weight is given to few particles. There are several resampling methods that can be used, including multinomial, stratified (Kitagawa,, 1996), and more recently, Srinivasan (Gerber et al.,, 2019).

An attractive feature of the particle filter is that an unbiased estimate of the likelihood may be obtained from the unnormalized weights,

[TABLE]

The bootstrap particle filter (BPF) of Gordon et al., (1993) is a special case of the PF with $q(x_{t}\mid y_{t},x_{t-1},\bm{\theta})=f(x_{t}\mid x_{t-1},\bm{\theta})$ . The calculation of the weights then simplifies to $w_{t}^{n}=W_{t-1}^{n}g(y_{t}\mid x_{t},\bm{\theta})$ .

3.2 Pseudo-Marginal MCMC

The pseudo-marginal approach of Andrieu and Roberts, (2009) allows for exact inference for models with intractable likelihoods. In this approach, the intractable likelihood $P(\bm{y}_{1:T}\mid\bm{\theta})$ is replaced with a nonnegative unbiased estimate of the form $\widehat{P}(\bm{y}_{1:T}\mid\bm{\theta})$ , which we write as $P(\bm{y}_{1:T}\mid\bm{\theta},\bm{u})$ , where $\bm{u}\sim P(\bm{u})$ are the auxiliary variables used to estimate the likelihood. Since this estimate is unbiased, $\mathbb{E}_{P(\bm{u})}(P(\bm{y}_{1:T}\mid\bm{\theta},\bm{u}))=P(\bm{y}_{1:T}\mid\bm{\theta})$ . Pseudo-marginal MCMC can therefore be defined as standard MCMC on an augmented space, i.e. the space of $\bm{\theta}$ augmented with $\bm{u}$ . The chain targets $P(\bm{\theta},\bm{u}\mid\bm{y}_{1:T})$ which has the posterior $P(\bm{\theta}\mid\bm{y}_{1:T})$ as marginal distribution, as

[TABLE]

The next sections describes the particle marginal Metropolis-Hastings (PMMH) and particle Gibbs (PG) algorithms proposed by Andrieu et al., (2010).

3.2.1 Particle Marginal Metropolis-Hastings

The PMMH method is a Metropolis-Hastings algorithm where the intractable likelihood is replaced by its unbiased estimate (see Section 4 for its use in our particle filter application). As discussed in Section 3.2, the resulting chain targets the joint density $P(\bm{\theta},\bm{u}\mid\bm{y}_{1:T})$ , where $\bm{u}$ is the vector of random numbers used in the particle filter. Unbiasedness implies the posterior of interest $P(\bm{\theta}\mid\bm{y}_{1:T})$ is obtained through marginalisation.

A drawback of the PMMH algorithm is that it can be difficult to find good proposals. Another drawback is the chain’s tendency to get stuck whenever the likelihood is greatly overestimated for a particular value of $\bm{\theta}$ , i.e. if $\widehat{P}(\bm{y}_{1:T}\mid\bm{\theta})$ is greatly overestimated, then the acceptance probability for $\bm{\theta}^{*}$ will be very small unless $\widehat{P}(\bm{y}_{1:T}\mid\bm{\theta^{*}})$ is also overestimated. This can be mitigated by decreasing the variance of the log of the ratio of the likelihood estimates

[TABLE]

A common strategy to do this is to increase the number of particles used in the particle filter. Sherlock et al., (2015), Pitt et al., (2012) and Doucet et al., (2015) showed that optimal performance (for random walk proposals) is gained when $N$ is chosen such that the standard deviation of the estimated log-likelihood is between $1$ and $2$ . An alternative approach is the correlated pseudo-marginal (CPM) method of Deligiannidis et al., (2018) (see also Dahlin et al., (2015)). Tran et al., (2016) introduced a variation of the CPM method called the block pseudo-marginal (BPM) approach. The BPM has a natural application to SDEMEMs, which is discussed further in Section 3.2.2.

3.2.2 Correlated Pseudo-Marginal

Recall that the chain targets the density $p(\bm{\theta},\bm{u}|\bm{y}_{1:T})$ . At iterations $i$ and $i+1$ , the estimates returned are proportional to $\widehat{P}(\bm{\theta}^{(i)},\bm{u}^{(i)}\mid\bm{y}_{1:T})$ and $\widehat{P}(\bm{\theta}^{(i+1)},\bm{u}^{(i+1)}\mid\bm{y}_{1:T})$ . Deligiannidis et al., (2018) show that the mixing of the chain can be improved by correlating $\widehat{P}(\bm{\theta}^{(i)},\bm{u}^{(i)}\mid\bm{y}_{1:T})$ and $\widehat{P}(\bm{\theta}^{(i+1)},\bm{u}^{(i+1)}\mid\bm{y}_{1:T})$ . This helps to vastly reduce the variance of (9), without having to reduce the variance of the individual log-likelihood estimates.

The CPM approach correlates these estimates by making $\bm{u}^{(i)}$ and $\bm{u}^{(i+1)}$ highly correlated. Assuming the random numbers are normally distributed, Deligiannidis et al., (2018) use the Crank-Nicolson (CN) proposal to induce the correlation

[TABLE]

If the particle filter depends on non-normal random numbers, transformations to normality are applied.

In BPM, correlation is induced by updating $\bm{u}$ in blocks (Tran et al.,, 2016). In this approach, the vector of random numbers $\bm{u}$ is divided into $B$ blocks, and a single block is updated at each iteration while the remaining $B-1$ are held constant. The resulting correlation between the logs of the likelihood estimates is approximately $1-1/B$ and is induced much more directly than CPM. No assumption about the form or distribution of $\bm{u}$ is required. This approach has a natural application to SDEMEMs, as the blocks can be defined using the subjects, i.e. each block contains all random numbers needed to estimate the likelihood for one or more subjects.

Relative to standard PMMH, both CPM and BPM are able to tolerate significantly more variance in the log-likelihood estimates, such that less particles are needed for the chain to mix well. The increase in computational efficiency gained from this typically outweighs the overhead associated with storing the vector of random numbers $\bm{u}$ .

The number of particles $N$ needed for CPM and BPM can be tuned using the log-likelihood ratio (9) (Deligiannidis et al.,, 2018). To minimize the distance between successive log-likelihood estimates, the number of particles $N$ may be chosen such that the variance of (9) is around 1.

3.2.3 Conditional Particle Filter

The particle Gibbs (PG) algorithm of Andrieu et al., (2010), requires a variation of the generic PF (Section 3.1) called the conditional particle filter (CPF). The CPF differs from the generic PF by holding a single path $\bm{x}_{1:T}^{k}$ invariant throughout the iterations. See Algorithm 3 for more details.

Once a weighted sample is obtained, a new invariant path may be drawn using the backwards sampling method of Whiteley, (2010) and Lindsten and Schön, (2012); see Algorithm 4.

3.2.4 Particle Gibbs

In PMMH, the particle filter returns an estimate of the likelihood (2). In particle Gibbs, the latent states are updated using a conditional particle filter, i.e. $\bm{x}_{1:T}$ is approximately sampled from $p(\bm{x^{*}}_{1:T}\mid\bm{y}_{1:T},\bm{x}_{1:T},\bm{\theta})$ (see Algorithms 3 and 5). The parameters $\bm{\theta}$ may be updated using Gibbs sampling if the full conditional posterior is available, or a Metropolis-Hastings step if it is not.

Since a new path $\bm{x}_{1:T}$ is simulated at each iteration, PG does not suffer from the same mixing problem as PMMH. As such, it is significantly less sensitive to the number of particles used. PG also has the advantage that more efficient updating schemes for $\bm{\theta}$ can be used, such as MALA or HMC. While this method has a number of advantages over PMMH, it is not as generally applicable as a closed form transition density is required to update $\bm{\theta}$ .

4 Methods

We are interested in parameter inference for the state-space SDEMEM described in Section 2.2. For a single individual $m$ , with observations taken at $\xi_{t},t=0,\ldots,T_{m}-1$ and level of discretisation $D$ , the sequence of distributions (8) traversed by the particle filter (see Section 3.1) is

[TABLE]

This particle filter returns an estimate of $P(\bm{y}_{m}\mid\bm{\eta}_{m},\sigma,\bm{\phi_{X}})$ . The estimated likehood for all the data $\bm{y}_{1:M}$ is given by

[TABLE]

4.1 Individual-Augmentation Pseudo-Marginal

The first method is Individual-Augmentation Pseudo-Marginal (IAPM), named for the additional auxiliary variables required to estimate the likelihood for each individual. Here, we use the likelihood estimate,

[TABLE]

with importance distribution $g(\bm{\eta}_{m}\mid\bm{\theta})$ within a PMMH algorithm (Algorithm 2). See Algorithms 6 and 7 for more details.

The variability of $\widehat{P}(\bm{y}_{m}\mid\bm{\theta})$ for a given $g(\bm{\eta}_{m}\mid\bm{\theta})$ is controlled by the number of particles $N$ , as well as the number of random effects draws $L$ . The choice of importance distribution $g(\cdot\mid\bm{\theta})$ has an important impact on both of these quantities. A naive choice is $g(\bm{\eta}_{m}\mid\bm{\theta})=P(\bm{\eta}_{m}\mid\bm{\theta})$ . While this simplifies the likelihood calculation, it can be very inefficient if $\widehat{P}(\bm{\eta}_{m}\mid\bm{y}_{m},\bm{\theta})$ and $P(\bm{\eta}_{m}\mid\bm{\theta})$ are not similar. We propose instead to use a Laplace approximation of a distribution over $\bm{\eta}_{m}$ that is proportional to

[TABLE]

where $\bm{{\widehat{x}}}_{m}$ is an approximation of $\bm{x}_{m}$ . We present two choices for $\bm{\widehat{x}}_{m}$ . The first uses the solution of the ODE given by the drift of the SDEMEM (3),

[TABLE]

The second approximates $\bm{x}_{m}$ with the mean of the modified diffusion bridge (see Section 2.3.2), with $\Delta_{k}=\Delta_{t}=\tau_{t+1}-\tau_{t}$ , such that

[TABLE]

We refer to these importance distributions as Laplace-ODE and Laplace-MDB respectively.

As a variance reduction technique, randomised quasi-Monte Carlo (RQMC) can be used to draw $\bm{\eta}_{m}^{(l)}$ (step 2 of Algorithm 7). See L’Ecuyer, (2016) for an overview of RQMC.

A correlated version of IAPM (cIAPM) is possible using block pseudo-marginal as described in Section 3.2.2. Here, the vector of random numbers is given by $\bm{u}=(\bm{u_{\mathrm{RE}}},\bm{u_{\mathrm{PF}}})$ , where $\bm{u_{\mathrm{RE}}}$ and $\bm{u_{\mathrm{PF}}}$ are the random numbers used to draw the random effects and those used in the particle filter respectively. At each iteration of the chain, new random numbers for individual $m,1\leq m\leq M$ are proposed, while the rest are held constant. This induces a correlation of approximately $1-\frac{1}{M}$ between successive log-likelihood estimates (Tran et al.,, 2016). Since BPM makes no assumptions about the distribution of $\bm{u}$ , RQMC is straightforward to use within cIAPM.

Example: SDEMEM with constant drift and diffusion. For the SDEMEM in Equation (5), the IAPM approximation of $P(\bm{y}_{m}\mid\bm{\theta})$ with importance distribution $g(\bm{\eta}_{m}\mid\bm{\theta})$ is given by

[TABLE]

where $\widehat{P}(\bm{y}_{m}\mid\bm{\beta}_{m}^{(l)},\sigma,\gamma)$ is the PF estimate of $P(\bm{y}_{m}\mid\bm{\beta}_{m}^{(l)},\sigma,\gamma)$ .

4.2 Component-Wise Pseudo-Marginal

This section defines a component-wise pseudo-marginal (CWPM) method, where the random effects $\bm{{\eta}}_{1:M}$ are updated along with $\bm{\theta}$ . This leads naturally to the following parameter blocks $\bm{\eta}_{1:M},\{\sigma,\bm{\phi_{X}}\}$ and $\bm{\phi_{\eta}}$ . If we denote $\bm{\theta_{X}}=\{\sigma,\bm{\phi_{X}}\}$ , then the joint posterior is of the form

[TABLE]

and the full conditional posteriors for each of the parameter blocks are

[TABLE]

A particle filter estimate of $P(\bm{y}_{1:M}\mid\bm{{\eta}}_{1:M},\bm{\theta_{X}})$ is used when updating $\bm{\eta}_{1:M}$ and $\bm{\theta_{X}}$ (Algorithm 1). The parameter $\bm{\phi_{\eta}}$ can be sampled directly since (10) is tractable. This method is generally faster than IAPM as the particle filter is called $2\times M$ times per MCMC iteration (with the above configuration), instead of $L\times M$ times as in IAPM. If there is a high correlation between $\bm{{\eta}}_{1:M}$ and $\bm{\theta}$ , however, the CWPM chain may mix poorly.

A correlated version of CWPM (cCWPM) may be implemented using BPM. Again, only the random numbers for a single individual are updated at each iteration while the rest are held constant.

Example: SDEMEM with constant drift and diffusion. For the SDEMEM in Equation (5), the parameters are updated in the following blocks, $\bm{\eta}_{m}=\{\beta_{m}\}$ , $\bm{\theta_{X}}=\{\sigma,\gamma,x_{0}\}$ and $\bm{\phi_{\eta}}=\{\mu_{\beta},\sigma_{\beta}\}$ .

4.3 Mixed Particle Method

Our final method is a variation of the PMMH $+$ PG algorithm of Gunawan et al., 2018a . We use a combination of PMMH and PG to update the parameters $\bm{{\eta}}_{1:M},\sigma,\bm{\phi_{X}}$ and $\bm{\phi_{\eta}}$ , depending on the form of the full conditional distributions,

[TABLE]

At each iteration, the invariant path $\bm{x}_{1:M}$ is updated using a conditional particle filter (Algorithm 3). Where the density $P(\bm{y}_{1:M}\mid\bm{{\eta}}_{1:M},\sigma,\bm{\phi_{X}})$ is required, i.e. (11) and (12), a particle filter estimate is used (PMMH step). Since the full conditionals for $\sigma$ and $\bm{\phi_{\eta}}$ are tractable, these parameters can be sampled directly. It is important that the likelihood estimate is updated once a new value of $\sigma$ is accepted. This must be done with the same $\bm{u}$ that was used to estimate the previous likelihood. As with CWPM (Section 4.2), mixing of the Markov chain can be poor if high correlation exists between $\bm{{\eta}}_{1:M}$ and $\bm{\theta}$ and/or $\bm{x}_{1:M}$ and $\sigma$ .

Similarly to IAPM and CWPM, a correlated version of MPM (cMPM) can be implemented using BPM, where $\bm{u}$ is divided into $M$ blocks based on the individuals $m=1,\ldots,M$ .

Example: SDEMEM with constant drift and diffusion. For the SDEMEM in Equation (5), the parameters are updated in the following blocks, $\bm{\eta}_{m}=\{\beta_{m}\}$ , $\bm{\phi_{X}}=\{\gamma,x_{0}\}$ , $\bm{\phi_{\eta}}=\{\mu_{\beta},\sigma_{\beta}\}$ and $\sigma$ .

4.4 Likelihood Estimation

We have introduced three particle MCMC methods for SDEMEMs: IAPM, CWPM and MPM. Each of these methods relies on a particle filter to calculate an unbiased estimate of the intractable likelihood. Tuning parameters for this calculation include the level of discretization ( $D$ ), the number of particles ( $N$ ) and, for IAPM, the number of random effects draws ( $L$ ). We use the log-likelihood ratio $R$ (9) as described in Section 3.2.2 to tune $D,N$ and $L$ . We denote the standard deviation of $|R|$ as $\sigma_{\Delta}$ and aim for $\sigma_{\Delta}\leq 1.05$ .

It is also necessary to specify a proposal function for the particle filter and an importance density for IAPM. Section 2.3 describes three different ways to simulate from an SDE: the Euler-Maruyama discretization (EMD), the modified diffusion bridge (MDB) and the residual bridge (RB). Any of these can be used to move particles within a particle filter. Section 4.1 also proposes the Laplace-ODE and Laplace-MDB importance densities for IAPM. The optimal choice of the proposal function and the importance density is problem specific and may have a large impact on the efficiency of the likelihood estimate.

5 Example

5.1 Data

We apply our methods to real data from a tumour xenography study on mice. This data was obtained from Picchini and Forman, (2019). The study had 4 treatment groups and 1 control group, and each group had 7-8 mice. Measurements were taken every Monday, Wednesday and Friday for six weeks; however the majority of the mice were euthanized before the end of the study, once their tumour volumes exceeded $1000$ cubic mm.

We focus specifically on group 5 (the control group). There are 7 mice in this group, with 2-14 observations per mouse and 34 observations in total. Only one mouse in this group survived longer than 11 days, being euthanized on day 32 of the study. Figure 1 plots this data.

5.2 Model

To fit the data, we consider an adaptation of an SDEMEM that was used by Picchini and Forman, (2019) for unperturbed growth. It is assumed that there are $m=1,\ldots,M$ subjects, with measurements taken at discrete times $\xi_{t},t=1,\ldots,T_{m}$ , where $T_{m}$ is the number of observations for subject $m$ . The model is defined as,

[TABLE]

where $V_{m,t}$ is the volume of subject $m$ at time $t$ . The random effects for this model are the parameters $\beta_{m}$ and $V_{m0}$ , which are assigned the prior distributions

[TABLE]

The observations are modelled as

[TABLE]

Since the data is observed on the log scale, the transformation $X_{m,t}=\log(V_{m,t})$ can be applied to (13) and (14) using Itô’s lemma. The full model is then given by

[TABLE]

The likelihood is intractable since model (15) does not have a closed form solution for $X_{m,t}$ . The following priors were assigned to the static parameters $\theta=(\mu_{X0},\sigma_{X0},\mu_{\beta},\sigma_{\beta},\gamma,\sigma,\rho)^{T}$

[TABLE]

where $\mathcal{HN}(\sigma)$ refers to the half-normal distribution with mean zero and scale parameter $\sigma$ .

Note that taking $\rho=1$ gives model (6). This was the original SDEMEM used by Picchini and Forman, (2019). We add the parameter $\rho$ which allows for both a more flexible variance and renders the transition density intractable. We test this model on the dataset introduced in Section 5.1. To ensure numerical stability when simulating from the SDE, we scaled the observation times by the maximum time observed. In addition to the real data, we also apply our methods to synthetic data simulated from model (15) using $\bm{\theta}=(\mu_{X0},\sigma_{X0},\mu_{\beta},\sigma_{\beta},\gamma,\sigma,\rho)^{T}=(3,1,-1,1,1,0.5,1)^{T}$ .

For the synthetic data, we assumed 1000 mice with 457 observations each - this corresponds to a measurement every hour for 19 days following the initial measurement. We used 9 subsets of this dataset with all combinations of 10, 100 and 1000 subjects and an observation every 24 hours (20 observations), 12 hours (39 observations) and 1 hour (457 observations). We refer to these datasets as sim( $M,H$ ), where $M$ is the number of subjects (10, 100, or 1000) and $H$ is the number of hours between observations (24, 12 or 1). For example, the subset of 100 subjects with an observation every 12 hours is denoted sim(100, 12), while the full dataset is denoted sim(1000, 1). When $M$ is left blank, we refer to all datasets with the specified value of $H$ and vice versa, e.g. sim(, 1) represents sim(10, 1), sim(100, 1) and sim(1000, 1). The performance of our methods on these datasets gives an indication of their scalability with respect to the density of the time series and number of subjects. Figure 2 plots this data.

6 Likelihood Estimation Results

All code was implemented in MATLAB. Vectorisation and parallelisation were applied where possible, e.g. we used vectorised code for the particle operations and parallelised over the subjects in the particle filter. For IAPM we also parallelised over the random effects draws when running the importance sampler. Our results were calculated using 8 cores. Note that parallelisation was only applied in the particle filter when the average number of observations per subject was greater than 10. We used adaptive resampling in the particle filter when estimating the likelhood. Resampling was done at every iteration in the conditional particle filter.

We first consider the efficiency of the likelihood estimation. For each of the three methods, we tested all possible combinations of proposal function and importance density (IAPM). We define the naive method or combination as the IAPM algorithm with the prior as importance density and the Euler-Maruyama approximation as the proposal function in the particle filter.

As outlined in Section 4.4, we set the tuning parameters such that $\sigma_{\Delta}\leq 1.05$ . Measurements were calculated from a minimum of $1000$ log-likelihood estimates at a fixed value of $\bm{\theta}$ and $\bm{\eta}_{1:M}$ (CWPM). For the real data, we used $\bm{\theta}=(4,1,2,1,1.6,0.05,1)$ , which was obtained from a few preliminary MCMC runs (low values of $N,L$ and $D$ were sufficient for this). For the simulated data, we used the true value $\bm{\theta}=(3,1,-1,1,1,0.5,1)$ . The random effects $\bm{\eta}_{1:M}$ were determined similarly, using preliminary runs for the real data and the true values for the synthetic data.

We define the level of discretization ( $D$ ) as the number of intermediate timepoints between each observation. We found that the results are not particularly sensitive to this value, so we fixed $D$ at 10 for all methods. Computation was stopped if the computation time for a single log-likelihood estimate exceeded 15 minutes or required more than 150gb of RAM.

In this section, we use the notation ‘importance density + proposal function’ to refer to a particular combination of the two, e.g. prior + RB. All combinations were tuned to roughly the same statistical efficiency (based on $\sigma_{\Delta}$ ), so the most efficient method was taken as the one with the lowest computation time. Further mention of statistical efficiency refers to the value of the tuning parameters $N$ and $L$ .

6.1 IAPM

To tune the IAPM method, we made the simplifying assumption that $N=L$ . Tuning was done through trial and error. Of the three methods, IAPM was the most difficult and time-consuming to tune. Assuming $N=L$ simplified the tuning process, but it is not ideal. Depending on the implementation of the code, having a larger/smaller $N$ or $L$ may have a significant impact on the computation time.

Once we started testing combinations, we found that the variance of the Laplace-ODE importance density tends to [math] for at least one of the random effects, such that the draws for that random effect were close to equal. We solved this by setting the covariance to a diagonal matrix of the prior variances scaled by $0.5$ . We denote this altered importance density as L-ODE.

Tables 1-4 summarize the log-likelihood results for all datasets. Dashed lines indicate that computation time exceeded the time limit specified in Section 6. This limit was exceeded for all prior and L-ODE combinations on the sim(1000, ) datasets. For the correlated versions of these, we found that the value of $\sigma_{\Delta}$ had a very high variance. All versions of IAPM exceeded the time limit on dataset sim(1000, 1).

For the synthetic data, the Laplace-MDB importance density outperformed the prior and L-ODE in terms of overall efficiency. Of the latter, the L-ODE showed the poorest performance. Results for the uncorrelated versions are only available for sim(10, 24) and sim(10, 12) and these were also the only datasets with L-ODE combinations that outperformed the prior. Based on these results, the ODE may not a good approximation of the underlying states. A large diffusion coefficient and/or measurement error could account for this.

The most efficient proposal function depended on the size of the dataset. In terms of statistical efficiency, the MDB and RB have nearly identical results across all datasets, and generally outperforms the EMD. The RB takes slightly longer to run than the MDB however, and both are slower than the EMD. While this had little effect on the smallest datasets, the time difference was significant on the larger ones. The EMD approximation gave the best results on the sim(, 1) datasets.

Correlating the log-likelihoods generally increased the statistical efficiency. On the larger datasets, this increase is significant, as is the corresponding reduction in computation time. Interestingly, for all sim(10, ) datasets, the uncorrelated Laplace-MDB + EMD was more statistically efficient than the correlated version. The same was also true for Laplace-MDB + MDB and RB on sim(10, 1).

For the real data, the best results were given by the Laplace-MDB in combination with the MDB or RB. A large gain in statistical efficiency was observed relative to the naive combination, i.e. prior + EMD. In the uncorrelated case, the tuning parameters reduced from $L=N=200$ to $L=N=4$ , and in the correlated from $L=N=90$ to $L=N=3$ . A 5.5-fold decrease in time was observed from the uncorrelated naive to the best method.

6.2 CWPM

For CWPM, it was only necessary to select a proposal function and find a value for $N$ . Again, this was done through experimentation. Tables 5-8 show results for all datasets. Dashed lines indicate that the memory limit specified in Section 6 was exceeded.

For the synthetic datasets, the correlated version had the best results across all proposal functions. The number of particles needed for the standard version grew quickly with the size of the dataset. Also, since the correlation induced is approximately $1-1/M$ , the correlated version showed greater improvement as the number of subjects increased.

As with IAPM, the most efficient proposal function depends on the size of the dataset. The best results are given by the MDB/RB, MDB and EMD for the sim(, 24), sim(, 12) and sim(, 1) datasets respectively. For the sim( , 1) datasets, any benefit in statistical efficiency from the bridges was outweighed by the increase in computation time.

For the real data, we found that a single particle was sufficient to obtain $\sigma_{\Delta}\leq 1.05$ when using MDB or RB.

6.3 MPM

This method uses the same log-likelihood estimate as CWPM, so no extra tuning was required. When $N>1$ , we use the same number of particles for the conditional particle filter as for the standard. When $N=1$ , as is the case for the real data (see Section 6.2), we add an extra particle to account for the invariant path.

7 MCMC Results

We used the time per log-likelihood estimate from Section 6 to determine which methods to run, i.e. $\leq$ 2 seconds for IAPM, $\leq$ 1 second for CWPM and $\leq$ 0.5 second for MPM. Each of these was run for 100,000 iterations starting at the same values of $\bm{\theta}$ that was used in Section 6. The best proposal function and importance density (for IAPM) from Section 6 was used. Where the MDB and RB proposal functions gave similar results, MDB was the preferred choice. Due to the time constraints, the naive method (uncorrelated IAPM with prior + EMD) was only run on the real and sim(10, 24) datasets. None of the methods were run on the sim(100,1) or sim(1000, 1) datasets.

We used random walk proposals for the parameters which could not be updated directly, i.e. those updated with a PMMH step. In CWPM and MPM, we used the pre-conditioned Metropolis-adjusted Langevin algorithm (MALA) to update the random effects hyperparameters $\{\mu_{X0},\sigma_{X0},\mu_{\beta},\sigma_{\beta}\}$ , and in MPM, we used a slice sampler to update $\sigma$ . For the proposals, we needed to tune the random walk covariance (also used as the MALA pre-conditioning matrix), and the stepsize for MALA. This was done through experimentation. For CWPM and MPM, we found it was easier to tune the variances for the random effects after a good covariance matrix had been found for $\bm{\theta}$ .

We compare the methods based on the multivariate effective sample size (multiESS) (Vats et al.,, 2015) of $\bm{\theta}$ and the computation time in minutes. A score for each method is calculated as the approximate rate of independent samples per minute ( $\mathrm{\frac{multiESS}{time}}$ ). Table 9 shows the score for each method. Table 10 shows the breakdown of the multiESS for each update block. Tables 11-12 (see Appendix A) show the acceptance rates (AR) for the three methods on all datasets and Figure 3 shows the marginal posteriors of $\bm{\theta}$ for all datasets. As expected, the marginal posteriors become more precise as the size of the dataset grows (via more subjects and/or more densely observed time series).

A large increase in multiESS was observed between IAPM, and CWPM and MPM on all datasets. This is partly due to the $\bm{X_{0}}$ hyperparameters. It is clear from Table 10 that the multiESS for $\{\mu_{X0},\sigma_{X0}\}$ is always larger than the multiESS for any of the other parameter blocks. Based on this, a more efficient algorithm for this particular example might be to use IAPM for $\bm{\theta}$ and $\bm{\beta}$ and CWPM for $\bm{X_{0}}$ . The other reason for the increase in multiESS is the more efficient proposals used for $\bm{\phi_{\eta}}$ and $\sigma$ (in MPM). For both the real and synthetic data, MPM gave the highest multiESS, followed by CWPM.

Across all datasets, the largest score was given by CWPM. This is due both to the higher multiESS compared to IAPM, and the relatively short computation time. In general, CWPM ran much faster than the other two methods. The exception to this was on the sim(10, 24) dataset, where IAPM had the fastest run time. We also note, that MPM sometimes took longer to run than IAPM on the smaller datasets. The reason for this is how parallelisation was applied. As noted before, parallelisation was only implemented within the particle filter if the average number of observations per subject was greater than 10, i.e. only on the sim(,12) and sim(,1) datasets. For IAPM however, the importance sampler was always parallelised. As a result, if $L$ is small enough, e.g. less than the number of available cores, then IAPM would not necessarily take longer to run than CWPM or MPM. This also depends on the number of particles needed for the latter two methods.

8 Discussion

We introduced three methods for simulation consistent parameter inference of state-space SDEMEMs and outlined some strategies for improving the efficiency of the likelihood estimate for these methods through the choice of importance density and proposal function. The efficiency of the calculation can also be increased by correlating successive log-likelihood estimates.

The recent paper by Wiqvist et al., (July 23, 2019) independently introduced a method for SDEMEMs that is very similar to our CWPM method. They propose the same update blocks for the parameters as in CWPM and give three variations of this approach, namely naive Gibbs, blocked Gibbs and a correlated PMMH method. In the first, the random numbers $u$ are updated whenever the likelihood is estimated. In blocked Gibbs, $u$ is updated with the random effects but kept fixed for the other parameter blocks. Lastly, their correlated PMMH method uses the approach of Deligiannidis et al., (2018) to correlate the likelihoods, i.e. by correlating the random numbers (see Section 3.2.2).

Our approach differs in that we use the block pseudo-marginal (BPM) method of Tran et al., (2016). In the context of mixed effects models, BPM has a number of advantages over CPM. It is simple to implement, induces correlation more directly, and makes no assumptions about the underlying distribution of the random numbers, i.e. no transformations to normality are required and it is straighforward to use with RQMC. Also, an efficient implementation only requires the random seed to be stored, which can greatly reduce the computational storage requirements. A drawback of BPM however, is that the correlation is limited by the number of subjects. If there are few subjects, then CPM may be more effective at inducing correlation. Another option might be to combine BPM with CPM, i.e. correlating the auxiliary variables in the current block, while keeping the rest fixed. The feasibility of this approach is an area of future research.

To further improve efficiency, we exploit bridge proposals in the particle filter rather than proposing directly from the (approximate) transition density as in the standard bootstrap filter used by Wiqvist et al., (2019). By including the IAPM and MPM methods, our paper provides a more comprehensive suite of particle methods for application to general state-space SDEMEMs. Wiqvist et al., (2019) allow the number of particles to vary between individuals, which is also straightforward to implement in our methods.

With IAPM, CWPM and MPM, we were able to greatly improve upon the efficiency of the naive method, particularly in computational efficiency. For the majority of the simulated datasets, the naive approach is not computationally feasible at all. The statistical efficiency of a given method depends on the correlation between the model parameters, random effects and/or latent states. These methods are flexible in the sense that they can be tailored to a specific model and used in combination, e.g. by integrating over a subset of the random effects using IAPM, but updating the rest using CWPM or MPM steps. Note that if IAPM is combined with MPM, then the invariant path from the conditional PF may be used for $\bm{{\widehat{x}}}_{m}$ in the importance sampler. For our particular example, CWPM gave the best results. In general, this method had the shortest computation time and was the easiest to tune; however as noted before, care must be taken if high correlation exists between the random effects and model parameters. The best method to use in any particular situation greatly depends on the model and data.

A significant drawback of all these methods is the amount of tuning required. For all methods (including the naive), there are at least two tuning parameters required for the likelihood estimation. We also do not have a standard way to select the importance density and proposal function, as well as guidelines to indicate whether a correlated pseudo-marginal approach should be used. The last depends on the values of $L$ and $N$ , which in turn depend on the efficiency of the method and the dimension of the data. All methods require tuning the MCMC proposal densities. In order to reduce the tuning burden, we plan to embed these methods into a sequential Monte Carlo sampler (Del Moral et al.,, 2006) in future research.

It may be possible to choose the importance density based on the proposal function, i.e. EMD + Laplace-ODE (or L-ODE) and MDB/RB + Laplace-MDB. Recall also that the Laplace-ODE approximates the underlying states using the ODE specified by the drift of the SDEMEM. The feasibility of this importance density then relies on how quickly the solution of the ODE can be computed. Exploration of the model could potentially indicate a sensible choice of proposal function and level of discretization $D$ . For our example (see Section 5), the MDB proposal function generally gives the best results compared to the EMD approximation and RB construct. There are a number of different bridge constructs that can be used however; see Whitaker et al., 2017b for an overview. The guided proposals of Schauer et al., (2013) (see also van der Meulen and Schauer, (2017)) are also an option.

Lastly, zero-variance control variates (Mira et al.,, 2013; Friel et al.,, 2016; South et al.,, 2019) may be used to further reduce the variance of any expectation estimated from the chains, e.g. the expectation of the target with respect to the auxiliary variables. Efficiency of the methods may also be increased through non-centered parameterisations of the random effects $\bm{\eta}_{1:M}$ (Papaspiliopoulos et al.,, 2007).

9 Acknowledgments

We would like to thank Umberto Picchini and the research team at the Centre for Nanomedicine and Theranostics (DTU Nanotech, Denmark) for providing the real data and Andrew Golightly for useful feedback on an earlier draft of this paper. IB was supported by an Australian Reseach Training Program Stipend and an ACEMS Top-Up Scholarship. IB would also like to thank ACEMS for funding a trip to visit RK at UNSW where some of this research took place. CD was supported by an Australian Research Council’s Discovery Early Career Researcher Award funding scheme (DE160100741). The work by RK was partially supported by an ARC Center of Excellence grant (CE140100049). We gratefully acknowledge the computational resources provided by QUT’s High Performance Computing and Research Support Group (HPC).

Appendix A - Acceptance Rates

Bibliography46

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Andrieu et al., (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 72(3):269–342.
2Andrieu and Roberts, (2009) Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics , 37(2):697–725.
3Carpenter et al., (1999) Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proceedings - Radar, Sonar and Navigation , 146(1):2–7.
4Dahlin et al., (2015) Dahlin, J., Lindsten, F., Kronander, J., and Schön, T. B. (2015). Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables. ar Xiv preprint ar Xiv:1511.05483 .
5Del Moral et al., (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 68(3):411–436.
6Delattre et al., (2013) Delattre, M., Genon-Catalot, V., and Samson, A. (2013). Maximum likelihood estimation for stochastic differential equations with random effects. Scandinavian Journal of Statistics , 40(2):322–343.
7Deligiannidis et al., (2018) Deligiannidis, G., Doucet, A., and Pitt, M. K. (2018). The correlated pseudomarginal method. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 80(5):839–870.
8Donnet et al., (2010) Donnet, S., Foulley, J.-L., and Samson, A. (2010). Bayesian analysis of growth curves using mixed models defined by stochastic differential equations. Biometrics , 66(3):733–741.