Inference for Differential Equation Models using Relaxation via Dynamical Systems

Kyoungjae Lee; Jaeyong Lee; Sarat C. Dass

arXiv:1705.04436·stat.ME·May 15, 2017·Comput. Stat. Data Anal.

TL;DR

This paper introduces a fast Bayesian inference framework for ODE-based models by relaxing the ODE system with numerical methods like Runge-Kutta and Gaussian noise, enabling efficient parameter estimation.

Contribution

It proposes a novel, computationally efficient Bayesian approach for parameter inference in ODE models using numerical relaxation and provides theoretical convergence guarantees.

Findings

01 In the simulation study, the method is 14 to 78 times faster than the other methods it is compared with

02 Theoretical convergence of the posterior is established

03 Explicit relations between the order and mesh size of the numerical method and the convergence rate

Abstract

Statistical regression models whose mean functions are represented by ordinary differential equations (ODEs) can be used to describe phenomena that are dynamical in nature, which are abundant in areas such as biology, climatology and genetics. The estimation of parameters of ODE based models is essential for understanding their dynamics, but the lack of an analytical solution of the ODE makes the parameter estimation challenging. The aim of this paper is to propose a general and fast framework of statistical inference for ODE based models by relaxation of the underlying ODE system. Relaxation is achieved by a properly chosen numerical procedure, such as the Runge-Kutta method, and by introducing additive Gaussian noise with small variance. Consequently, filtering methods can be applied to obtain the posterior distribution of the parameters in the Bayesian framework. The main advantage of the proposed…

Tables

Table 1: Means of the absolute biases, standard deviations and root mean squared errors (rmse) of $\hat{\theta}$ in the FitzHugh-Nagumo model. Results are shown for the relaxed DEM with ELW filter (RDEM), the parameter cascading (PC) method, the Laplace approximated procedure (LAP) and the delayed rejection adaptive Metropolis (DRAM) algorithm.

		RDEM	PC	LAP	DRAM
Absolute bias	$θ_{1}$	0.051	0.024	0.024	0.024
	$θ_{2}$	0.135	0.106	0.099	0.100
	$θ_{3}$	0.108	0.039	0.044	0.047
Standard deviation	$θ_{1}$	0.063	0.027	0.027	0.028
	$θ_{2}$	0.130	0.123	0.117	0.119
	$θ_{3}$	0.194	0.060	0.056	0.059
rmse	$θ_{1}$	0.084	0.038	0.038	0.040
	$θ_{2}$	0.198	0.171	0.161	0.164
	$θ_{3}$	0.233	0.076	0.075	0.079

Table 2: Posterior summary statistics for the parameters of the Lotka-Volterra equation for the lynx-hare data with $m=2$ and $u^{2}=10$.

	Mean	Median	90% credible interval
$θ_{1}$	0.526	0.525	(0.491, 0.562)
$θ_{2}$	0.026	0.026	(0.024, 0.027)
$θ_{3}$	0.986	0.985	(0.906, 1.067)
$θ_{4}$	0.028	0.028	(0.026, 0.030)
$σ^{2}$	4.087	3.818	(2.018, 7.065)

Equations

$$\dot{x}(t)=f(x,u,t;\theta),$$

$$y_{i}=x(t_{i})+\epsilon_{i},\quad i=1,\dots,n,$$

$$\begin{aligned}
y_{i}&=x_{i}+\epsilon_{i},\quad i=1,\dots,n,\\
\dot{x}(t)&=f(x,u,t;\theta)
\end{aligned}$$

$x_{i+1}$, $k_{i1}$, $k_{i2}$, $k_{i3}$, $k_{i4}$

$$\begin{aligned}
y_{i}&=x_{i}+\epsilon_{i},\quad i=1,\dots,n,\\
x_{i+1}&=g(x_{i},t_{i};\theta),\quad i=0,\dots,n-1.
\end{aligned}$$

$$\begin{aligned}
y_{i}&=\tilde{x}_{i}+\epsilon_{i},\quad i=1,\dots,n,\\
\tilde{x}_{i+1}&=g(\tilde{x}_{i},t_{i};\theta)+\eta_{i},\quad i=0,\dots,n-1
\end{aligned}$$

$$x_{0}\mid\lambda\sim N_{p}(\mu_{x_{0}},c\lambda^{-1}I_{p})\quad\mbox{and}\quad\lambda\sim\text{Gamma}(a_{\lambda},b_{\lambda}),$$

$y_{i}$, $x_{i}$, $\theta_{i}$

$$\pi(\gamma\mid y_{1:i},x_{0:i},\theta)=\text{Gamma}\left(a_{\lambda}+\frac{(i+1)p}{2},\;b_{\lambda}+\frac{1}{2}\left(\frac{\|x_{0}-\mu_{x_{0}}\|^{2}}{c}+\sum_{k=1}^{i}\|y_{k}-x_{k}\|^{2}\right)\right)$$

$$p(x_{i}\mid x_{i-1},y_{i},\theta_{i},\gamma)=N\left(\frac{y_{i}/\sigma^{2}+g(x_{i-1},t_{i},\theta_{i})/u^{2}}{1/\sigma^{2}+1/u^{2}},\;\frac{1}{1/\sigma^{2}+1/u^{2}}I_{p}\right),$$

$$p(y_{i}\mid x_{i-1},\theta_{i},\gamma)=N\left(g(x_{i-1},t_{i},\theta_{i}),\;(\sigma^{2}+u^{2})I_{p}\right).$$

$y_{i+1}$, $x_{i+1}$, $\theta_{i+1}$

$$[y_{i+1},x_{i+1},\theta_{i+1}\mid x_{i},\theta_{i},y_{1:i},\gamma]=[y_{i+1}\mid x_{i+1},\gamma]\cdot[x_{i+1}\mid x_{i},\theta_{i+1}]\cdot[\theta_{i+1}\mid\theta_{i},y_{1:i}]$$

$$[y_{i+1},x_{i+1},\theta_{i+1}\mid x_{i},\theta_{i},y_{1:i},\gamma]=[x_{i+1}\mid x_{i},\theta_{i+1},y_{i+1},\gamma]\cdot[y_{i+1}\mid x_{i},\theta_{i+1},\gamma]\cdot[\theta_{i+1}\mid\theta_{i},y_{1:i}].$$

$$q(x_{0},\theta,\gamma)\equiv\pi(x_{0})\times\pi(\theta,\gamma\mid\mathbf{y}_{n}).$$

$$\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n},u^{2})=\frac{\int L(\Lambda)\,\pi(dx_{1},\dots,dx_{n}\mid x_{0},\theta,u^{2})\,\pi(x_{0},\theta,\lambda)}{\int\int L(\Lambda)\,\pi(dx_{1},\dots,dx_{n}\mid x_{0},\theta,u^{2})\,\pi(dx_{0},d\theta,d\lambda)}$$

$$\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n})=\frac{L^{*}(x_{0},\theta,\lambda)\,\pi(x_{0},\theta,\lambda)}{\int L^{*}(x_{0},\theta,\lambda)\,\pi(dx_{0},d\theta,d\lambda)}$$

$L(\Lambda)$, $L^{*}(x_{0},\theta,\lambda)$

$$\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n},u^{2})\to\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n})$$

$$\|f(x,t;\theta)-f(x^{\prime},t;\theta)\|<K\|x-x^{\prime}\|$$

$$\pi_{m}(x_{0},\theta,\lambda\mid\mathbf{y}_{n})\to\pi_{true}(x_{0},\theta,\lambda\mid\mathbf{y}_{n})$$

$$\pi_{m}(x_{0},\theta,\lambda\mid\mathbf{y}_{n})=\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n})\times(1+O(n^{-R}))$$

$\dot{x}(t)$

$$x(t)=\theta_{2}-(\theta_{2}-x_{0})e^{\theta_{1}t}$$

$x_{0}\mid\lambda$

Full text

Inference for Differential Equation Models using Relaxation via Dynamical Systems

Kyoungjae Lee

Department of Applied and Computational Mathematics and Statistics, The University of Notre Dame

Jaeyong Lee

Department of Statistics, Seoul National University

Sarat Dass

Department of Fundamental and Applied Sciences, Universiti Teknologi PETRONAS

Abstract

Statistical regression models whose mean functions are represented by ordinary differential equations (ODEs) can be used to describe phenomena that are dynamical in nature, which are abundant in areas such as biology, climatology and genetics. The estimation of parameters of ODE based models is essential for understanding their dynamics, but the lack of an analytical solution of the ODE makes the parameter estimation challenging. The aim of this paper is to propose a general and fast framework of statistical inference for ODE based models by relaxation of the underlying ODE system. Relaxation is achieved by a properly chosen numerical procedure, such as the Runge-Kutta method, and by introducing additive Gaussian noise with small variance. Consequently, filtering methods can be applied to obtain the posterior distribution of the parameters in the Bayesian framework. The main advantage of the proposed method is computational speed. In a simulation study, the proposed method was at least 14 times faster than the other methods. Theoretical results which guarantee the convergence of the posterior of the approximated dynamical system to the posterior of the true model are presented. Explicit expressions are given that relate the order and the mesh size of the Runge-Kutta procedure to the rate of convergence of the approximated posterior as a function of sample size.

Key words: Ordinary differential equation, Dynamic model, Runge-Kutta Method, Extended Liu and West filter

1 Introduction

Many dynamical phenomena in the real world can be represented mathematically by ordinary differential equations (ODEs). Common examples include Newton’s law of cooling, the Lotka-Volterra equations for predator-prey populations (Alligood et al., 1997) and the Lorenz equations for atmospheric convection (Lorenz, 1963). There are many other popular examples describing physical, chemical and biological phenomena using ODEs. Although data observed from ODE systems are common, estimating the parameters of ODE models (ODEMs) can be challenging because the ODE often lacks an analytical solution. Here, we give a brief review of previous work on ODEMs.

There are several frequentist methods in the literature for parameter estimation of ODEMs. Bard (1974) used numerical integration to approximate the solution of ODEs and minimized the objective function based on a gradient method. Varah (1982) suggested a two step estimation method using the cubic spline approximation. The two steps consist of estimation of the regression function and estimation of the parameters in the ODEM. Ramsay and Silverman (2005) modified the first step of Varah by adding the roughness penalty function which measures the difference between the ODE and the mean function. The parameter cascading method was proposed by Ramsay et al. (2007). They grouped the parameters into the regression coefficients, structural parameters, and regularization parameters. The parameters in each group are estimated in turn in a cascading fashion.

Bayesian inference of ODEMs is more challenging because a naive application of Markov chain Monte Carlo (MCMC) methods would require computing the numerical solution of the ODE whenever parameters are sampled from the proposal distribution. Gelman et al. (1996) and Huang et al. (2006) proposed Bayesian computation methods for parameter inference in pharmacokinetic models and a longitudinal HIV dynamic system, respectively. Campbell (2007) combined parallel tempering (Geyer, 1991) with the collocation method (Ramsay et al., 2007) to overcome the rough surface of the posterior, but this slows down the computation significantly. Arnold et al. (2013) used a particle filter framework for the inference of ODEMs, with linear multistep methods for the numerical integration. Dass et al. (2017) suggested Bayesian inference with a Laplace approximation for fast computation when the dimension of $\theta$ is moderate.

In this paper, we propose a Bayesian inference method for ODEMs using a relaxation technique via dynamical systems and associated dynamic models. Relaxation is achieved by a properly chosen numerical procedure, such as the Runge-Kutta method, and by introducing additive Gaussian noise variables with variance tending to zero. The variance of the additive noise variables works as a measure of fidelity to the original ODEM, and by letting it tend to zero we recover the original model. The relaxation makes the inference less statistically efficient, but in return we gain computational speed.

For fast computation, a filtering method is applied to infer the posterior distributions of the parameters in a Bayesian framework. The relaxation technique provides a dynamical system and model to which a fast inference tool based on sequential Monte Carlo can be applied. With these sequential methods, we do not need to calculate the whole path of the numerical solution for each realization of a new parameter value. This reduces the computation time significantly compared to other standard Bayesian procedures and enables us to deal with the ODEM in reasonable computing time. In subsection 5.2, to emphasize its fast computation, the proposed method is compared with three other methods: parameter cascading, the delayed rejection adaptive Metropolis algorithm and Bayesian inference with the Laplace approximation. In the simulation study, the proposed method is 14 to 78 times faster than the other methods.

We also derive convergence results for the approximated posteriors under suitable regularity conditions. We present a guideline for the choice of the model parameters which give a reasonable relative error rate, and provide its theoretical basis. Theoretical results which guarantee the convergence of the posterior of the approximated dynamical system to the posterior of true model are presented. Explicit expressions are given that relate the order and the mesh size of the Runge-Kutta procedure and guarantee the rate of convergence of the approximated posterior to the true posterior.

The rest of the paper is organized as follows. In section 2, we describe a differential equation model and its corresponding relaxed dynamic model counterpart, as well as prior choices. The method of posterior inference is described in section 3. Theoretical support for the proposed method is given in section 4. In section 5, we give two simulated data examples to demonstrate the speed and performance of the proposed method. A real data set, the Lynx-Hare data set, is analyzed in section 6. The discussion is given in section 7. The proofs of the theorems are given in the appendix.

2 Ordinary Differential Equation Models and Nonlinear Dynamic Models

2.1 Ordinary Differential Equation Models (ODEMs)

The ODEM is the regression model with regression function $x(t)$ described by an ODE. The regression function $x(t)$ is the solution of the differential equation

$$\dot{x}(t)=f(x,u,t;\theta), \tag{1}$$

where $f$ is a $p$ -dimensional smooth function, $u(t)$ is a deterministic input function, $\theta\in\Theta\subset\mathbb{R}^{q}$ is the unknown parameter, and $\dot{x}(t)$ denotes the first derivative of $x(t)$ with respect to time $t$ . Since the input function $u(t)$ does not affect the general ideas of inference in this paper, it is not considered subsequently. The data are observed at $n$ points in the time interval $t\in[0,T]\subset\mathbb{R}$ , given by $0\leq t_{1},t_{2},\ldots,t_{n}\leq T$ . Thus,

$$y_{i}=x(t_{i})+\epsilon_{i},\quad i=1,\ldots,n,$$

where $y_{i}$ is a $p$ -dimensional observation vector at time $t_{i}$ , the error $\epsilon_{i}$ is drawn independently from the multivariate normal distribution $N_{p}(0,\sigma^{2}I_{p})$ with unknown $\sigma^{2}>0$ , and $x(t_{i})\equiv x_{i}$ is the underlying regression function measured at time $t_{i}$ .

The regression model is given by

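$$\begin{aligned}
y_{i}&=x_{i}+\epsilon_{i},\quad i=1,\ldots,n,\\
\dot{x}(t)&=f(x,u,t;\theta),
\end{aligned} \tag{2}$$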

where $x_{i}=x(t_{i})$. The covariate $x_{i}$ is determined by the initial value of $x$, $x_{0}=x(0)$, and the parameter $\theta$. In the rest of the paper, we refer to model (2) as the regression model or the true model.
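To make the regression model concrete, the following minimal sketch simulates observations $y_{i}=x(t_{i})+\epsilon_{i}$. The toy ODE $\dot{x}=-\theta x$, the function name `simulate_odem` and all numerical values are assumptions chosen for illustration (this ODE is used only because its solution is available in closed form); they are not taken from the paper.

```python
import math
import random

def simulate_odem(theta, x0, times, sigma, seed=0):
    """Draw y_i = x(t_i) + eps_i for the toy ODE dx/dt = -theta * x.

    The toy ODE is an illustrative assumption; its closed-form solution is
    x(t) = x0 * exp(-theta * t).
    """
    rng = random.Random(seed)
    xs = [x0 * math.exp(-theta * t) for t in times]   # regression function x(t_i)
    ys = [x + rng.gauss(0.0, sigma) for x in xs]      # add N(0, sigma^2) observation error
    return xs, ys

times = [0.5 * i for i in range(1, 11)]               # n = 10 observation times
xs, ys = simulate_odem(theta=0.8, x0=2.0, times=times, sigma=0.1)
```

The pair `(x0, theta)` fully determines `xs`, which mirrors the remark that the covariates $x_{i}$ are determined by the initial value and the parameter.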

In most cases, ODE (1) does not have a closed form solution, so $x(t)$ must be approximated numerically. We will use the Runge-Kutta method, which is a standard numerical method for ODEs. While there are many types of Runge-Kutta methods, we will only consider the 4th order method in this paper. However, our proposed method can easily be extended to other approximation methods for ODEs, as well as to Runge-Kutta methods of other orders. Letting $h_{i+1}=t_{i+1}-t_{i}$, the 4th order Runge-Kutta approximation for (2) is as follows:

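$$x_{i+1}=x_{i}+\frac{h_{i+1}}{6}\left(k_{i1}+2k_{i2}+2k_{i3}+k_{i4}\right)\equiv g(x_{i},t_{i};\theta), \tag{3}$$

where

$$\begin{aligned}
k_{i1}&=f(x_{i},t_{i};\theta),\\
k_{i2}&=f\left(x_{i}+\tfrac{h_{i+1}}{2}k_{i1},\;t_{i}+\tfrac{h_{i+1}}{2};\theta\right),\\
k_{i3}&=f\left(x_{i}+\tfrac{h_{i+1}}{2}k_{i2},\;t_{i}+\tfrac{h_{i+1}}{2};\theta\right),\\
k_{i4}&=f\left(x_{i}+h_{i+1}k_{i3},\;t_{i}+h_{i+1};\theta\right).
\end{aligned}$$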

In the above equation, all $x_{i}$ ’s indicate the approximated values. For more details, see Spijker (1996).
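The one-step map $g$ can be sketched in a few lines. This is a minimal illustration of the classical 4th order Runge-Kutta update for a scalar state; the function name `rk4_step`, the argument order and the test equation $\dot{x}=-\theta x$ are assumptions made for the example.

```python
import math

def rk4_step(f, x, t, h, theta):
    """One classical RK4 step: approximate x(t + h) from x(t)."""
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(x + h * k3, t + h, theta)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def decay(x, t, theta):
    """Right-hand side of the illustrative ODE dx/dt = -theta * x."""
    return -theta * x

x1 = rk4_step(decay, 1.0, 0.0, 0.1, 1.0)   # one step of length h = 0.1 from x(0) = 1
err = abs(x1 - math.exp(-0.1))             # compare with the exact solution exp(-t)
```

For a vector state the same arithmetic applies componentwise, for example with arrays in place of floats.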

With this approximation, we have the following model

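$$\begin{aligned}
y_{i}&=x_{i}+\epsilon_{i},\quad i=1,\ldots,n,\\
x_{i+1}&=g(x_{i},t_{i};\theta),\quad i=0,\ldots,n-1.
\end{aligned} \tag{4}$$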

In the remainder of this paper, we refer to model (4) as a differential equation model (DEM). Sometimes, to obtain a better approximation of $x_{i+1}$, we divide the interval $[t_{i},t_{i+1}]$ into $m$ small subintervals and apply the Runge-Kutta method on each subinterval. In this case, we call the corresponding ODE model the $m$ step ODE model and $m$ the step size (that is, the number of subintervals per observation interval).
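The effect of the step size $m$ can be illustrated by composing $m$ Runge-Kutta sub-steps between two observation times. The sketch below assumes a scalar state and the toy ODE $\dot{x}=-x$; the names `rk4_step` and `g_m` are hypothetical helpers.

```python
import math

def rk4_step(f, x, t, h, theta):
    """One classical RK4 step of length h."""
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(x + h * k3, t + h, theta)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def g_m(f, x, t, h, theta, m):
    """Advance from t to t + h using m RK4 sub-steps of length h / m."""
    sub = h / m
    for j in range(m):
        x = rk4_step(f, x, t + j * sub, sub, theta)
    return x

def decay(x, t, theta):
    """Right-hand side of the illustrative ODE dx/dt = -theta * x."""
    return -theta * x

exact = math.exp(-1.0)
# Absolute error at t = 1 when the unit interval is split into m subintervals.
errors = {m: abs(g_m(decay, 1.0, 0.0, 1.0, 1.0, m) - exact) for m in (1, 2, 4, 8)}
```

Each doubling of `m` reduces the error by well over an order of magnitude on this example, in line with the fourth-order accuracy of the scheme, at the cost of `m` times as many evaluations of `f`.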

2.2 Nonlinear Dynamic Models

In practice, estimating the parameter from DEM can pose a significant computational challenge if the ODE does not have an analytical solution. Dass et al. (2017) marginalized out $x_{0}$ using Laplace approximation and conducted grid sampling to get posterior samples of $\theta$ . Their method is fast and accurate when the dimension of $\theta$ is small; however, the methodology suffers from heavy computations when the dimension of $\theta$ is large. The computation time increases exponentially as the dimension of $\theta$ increases due to the grid sampling. The griddy Gibbs sampler can be used on $\theta$ , but practical problems such as dependencies and slow convergence may arise.

In this paper, in order to make posterior inference on $\theta$ , we adopt a nonlinear dynamic model relaxation of the DEM in (4) given in terms of the model below with unknown initial condition $x_{0}$ :

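$$\begin{aligned}
y_{i}&=\tilde{x}_{i}+\epsilon_{i},\quad i=1,\ldots,n,\\
\tilde{x}_{i+1}&=g(\tilde{x}_{i},t_{i};\theta)+\eta_{i},\quad i=0,\ldots,n-1,
\end{aligned} \tag{5}$$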

where $\epsilon_{i}\overset{iid}{\sim}N(0,\sigma^{2}I_{p})$ and $\eta_{i}\overset{iid}{\sim}N(0,u^{2}I_{p})$ with $\sigma,u>0$. The error term $\eta_{i}$ reflects the fact that the approximation $g(x_{i},t_{i};\theta)$ of $x_{i+1}$ is made with uncertainty. In the remainder of the paper, we refer to model (5) as the approximate dynamic model, obtained as a relaxation of the DEM in (4) via the relaxation parameter $u$. The quantities $\tilde{x}_{i}$ in (5) are not the same as the $x_{i}$ in (4), since the former are subject to the random error $\eta_{i}$ whereas the latter are not. However, note that the two models (4) and (5) become equivalent as the relaxation parameter $u\to 0$.
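A short simulation may help to see how the relaxation parameter $u$ interpolates between the DEM and a noisy state-space model. This is a sketch under stated assumptions: a scalar state, $t_{0}=0$, the toy ODE $\dot{x}=-\theta x$, and the hypothetical helper names `rk4_step` and `simulate_relaxed`.

```python
import random

def rk4_step(f, x, t, h, theta):
    """One classical RK4 step, playing the role of the map g."""
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(x + h * k3, t + h, theta)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def simulate_relaxed(f, x0, times, theta, sigma, u, seed=0):
    """Simulate states and observations from the relaxed model.

    State:       x_{i+1} = g(x_i, t_i; theta) + eta_i,  eta_i ~ N(0, u^2)
    Observation: y_i     = x_i + eps_i,                 eps_i ~ N(0, sigma^2)
    """
    rng = random.Random(seed)
    grid = [0.0] + list(times)          # assumes t_0 = 0 (illustrative choice)
    x = x0
    states, obs = [], []
    for i in range(len(times)):
        h = grid[i + 1] - grid[i]
        x = rk4_step(f, x, grid[i], h, theta) + rng.gauss(0.0, u)   # relaxed state
        states.append(x)
        obs.append(x + rng.gauss(0.0, sigma))                       # noisy observation
    return states, obs

def decay(x, t, theta):
    """Right-hand side of the illustrative ODE dx/dt = -theta * x."""
    return -theta * x

times = [0.5, 1.0, 1.5, 2.0]
relaxed_states, relaxed_obs = simulate_relaxed(decay, 1.0, times, 1.0, sigma=0.1, u=0.05)
exact_states, _ = simulate_relaxed(decay, 1.0, times, 1.0, sigma=0.1, u=0.0)
```

With `u=0.0` the state path coincides with the deterministic Runge-Kutta recursion, which is the sense in which the relaxed model reduces to the DEM.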

In the above model (5), there are four unknown quantities, namely, $x_{0},\theta,\lambda=1/\sigma^{2}$ and $u$ . The Bayesian approach proceeds by considering priors for these quantities. We do not consider a prior for the relaxation parameter $u$ since it is artificially introduced to control the quality of the approximation. We fix $u$ to be a small positive quantity in the subsequent numerical computations. The priors on $x_{0}$ and $\lambda$ are taken as

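$$x_{0}\mid\lambda\sim N_{p}(\mu_{x_{0}},c\lambda^{-1}I_{p})\quad\mbox{and}\quad\lambda\sim\text{Gamma}(a_{\lambda},b_{\lambda}), \tag{6}$$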

where $c>0$ and $\text{Gamma}(a,b)$ represents the Gamma distribution with mean $a/b$ and variance $a/b^{2}$ . The prior for $\theta$ , $\pi(\theta)$ , is taken independently of the rest of the unknown quantities above.
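Because the Gamma distribution here is parametrized by its rate (mean $a/b$), some care is needed when sampling from it with a library that expects a scale. A minimal sketch with Python's standard library follows; the function name `sample_prior` and the hyperparameter values are illustrative assumptions.

```python
import random

def sample_prior(mu_x0, c, a_lam, b_lam, rng):
    """Draw (x0, lam) with lam ~ Gamma(a, rate=b) and x0 | lam ~ N_p(mu_x0, (c / lam) I_p)."""
    # random.gammavariate takes (shape, scale), so the rate b becomes the scale 1 / b.
    lam = rng.gammavariate(a_lam, 1.0 / b_lam)
    sd = (c / lam) ** 0.5
    x0 = [m + rng.gauss(0.0, sd) for m in mu_x0]     # independent coordinates
    return x0, lam

rng = random.Random(1)
draws = [sample_prior([0.0, 0.0], c=1.0, a_lam=2.0, b_lam=4.0, rng=rng)
         for _ in range(20000)]
mean_lam = sum(lam for _, lam in draws) / len(draws)   # should be close to a / b = 0.5
```

Drawing $\lambda$ first and then $x_{0}$ given $\lambda$ follows the hierarchical form of the prior.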

2.3 Sequential Monte Carlo

Sequential Monte Carlo (SMC) is a simulation-based method for estimating the states and the parameters of the nonlinear dynamic model. The basic idea of SMC is to use importance samples to approximate the posterior at each state and to update the samples sequentially through a proper kernel. There exists an extensive literature on SMC which includes sequential importance sampling (Handschin and Mayne, 1969), bootstrap filter (Gordon et al., 1993), auxiliary particle filter (Pitt and Shephard, 1999), Rao-Blackwellised particle filter (Doucet et al., 2000), sequential Monte Carlo sampler (Del Moral et al., 2006), Liu and West filter (Liu and West, 2001), particle learning (Carvalho et al., 2010), multilevel sequential Monte Carlo sampler (Beskos et al., 2016), to name just a few. For an extensive review of SMC, see Doucet et al. (2001), Kantas et al. (2009), Lopes and Tsay (2011) or Särkkä (2013).

SMC has advantages over alternative posterior computation methods such as the Kalman filter, the extended Kalman filter and Markov chain Monte Carlo (MCMC). The Kalman filter and the extended Kalman filter are applicable to the linear dynamic model, while SMC can be applied to the nonlinear dynamic model as well. SMC also has two advantages over MCMC. First, SMC methods are much faster than MCMC methods: whenever a new parameter value is propagated at a stage of SMC, we only compute the next step of the numerical solution. Fast computation is the biggest advantage of our method. Second, SMC can be implemented in an on-line learning scenario. When a new data point is observed, SMC only needs to update one step of the algorithm, while MCMC must rerun the whole algorithm to get new posterior samples. Because of these advantages, we choose SMC for the posterior computation of the nonlinear dynamic model that approximates the ODE model.

3 Posterior Computations for the Approximate Dynamic Model via Sequential Monte Carlo

To obtain inference for $\theta$ based on the approximate dynamic model (5), we use the extended Liu and West (ELW) filter to estimate parameters and states (Rios and Lopes, 2013). We call the proposed method of computation the relaxed DEM with ELW filter (RDEM-ELW), or simply RDEM. The ELW filter uses the idea of the auxiliary particle filter to sample the states, and it divides the parameters into two sets, $\theta$ and $\gamma$, representing parameters without and with a sufficient statistic, respectively. The set denoted by $\theta$ (i.e., without a sufficient statistic) is the same set of parameters denoted by $\theta$ in (5). For the $\theta$-set, the ELW filter adds artificial random errors to the static parameter $\theta$, thus converting it into an evolving quantity and combining it with the other evolving quantities, the states $x_{i}$ (see Liu and West, 2001). Furthermore, in the ELW filter, the marginal posterior of $\theta$ at each time point is approximated by a finite mixture of normal distributions. The mean and variance of the evolution distribution are determined so that the mixture of normals does not increase the posterior variance. For the posterior update of the $\gamma$-set of parameters, the idea of Storvik (2002) and Fearnhead (2002) is used. For the ELW idea to be applied successfully, the posterior of $\gamma$, $p(\gamma\mid y_{1:i},x_{0:i},\theta),\,\,i=1,\ldots,n$, needs to be tractable, that is, a distribution from which samples can be drawn directly. In particular, we assume $p(\gamma\mid y_{1:i},x_{0:i},\theta)$ depends on a sufficient statistic $s_{i}=s_{i}(y_{1:i},x_{0:i},\theta)$.

Incorporating the evolution of $\theta$ into (5) according to the ELW methodology creates a further relaxation of the former model. The ELW model for the approximate dynamical model in (5) is given by

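$$y_{i}=x_{i}+\epsilon_{i}, \tag{7}$$

$$x_{i}=g(x_{i-1},t_{i},\theta_{i})+\eta_{i}, \tag{8}$$

$$\theta_{i}\mid\theta_{i-1},y_{1:i-1}\sim N_{q}\left(a\,\theta_{i-1}+(1-a)\,\bar{\theta}_{i-1},\;\tilde{h}^{2}V_{i}\right), \tag{9}$$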

for $i=1,2,\cdots,n$ with $\theta_{0}\sim\pi_{\theta}$ and $x_{0}$ distributed according to its prior specification in (6). In (8), $g$ is as defined in (3), and $u$ is a small fixed positive real number representing the relaxation parameter. In (9), $\bar{\theta}_{i-1}$ represents the posterior mean of $\theta$ given $y_{1:i-1}$ at time $i-1$ , $a=(1-\tilde{h}^{2})^{1/2}$ where $\tilde{h}^{2}=1-((3\delta-1)/(2\delta))^{2}$ , $\delta$ is a discounting factor usually taken to be a high value such as $0.95$ or $0.99$ , and $V_{i}$ is the covariance matrix corresponding to the evolution equation of $\theta_{i}$ . Equation (9) is the further relaxation and evolution model for $\theta$ prescribed by the ELW methodology (see Liu and West, 2001). The selection of the parameters $a$ and $\tilde{h}$ guarantees that the posterior variance of $\theta_{i}$ remains stable (i.e., does not increase) with the progression of the time index $i$ .
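The claim that this choice of $a$ and $\tilde{h}$ keeps the variance of the $\theta$ particles stable can be checked numerically. The sketch below assumes the usual Liu and West (2001) kernel, in which each particle is shrunk toward the particle mean by the factor $a$ and then perturbed by Gaussian noise with variance $\tilde{h}^{2}V$, and it takes $\theta$ to be scalar; the function name `liu_west_move` is hypothetical.

```python
import random
import statistics

def liu_west_move(thetas, a, rng):
    """Shrink each particle toward the particle mean, then add N(0, h2 * V) jitter."""
    h2 = 1.0 - a * a                               # so that a^2 + h2 = 1
    mean = statistics.fmean(thetas)
    var = statistics.variance(thetas)              # sample variance V of the particles
    sd = (h2 * var) ** 0.5
    return [a * th + (1.0 - a) * mean + rng.gauss(0.0, sd) for th in thetas]

rng = random.Random(0)
particles = [rng.gauss(1.0, 0.5) for _ in range(50000)]
moved = liu_west_move(particles, a=0.95, rng=rng)

var_before = statistics.variance(particles)
var_after = statistics.variance(moved)             # approximately a^2 V + h2 V = V
```

Without the shrinkage toward the mean (that is, with `a = 1` in the mean but the jitter retained), every update would inflate the variance by the jitter variance, which is the over-dispersion that the kernel is designed to avoid.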

Several posterior distributions will be needed for the subsequent discussion and we derive their forms here. Consider $\gamma=\lambda=\sigma^{-2}$ , the inverse of the variance of observation error. ELW methodology requires the distribution $p(\gamma\,|\,y_{1:i},x_{0:i},\theta)$ be tractable and easily sampled from. In our case, the posterior distribution for $\gamma$ , conditional on observations $y_{1:i}$ , states $x_{0:i}$ and $\theta$ , is given by

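$$\pi(\gamma\mid y_{1:i},x_{0:i},\theta)=\text{Gamma}\left(a_{\lambda}+\frac{(i+1)p}{2},\;b_{\lambda}+\frac{1}{2}\left(\frac{\|x_{0}-\mu_{x_{0}}\|^{2}}{c}+\sum_{k=1}^{i}\|y_{k}-x_{k}\|^{2}\right)\right), \tag{10}$$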

which is a tractable distribution. Note also from the above equation that the distribution of $\gamma$ depends on $y_{1:i}$ and $x_{0:i}$ through the sufficient statistic $s_{i}=s_{i}(y_{1:i},x_{0:i},\theta)=(a_{\lambda}+(i+1)p/2,b_{\lambda}+(\|x_{0}-\mu_{x_{0}}\|^{2}/c+\sum_{k=1}^{i}\|y_{k}-x_{k}\|^{2})/2)$ , where $a_{\lambda},b_{\lambda},c$ and $\mu_{x_{0}}$ are all fixed and known hyperparameters (see (6)). Next, the two distributions, that is (i) the conditional distribution of $x_{i}$ given $x_{i-1}$ , $y_{i}$ , $\theta_{i}$ and $\gamma$ , and (ii) the marginal distribution of $y_{i}$ given $x_{i-1}$ , $\theta_{i}$ and $\gamma$ , can be obtained by considering the joint density of $x_{i}$ and $y_{i}$ , conditional on $x_{i-1}$ , $\theta_{i}$ and $\gamma$ , from (7) and (8). From these two equations, it follows that $(x_{i},y_{i})$ is jointly normal, and thus, the conditional density of $x_{i}$ given $y_{i}$ is

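$$p(x_{i}\mid x_{i-1},y_{i},\theta_{i},\gamma)=N\left(\frac{y_{i}/\sigma^{2}+g(x_{i-1},t_{i},\theta_{i})/u^{2}}{1/\sigma^{2}+1/u^{2}},\;\frac{1}{1/\sigma^{2}+1/u^{2}}I_{p}\right), \tag{11}$$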

whereas the marginal distribution of $y_{i}$ given $x_{i-1},\theta_{i}$ and $\gamma$ , obtained by integrating out $x_{i}$ , is given by

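$$p(y_{i}\mid x_{i-1},\theta_{i},\gamma)=N\left(g(x_{i-1},t_{i},\theta_{i}),\;(\sigma^{2}+u^{2})I_{p}\right). \tag{12}$$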
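Both distributions are the standard results for a Gaussian state observed with Gaussian noise: the conditional mean is a precision-weighted average of the observation and the Runge-Kutta prediction, and the predictive variance is the sum of the two variances. A per-coordinate sketch follows; the function names and the numerical values are illustrative assumptions.

```python
def state_conditional(y, g_val, sigma2, u2):
    """Mean and variance of x_i given x_{i-1} and y_i, per coordinate, as in (11)."""
    prec = 1.0 / sigma2 + 1.0 / u2                 # posterior precision
    mean = (y / sigma2 + g_val / u2) / prec        # precision-weighted average
    return mean, 1.0 / prec

def predictive(g_val, sigma2, u2):
    """Mean and variance of y_i given x_{i-1}, per coordinate, as in (12)."""
    return g_val, sigma2 + u2

# As u2 shrinks, the conditional mean moves from near y toward the prediction g_val
# and the conditional variance shrinks toward zero.
for u2 in (1.0, 1e-2, 1e-6):
    print(u2, state_conditional(y=1.2, g_val=1.0, sigma2=0.25, u2=u2))
```

The loop illustrates the role of the relaxation parameter: for very small $u^{2}$ the state is pinned to the deterministic prediction, so the data can no longer pull individual particles toward the observations.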

We now give the ELW algorithm for obtaining inference for $\theta$ based on the approximate dynamic model (5) and the posteriors defined above. Let the notation $[A,B,\cdots\,|\,C,D,\cdots]$ denote the conditional density of random entities (either scalars or vectors) $A,B,\cdots$ conditional on either random or fixed constant entities $C,D,\cdots$ . The ELW model of (7)-(9) can be written based on this notation as

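$$y_{i+1}\sim[\,y_{i+1}\mid x_{i+1},\gamma\,], \tag{13}$$

$$x_{i+1}\sim[\,x_{i+1}\mid x_{i},\theta_{i+1}\,], \tag{14}$$

$$\theta_{i+1}\sim[\,\theta_{i+1}\mid\theta_{i},y_{1:i}\,]. \tag{15}$$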

Equations (13)-(15) give the joint distribution of $(y_{i+1},x_{i+1},\theta_{i+1})$ conditional on the observations, states and $\theta$-values at previous time points, that is,

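$$[y_{i+1},x_{i+1},\theta_{i+1}\mid x_{i},\theta_{i},y_{1:i},\gamma]=[y_{i+1}\mid x_{i+1},\gamma]\cdot[x_{i+1}\mid x_{i},\theta_{i+1}]\cdot[\theta_{i+1}\mid\theta_{i},y_{1:i}]$$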

based on (13)-(15). The auxiliary particle filter (APF) technique rewrites this joint density as

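$$[y_{i+1},x_{i+1},\theta_{i+1}\mid x_{i},\theta_{i},y_{1:i},\gamma]=[x_{i+1}\mid x_{i},\theta_{i+1},y_{i+1},\gamma]\cdot[y_{i+1}\mid x_{i},\theta_{i+1},\gamma]\cdot[\theta_{i+1}\mid\theta_{i},y_{1:i}]. \tag{16}$$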

The first term on the right hand side of (16) is given by (11), thus available in closed form for sampling in our examples. The second term on the right hand side of (16) is given by (12), which is again available in closed form for evaluation in our examples. The third term in (16) is the Liu and West filter for $\theta$ given by (15), which can be easily sampled from. We give our sampling methodology to sample from the posteriors using sequential Monte Carlo. Suppose $\{x_{i}^{(j)},\,\theta_{i}^{(j)},\,\gamma_{i}^{(j)},\,s_{i}^{(j)}\}$ for $j=1,2,\cdots,N$ are $N$ samples from the posterior $\,[x_{i},\,\theta_{i},\,\gamma_{i},\,s_{i}\,|\,y_{1:i}\,]$ . The subscript $i$ on $\gamma_{i}$ does not imply any evolution equation for $\gamma$ . It just denotes the random variable $\gamma$ for marginal realizations of $\gamma$ from the posterior $[\gamma\,|\,s_{i}]$ . Similarly, $s_{i}$ denotes realizations of the sufficient statistic at time point $i$ based on its functional equation, namely, $\mathcal{S}(y_{1:i},x_{0:i},\theta_{i})$ when $x_{0:i}$ and $\theta_{i}$ are samples from the posterior $[x_{0:i},\theta_{i}\,|\,y_{1:i}]$ .

The steps of our sampling algorithm are as follows:

• First, sample $\theta_{i+1}^{(j)}\sim[\theta_{i+1}\,|\,\theta_{i}^{(j)},\,y_{1:i}]$ according to (9) for $j=1,2,\cdots,N$.

• Compute weights $w_{i}^{(j)}\propto[\,y_{i+1}\,|\,x_{i}^{(j)},\,\theta_{i+1}^{(j)},\,\gamma_{i}^{(j)}\,]$ for $j=1,2,\cdots,N$.

• Obtain $N$ resamples $\{\,\tilde{x}_{i}^{(j)},\,\tilde{\theta}_{i+1}^{(j)},\,\tilde{\gamma}_{i}^{(j)},\,\tilde{s}_{i}^{(j)}\,\}_{j=1}^{N}$ by sampling from the collection $\{\,{x}_{i}^{(j)},\,{\theta}_{i+1}^{(j)},\,{\gamma}_{i}^{(j)},\,{s}_{i}^{(j)}\,\}_{j=1}^{N}$ according to the weights $\{\,w_{i}^{(j)}\,\}_{j=1}^{N}$.

• Sample $\tilde{x}_{i+1}^{(j)}\sim[\,x_{i+1}\,|\,\tilde{x}_{i}^{(j)},\,\tilde{\theta}_{i+1}^{(j)},\,y_{i+1},\,\tilde{\gamma}_{i}^{(j)}\,]$ for $j=1,2,\cdots,N$.

• Compute $\tilde{s}_{i+1}^{(j)}=\mathcal{S}(\tilde{s}_{i}^{(j)},\,y_{i+1},\,\tilde{x}_{i+1}^{(j)},\,\tilde{\theta}_{i+1}^{(j)})$ for $j=1,2,\cdots,N$.

• Sample $\tilde{\gamma}_{i+1}^{(j)}\sim[\,\gamma\,|\,\tilde{s}_{i+1}^{(j)}]$ for $j=1,2,\cdots,N$.

Then, it follows that the $N$ samples $\{\tilde{x}_{i+1}^{(j)},\,\tilde{\theta}_{i+1}^{(j)},\,\tilde{\gamma}_{i+1}^{(j)},\,\tilde{s}_{i+1}^{(j)}\}$ for $j=1,2,\cdots,N$ are realizations from the posterior $\,[x_{i+1},\,\theta_{i+1},\,\gamma_{i+1},\,s_{i+1}\,|\,y_{1:i+1}\,]$ . As the tuning parameter $\tilde{h}\rightarrow 0$ , the posterior of $\theta$ at every time point $i$ from the approximate dynamic model becomes closer to the true posterior from the DEM.
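The six steps above can be sketched compactly for a scalar state and a scalar $\theta$. This is an illustrative sketch, not the authors' implementation: the particle layout (a dict with keys `x`, `theta`, `gamma`, `s`), the one-step map signature `g(x, theta)`, the starting sufficient statistic and the toy data are all assumptions, and the $\theta$ move assumes the usual Liu and West shrinkage kernel.

```python
import math
import random
import statistics

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def elw_step(particles, y_next, g, u2, a, rng):
    """One filter update; s = (shape, rate) is the Gamma statistic for gamma = 1 / sigma^2."""
    thetas = [p["theta"] for p in particles]
    mean_th = statistics.fmean(thetas)
    sd_th = math.sqrt((1.0 - a * a) * statistics.variance(thetas))
    # Step 1: move theta with the shrinkage kernel.
    new_th = [a * th + (1.0 - a) * mean_th + rng.gauss(0.0, sd_th) for th in thetas]
    # Step 2: weights from the predictive density of y_{i+1}, variance sigma^2 + u^2.
    preds = [g(p["x"], th) for p, th in zip(particles, new_th)]
    weights = [normal_pdf(y_next, m, 1.0 / p["gamma"] + u2)
               for p, m in zip(particles, preds)]
    # Step 3: resample particle indices according to the weights.
    idx = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    out = []
    for j in idx:
        p, th, m = particles[j], new_th[j], preds[j]
        # Step 4: draw the new state from its Gaussian conditional given y_{i+1}.
        prec = p["gamma"] + 1.0 / u2
        x_new = rng.gauss((y_next * p["gamma"] + m / u2) / prec, math.sqrt(1.0 / prec))
        # Step 5: update the sufficient statistic (p = 1, so the shape grows by 1/2).
        shape = p["s"][0] + 0.5
        rate = p["s"][1] + 0.5 * (y_next - x_new) ** 2
        # Step 6: refresh gamma; gammavariate takes a scale, hence 1 / rate.
        gam = rng.gammavariate(shape, 1.0 / rate)
        out.append({"x": x_new, "theta": th, "gamma": gam, "s": (shape, rate)})
    return out

rng = random.Random(0)
g = lambda x, theta: x * math.exp(-0.5 * theta)     # exact one-step map of dx/dt = -theta x, h = 0.5
particles = [{"x": 2.0 + rng.gauss(0.0, 0.1), "theta": rng.uniform(0.2, 1.5),
              "gamma": rng.gammavariate(2.0, 1.0 / 0.02), "s": (2.0, 0.02)}
             for _ in range(500)]
for y in [1.35, 0.90, 0.60, 0.41]:                  # toy observations, roughly theta = 0.8
    particles = elw_step(particles, y, g, u2=1e-3, a=0.95, rng=rng)
theta_mean = statistics.fmean(p["theta"] for p in particles)
```

Each update touches only the latest observation and advances each particle by a single application of `g`, which is the source of the computational savings relative to methods that re-solve the whole trajectory for every proposed parameter value.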

As mentioned earlier, in the above algorithm, the subscripts $i$ on $\gamma_{i}$ and $s_{i}$ do not imply any kind of evolution over time. They just represent the update of the parameter and statistic, respectively, as new data become available. The tuning parameter $a$ determines the extent of shrinkage of the normal mixture through its mean. It also controls the smoothness through the variance term $\tilde{h}^{2}V_{i}$. It is usually chosen to be around $0.95$, and we fixed $a$ at $0.95$ in all the examples that follow. This corresponds to taking $\tilde{h}^{2}=1-a^{2}=0.0975$ and $\delta=1/(3-2a)=0.909$. For the covariance matrix $V_{i}$, we chose $V_{i}=(N-1)^{-1}\sum_{j=1}^{N}(\theta_{i-1}^{(j)}-\bar{\theta}_{i-1})(\theta_{i-1}^{(j)}-\bar{\theta}_{i-1})^{T}$.
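The arithmetic linking $a$, $\tilde{h}^{2}$ and $\delta$ follows from $a=(3\delta-1)/(2\delta)$ and $a^{2}+\tilde{h}^{2}=1$, and can be verified in a few lines (variable names are illustrative):

```python
a = 0.95
h2 = 1.0 - a ** 2                  # h-tilde squared, from a = (1 - h2) ** 0.5
delta = 1.0 / (3.0 - 2.0 * a)      # solve a = (3 * delta - 1) / (2 * delta) for delta

# Inverse map: recover a from the discount factor delta.
a_back = (3.0 * delta - 1.0) / (2.0 * delta)

print(round(h2, 4), round(delta, 3), round(a_back, 2))   # prints 0.0975 0.909 0.95
```

Conversely, the commonly quoted discount factors $\delta=0.95$ and $\delta=0.99$ would correspond to values of $a$ closer to one, that is, to less jitter per update.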

The initial proposal density $q(x_{0},\theta,\gamma)$ affects the performance of the algorithm. A proposal density that is concentrated around the true parameter performs better than other proposal densities, even with a relatively small number of particles. In practice, we suggest that one run the ELW filter with initial particles $\theta^{(j)}$ and $\gamma^{(j)}$ from $\pi(\theta,\gamma)$ and then rerun it with the particles $\hat{\theta}^{(j)}$ and $\hat{\gamma}^{(j)}$ from the first inference. This is equivalent to considering the proposal density

$$q(x_{0},\theta,\gamma)\equiv\pi(x_{0})\times\pi(\theta,\gamma\mid\mathbf{y}_{n}).$$

We call the resulting particles the refined particles. Refined particles were used in all the examples that follow.

4 Convergence of the Posterior

4.1 Convergence of the Posterior as the relaxation parameter decreases

In this subsection, we show that as the relaxation parameter $u$ converges to $0$, the posterior density of $(x_{0},\theta,\lambda)$ from the approximate dynamic model converges to the posterior from the DEM, i.e.

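$$\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n},u^{2})=\frac{\int L(\Lambda)\,\pi(dx_{1},\ldots,dx_{n}\mid x_{0},\theta,u^{2})\,\pi(x_{0},\theta,\lambda)}{\int\int L(\Lambda)\,\pi(dx_{1},\ldots,dx_{n}\mid x_{0},\theta,u^{2})\,\pi(dx_{0},d\theta,d\lambda)}$$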

converges to

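$$\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n})=\frac{L^{*}(x_{0},\theta,\lambda)\,\pi(x_{0},\theta,\lambda)}{\int L^{*}(x_{0},\theta,\lambda)\,\pi(dx_{0},d\theta,d\lambda)}$$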

as $u^{2}\to 0$ , where $\Lambda=(x_{0},\ldots,x_{n},\theta,\lambda)$ ,

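$$\begin{aligned}
L(\Lambda)&=\left(\frac{\lambda}{2\pi}\right)^{np/2}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}\|y_{i}-x_{i}\|^{2}\right\},\\
L^{*}(x_{0},\theta,\lambda)&=\left(\frac{\lambda}{2\pi}\right)^{np/2}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}\|y_{i}-g^{i}(x_{0},t_{i-1};\theta)\|^{2}\right\},
\end{aligned}$$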

with $g^{i}(x_{0},t_{i-1};\theta)=g(g^{i-1}(x_{0},t_{i-2};\theta),t_{i-1};\theta)$ . Note that $\pi(x_{0},\theta,\lambda|{\bf y}_{n})$ is the posterior of DEM.

Theorem 4.1

Consider model (5) and prior (6). Suppose $f(x,t;\theta)$ is continuous in $x$ . Then, the posterior density of the dynamic model (5) converges to that of the differential equation model (4), i.e.

$$\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n},u^{2})\to\pi(x_{0},\theta,\lambda\mid\mathbf{y}_{n})$$

for all $x_{0},\theta,\lambda$ as $u^{2}\to 0$ .

4.2 Convergence of the Posterior as the step size increases

We have shown that the posterior of the dynamic model (5) converges to that of the differential equation model (4) as $u^{2}\to 0$ . In this subsection, we will prove that the posterior of the differential equation model converges to that of the true model.

If the step size is $m$, each time interval $[t_{i-1},t_{i}]$ is divided into $m$ segments of length $(t_{i}-t_{i-1})/m$, and the Runge-Kutta method is applied to each subinterval to obtain the $x_{i}$'s. To clarify the difference, let $x^{m}$ be the approximate solution of the differential equation obtained by the fourth-order Runge-Kutta method with $m$ segments. Similarly, let $\pi_{m}$ and $\pi_{true}$ be the posterior distributions corresponding to $x^{m}$ and the true $x$, respectively. Note $x^{m}(t_{1})=x(t_{1})$ for all $m$.

Theorem 4.2

Consider model (4) and prior (6). Suppose $f(x,t;\theta)$ satisfies a Lipschitz condition in $x$, i.e. there exists a constant $K>0$ such that

$$\|f(x,t;\theta)-f(x^{\prime},t;\theta)\|\leq K\|x-x^{\prime}\|\qquad(19)$$

for any $x,x^{\prime}\in\mathbb{R}^{p}$, $t\in[T_{0},T_{1}]$ and $\theta\in\Theta$. Then, the posterior density of the differential equation model (4) converges to that of the true model, i.e.

[TABLE]

for all $x_{0},\theta,\lambda$ as $m\to\infty$ .

This result guarantees that the differential equation model works well with a reasonably large number of segments $m$ under the Lipschitz condition.

4.3 Choice of the relaxation parameter and the step size

In practice, the choice of $u^{2}$ and $m$ can affect the performance of the approximation, and the approximate posterior distribution may vary with the choice of these values. Theoretically, the smaller the relaxation parameter $u^{2}$ is, the closer the approximate posterior is to the true posterior. In practice, however, a moderately large value of $u^{2}$ may be needed to obtain a stable posterior approximation. We suggest the following strategy for choosing the state variance $u^{2}$. Consider various $u^{2}$ values in turn, from large to small. For each $u^{2}$ value, check the stability of the posteriors by running two or three ELW filters simultaneously. Here, stability means that the posterior densities from all ELW runs are close enough to each other. Finally, use for inference the smallest $u^{2}$ value that gives a stable result.
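The selection strategy above can be sketched in code. The text does not specify how closeness between posterior densities is measured, so this sketch substitutes a two-sample Kolmogorov-Smirnov distance with a tolerance of 0.1; both are assumptions made here. `choose_u2`, `ks_distance` and `toy_filter` are hypothetical names, and `toy_filter` is a contrived stand-in for an ELW run that returns posterior samples of a single scalar parameter.

```python
import random
from itertools import combinations

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance between samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def choose_u2(run_filter, u2_grid, n_runs=3, tol=0.1):
    """Return the smallest u2 in the grid whose repeated runs agree, or None."""
    chosen = None
    for u2 in sorted(u2_grid, reverse=True):  # from large to small
        runs = [run_filter(u2, seed) for seed in range(n_runs)]
        worst = max(ks_distance(a, b) for a, b in combinations(runs, 2))
        if worst <= tol:
            chosen = u2
    return chosen

def toy_filter(u2, seed):
    # Hypothetical stand-in for an ELW run: runs agree across seeds only if u2 >= 5.
    rng = random.Random(seed)
    shift = 0.0 if u2 >= 5 else 2.0 * seed
    return [rng.gauss(shift, 1.0) for _ in range(2000)]

best = choose_u2(toy_filter, [20, 10, 5, 1, 1e-5])
print("selected u^2:", best)
```

In a real application `run_filter` would wrap the particle filter, and the comparison would be applied to each marginal posterior of interest rather than to a single parameter.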

For convenience, let $h\equiv t_{i+1}-t_{i}$ for all $i=1,2,\ldots,n-1$. For the choice of $m$, we assume $h/m=O(n^{-\alpha})$. Theoretically, a larger value of $m$ gives more accurate inference but requires heavier computation. In the following theorem, we relate the step size $h/m$ to the approximation error rate of the posterior, and based on the theorem we suggest values of $m$ for computation according to the acceptable error rate. The theorem requires the following assumptions.

A1.

$\{x(t):t\in[0,T]\}$ is a compact subset of $\mathbb{R}^{p}$ ;

A2.

$\{y(t):t\in[0,T]\}$ is a bounded subset of $\mathbb{R}^{p}$ ; and

A3.

the $K$ th order derivative of $f(x,t;\theta)$ with respect to $t$ exists and is continuous in $x$ and $t$ , where $K$ is the order of the numerical method $g$ .

Theorem 4.3

Consider model (4) and prior (6). Suppose $f(x,t;\theta)$ satisfies the Lipschitz condition (19) in $x$, and suppose A1-A3 hold. Let $K$ be the order of the numerical method $g$ and $h/m=O(n^{-\alpha})$. If $\alpha\geq(1+R)/K$, then the error rate of the posterior approximation is $O(n^{-R})$ for sufficiently large $n$, i.e.,

[TABLE]

for all $x_{0},\theta,\lambda$.

Note that the order of the Runge-Kutta method is $K=4$, and the rate of $h$ is $n^{-1}$ because we consider a bounded time interval $[0,T]\subset\mathbb{R}$ with $T<\infty$. With $m=1$ we have $\alpha=1$, so the theorem allows any $R\leq\alpha K-1=3$: an error rate of $O(n^{-3})$, or any slower rate, can be achieved by $m=1$ for large $n$. In practice, however, additional error from the SMC sampling may arise, in which case $m$ may need to be larger than $1$.
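The arithmetic behind this remark can be written as a small helper. This is a rough sketch only: it takes $h=n^{-1}$ exactly and ignores the constants hidden in the $O(\cdot)$ notation, so the returned $m$ indicates an order of magnitude rather than a guaranteed value. The function names are introduced here for illustration.

```python
import math

def min_alpha(R, K):
    """Smallest alpha allowed by the theorem: alpha >= (1 + R) / K."""
    return (1.0 + R) / K

def segments_needed(n, R, K=4):
    """Smallest integer m with (1/n)/m <= n**(-alpha), i.e. m >= n**(alpha - 1)."""
    alpha = min_alpha(R, K)
    return max(1, math.ceil(n ** (alpha - 1.0)))

for R in (1, 3, 7):
    print(f"target O(n^-{R}): alpha >= {min_alpha(R, 4):.2f}, "
          f"m ~ {segments_needed(100, R)} for n = 100")
```

For the fourth-order Runge-Kutta method, any target rate up to $O(n^{-3})$ gives $m=1$ under these simplifications, whereas faster target rates require $m$ to grow with $n$.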

5 Simulated Data Examples

5.1 Newton’s law of Cooling

5.1.1 Description of model and data generation step

Newton’s law of cooling, formulated by the English physicist Isaac Newton, is a model describing the temperature change of an object. According to the model, the temperature of an object changes at a rate proportional to the temperature difference between the object and its surroundings. This is expressed by the following ODE:

$$\frac{dx(t)}{dt}=\theta_{1}\,(x(t)-\theta_{2}),\qquad(20)$$

where $x(t)$ is the temperature of the object at time $t$, $\theta_{1}$ is a negative constant and $\theta_{2}$ is the temperature of the surroundings. All temperatures are in degrees Celsius. For more details, see Incropera (2006).

We chose this model as a testbed for our method. Since the solution of (20) is known in closed form as

$$x(t)=\theta_{2}-(\theta_{2}-x_{0})\,e^{\theta_{1}t},\qquad(21)$$

where $x_{0}=x(0)$, we can calculate the true posterior directly. The data $y_{i}=y(t_{i})$ were generated from the true mean function (21) with the model parameters $x_{0}=20$, $\theta=(-0.5,80)^{T}$, $\sigma^{2}=25$ and time points $t_{i}=ih$ for $i=1,\ldots,n$, where the sample size is $n=100$ and the step size is $h=0.15$. The simulated data and the true mean function are shown in Figure 1.

The priors were set by

[TABLE]

where $\mu_{x_{0}}=y_{1}$, $a_{\lambda}=1$, $b_{\lambda}=1$ and $c=1$. The values of $y_{i}$ lie in the interval $[65,90]$ after the 50th observation, so the temperature of the surroundings, $\theta_{2}$, should lie around this interval. The prior of $\theta_{2}$ was therefore set to $Uniform(50,150)$, whose support includes $[65,90]$. By similar reasoning, we set $\theta_{1}\sim Uniform(-100,0)$.

The true posterior of $\theta$ and $\lambda$ can be obtained as follows:

[TABLE]

where

[TABLE]

5.1.2 Assessment of the convergence of the posteriors

We assessed the convergence of the posteriors described in Theorem 4.1. To show that the posterior of the dynamic model converges to that of the DEM, we obtained simulation results for RDEM with $u^{2}=1,0.1^{1},0.1^{2}$ and $0.1^{5}$. The DEM was treated as a dynamic model with a small value of $u^{2}$. We ran the ELW filter with 20,000 particles and fixed the number of segments $m$ at 1. For all of the settings, the ELW filter took less than 3 seconds for 20,000 particles. Histograms of the marginal posterior distributions are shown in Figure 2. The posterior of the dynamic model appears to approach that of the DEM as $u^{2}$ decreases to zero, which supports the theoretical result of Theorem 4.1.

To show that the posterior of the DEM converges to that of the true model, we obtained simulation results for the DEM with the number of segments $m=1,2,4$ and for the true model. We approximated the DEM by the dynamic model with $u^{2}=0.1^{5}$. For the true model, we used a grid sampling algorithm for the true posterior (22). For each setting, the ELW filter took less than 3 seconds for 20,000 particles. The grid was set on $[-2,0]\times[70,90]$, with each axis divided into 50 intervals of equal length, resulting in 51 points per axis, and 20,000 posterior samples were drawn. Histograms of the marginal posterior distributions are shown in Figure 3. The posterior densities of the DEM are quite similar to each other, but they show larger variation than the true posterior densities.
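The grid sampling step can be illustrated with a simplified sketch. It is not the sampler for the true posterior (22): here $x_{0}$ and $\sigma^{2}$ are treated as known, the prior is flat on the grid, and the closed-form cooling curve $x(t)=\theta_{2}-(\theta_{2}-x_{0})e^{\theta_{1}t}$ is assumed as the mean function. All function and variable names are introduced here for illustration; the grid, sample size and parameter values follow the text.

```python
import math
import random

def mean_fn(t, x0, th1, th2):
    """Assumed closed-form cooling curve x(t) = th2 - (th2 - x0) * exp(th1 * t)."""
    return th2 - (th2 - x0) * math.exp(th1 * t)

rng = random.Random(0)
x0, sigma2, n, h = 20.0, 25.0, 100, 0.15
times = [i * h for i in range(1, n + 1)]
y = [mean_fn(t, x0, -0.5, 80.0) + rng.gauss(0.0, math.sqrt(sigma2)) for t in times]

# 51 x 51 grid on [-2, 0] x [70, 90]: 50 equal intervals per axis.
grid1 = [-2.0 + 2.0 * k / 50 for k in range(51)]
grid2 = [70.0 + 20.0 * k / 50 for k in range(51)]
points = [(a, b) for a in grid1 for b in grid2]

def log_post(th1, th2):
    # Gaussian log-likelihood with a flat prior on the grid; x0 and sigma2 are
    # treated as known here, which is a simplification of the model in the text.
    return -0.5 / sigma2 * sum((yi - mean_fn(t, x0, th1, th2)) ** 2
                               for yi, t in zip(y, times))

logp = [log_post(a, b) for a, b in points]
top = max(logp)
weights = [math.exp(v - top) for v in logp]  # subtract the max for stability

samples = rng.choices(points, weights=weights, k=20000)
post_mean = [sum(s[d] for s in samples) / len(samples) for d in (0, 1)]
print("posterior means (theta1, theta2):", post_mean)
```

Sampling grid points with probability proportional to the unnormalized posterior avoids any MCMC tuning, at the cost of a resolution limited by the grid spacing.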

5.2 FitzHugh-Nagumo model

5.2.1 Description of model and data generation step

The FitzHugh-Nagumo model (FitzHugh, 1961; Nagumo et al., 1962) describes the spike potential in the giant axon of squid neurons by an ODE with two state variables and three parameters:

[TABLE]

where $-0.8<\theta_{1},\theta_{2}<0.8$ and $0<\theta_{3}<8$. The two state variables, $x_{1}(t)$ and $x_{2}(t)$, are the voltage across a membrane and the outward currents at time $t$, respectively.

Using the FitzHugh-Nagumo model, we compare the proposed method with the parameter cascading method (Ramsay et al., 2007), the delayed rejection adaptive Metropolis (DRAM) algorithm (Soetaert and Petzoldt, 2010) and the Laplace approximated posterior (LAP) method (Dass et al., 2017). The data $y_{i}=y(t_{i})$ were generated from DEM (4) with the model parameters $x_{0}=(-1,1)^{T}$, $\theta=(0.2,0.2,3)^{T}$, $\sigma^{2}=25$ and time points $t_{i}=ih$ for $i=1,\ldots,n$, where the sample size is $n=100$, the step size is $h=0.2$ and $m=400$. The simulated data and the true mean function are shown in Figure 4. The priors were set by

[TABLE]

where $\mu_{x_{0}}=y_{1},a_{\lambda}=1,b_{\lambda}=1$ , $c=1$ and $A=\{(\theta_{1},\theta_{2},\theta_{3}):-0.8<\theta_{1},\theta_{2}<0.8,0<\theta_{3}<8\}$ .
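The mean function used for data generation in this subsection can be sketched as follows. The display of the ODE is not reproduced in this text, so the sketch assumes the commonly used form from Ramsay et al. (2007), $dx_{1}/dt=\theta_{3}(x_{1}-x_{1}^{3}/3+x_{2})$ and $dx_{2}/dt=-(x_{1}-\theta_{1}+\theta_{2}x_{2})/\theta_{3}$; if the paper's parameterization differs, `fhn_rhs` must be changed accordingly. The function names are hypothetical, and observation noise is deliberately left out.

```python
def fhn_rhs(x, t, theta):
    """FitzHugh-Nagumo right-hand side in the form assumed here (Ramsay et al., 2007)."""
    a, b, c = theta
    v, r = x
    return [c * (v - v ** 3 / 3.0 + r), -(v - a + b * r) / c]

def rk4_step(f, x, t, dt, theta):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(x, t, theta)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], t + 0.5 * dt, theta)
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], t + 0.5 * dt, theta)
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)], t + dt, theta)
    return [xi + dt / 6.0 * (p + 2.0 * q + 2.0 * s + w)
            for xi, p, q, s, w in zip(x, k1, k2, k3, k4)]

def mean_function(x0, theta, n=100, h=0.2, m=400):
    """Return [x(t_0), ..., x(t_n)] using m RK4 sub-steps per observation interval."""
    path, x, dt = [list(x0)], list(x0), h / m
    for i in range(n):
        for j in range(m):
            x = rk4_step(fhn_rhs, x, i * h + j * dt, dt, theta)
        path.append(x)
    return path

path = mean_function([-1.0, 1.0], (0.2, 0.2, 3.0))
v_values = [p[0] for p in path]
print("range of x1:", min(v_values), max(v_values))
```

Noisy observations would then be obtained by adding independent $N(0,\sigma^{2})$ errors to each component at the observation times.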

5.2.2 Comparison with other methods

To compare the proposed method (RDEM-ELW) with other methods, the parameter cascading (PC) method, DRAM algorithm and LAP method were applied to the same data set. We used the R packages CollocInfer and FME for the parameter cascading and DRAM, respectively.

The PC method is one of the popular frequentist methods for estimating the parameters of an ODE. It uses the collocation method, which represents the state vector $x(t)$ by a basis expansion. The penalized likelihood criterion has three components: the matrix of basis-expansion coefficients $C$, the unknown parameter $\theta$ and the smoothing parameter $\lambda$. PC optimizes the penalized likelihood in two steps. In the inner optimization, the criterion is optimized with respect to the coefficients $C$ while $\theta$ and $\lambda$ are fixed. Then, in the outer optimization, the penalized likelihood is optimized with respect to $\theta$ while $\lambda$ is kept fixed. The smoothing parameter $\lambda$ is chosen by an appropriate criterion such as the numerical stability of the parameter estimates or the forward prediction error (Hooker et al., 2000). For more details about the PC method, see Ramsay et al. (2007). For the PC method, we used a third-order B-spline basis and $2n-1$ equally spaced knots on $[t_{0},t_{n}]$. The smoothing parameter was set to $\lambda=10^{5}$. The initial parameters were drawn from $N(\theta_{0},(0.01)^{2}I_{q})$, where $\theta_{0}$ is the true parameter value.

The DRAM algorithm, a variant of the standard Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), was chosen as a benchmark on the Bayesian side. With the R package FME (Soetaert and Petzoldt, 2010), one can fit the DEM using the DRAM algorithm for the parameters and numerical integration for the state variables. We ran the DRAM algorithm with the initial parameter set to the maximum likelihood estimate obtained from the modFit() function and with the maximal number of tries set to 1. The parameter covariance was updated every 100 iterations. We obtained 20,000 posterior samples for the inference.

The LAP method is another benchmark on the Bayesian side. It is fast when the dimension of the parameter is small, and empirically it performs comparably to or better than the PC method and the DRAM algorithm (Dass et al., 2017). Since the dimension of the parameter is small, the grid sampling method for $\theta$ was chosen. For each parameter $\theta_{i}$, the grid range was set to $[\widehat{\theta}_{i}^{R}\pm 4\widehat{sd}(\widehat{\theta}_{i}^{R})]$, where $\widehat{\theta}_{i}^{R}$ is the estimate of $\theta_{i}$ from the PC method. Each axis was divided into $31$ intervals of equal length, and the number of segments for numerical integration was set to $m=2$. The priors for the parameters were set as in subsection 5.2.1, and 20,000 posterior samples were obtained.

For the RDEM-ELW, the number of segments for numerical integration and the state variance were set to $m=2$ and $u^{2}=0.1^{5}$, respectively. The priors for the parameters were set as described in subsection 5.2.1, and the number of particles was $N=20,000$. We generated 100 simulated data sets using the fourth-order Runge-Kutta method, with the model parameters set as described in subsection 5.2.1.

For the RDEM, PC and DRAM methods, R and C/C++ were used for implementation; R and Fortran 90 were used for the LAP method. Averaged over 100 simulations, RDEM took only 3.523 seconds for estimation, while the PC method, DRAM algorithm and LAP method took 49.152, 276.700 and 215.591 seconds, respectively, so RDEM was roughly 14, 79 and 61 times faster. Boxplots of the computation times for each method are given in Figure 5. The proposed RDEM method substantially reduced the computation time and was faster even than the frequentist PC method. Table 1 reports the absolute biases, standard deviations and root mean squared errors (rmse) of $\hat{\theta}$ in the FitzHugh-Nagumo model. The RDEM method appears to provide reasonable estimates in terms of bias, but with a larger standard deviation than the other methods.

6 Lynx-hare data: Lotka-Volterra equation

There are a large number of models for predator-prey relationships because predation is often direct, conspicuous and easy to study. The Lotka-Volterra model is one of the simplest models of predator-prey interactions. Lotka (1925) and Volterra (1926) independently developed a model of the form:

[TABLE]

where $x_{1}$ denotes the number of prey and $x_{2}$ denotes the number of their predators. The model parameters $\theta_{1},\theta_{2},\theta_{3}$ and $\theta_{4}$ are the intrinsic rate of prey population increase, the predation rate, the predator mortality rate and the offspring rate of the predator, respectively.
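For orientation, the classical Lotka-Volterra system consistent with these parameter descriptions can be written as below. This is the textbook form and is given here as an assumption; the arrangement of terms in equation (23) may differ.

$$\frac{dx_{1}(t)}{dt}=x_{1}(t)\,\{\theta_{1}-\theta_{2}x_{2}(t)\},\qquad\frac{dx_{2}(t)}{dt}=-x_{2}(t)\,\{\theta_{3}-\theta_{4}x_{1}(t)\}.$$

Under this form, the prey grow at rate $\theta_{1}$ in the absence of predators, the predators die at rate $\theta_{3}$ in the absence of prey, and the non-trivial equilibrium is $(x_{1},x_{2})=(\theta_{3}/\theta_{4},\theta_{1}/\theta_{2})$, around which the populations oscillate.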

The lynx-hare data is a popular data set recording the numbers of lynx and snowshoe hares captured in northern Canada, collected by the Hudson's Bay Company. It contains the numbers of lynx and hare furs, which serve as a proxy for the actual populations. We obtained the annual data between 1900 and 1920, recorded in thousands, from Li (2012); the data are shown in Figure 6. The Lotka-Volterra equation (23) is fitted to the data set and used to predict future values of trapped lynx and hares.

The same model and prior as in subsection 5.2 were used. Following the strategy in subsection 4.3, we ran the ELW filter 10 times with $N=500,000$ particles for each of $u^{2}=20,10,5,1$ and $0.1^{5}$, in turn. In this case, $u^{2}$ values smaller than $5$ led to somewhat unstable approximations even with 3,000,000 particles. The state variance was therefore set to $u^{2}=5$ by the criterion in subsection 4.3, because it gave stable posterior densities across the ELW runs. The other model parameters were chosen as in subsection 5.2. On average, each run took approximately 17 seconds.

The marginal posterior densities of the parameters are given in Figure 7. Posterior summary statistics for the first run are reported in Table 2. Figure 8 contains scatter plots of the observations and 90% posterior credible lines for the predicted values at 10 future time points when $m=2$ and $u^{2}=5$. The predicted values of trapped lynx and hares follow oscillating patterns. The prediction interval widens as the prediction horizon increases and as the predicted value becomes larger.

7 Discussion

Many biological and physical systems are described by a set of differential equations. To understand these processes, estimation of their parameters is essential. However, especially in the Bayesian literature, there is no standard framework for analyzing differential equation models. In many cases, the posterior of the parameters does not belong to a well-known family, so grid sampling or MCMC methods are used to obtain posterior samples, and these usually suffer from heavy computation. We propose a general framework for analyzing DEMs using relaxation via dynamical systems. The dynamic model enables fast inference for DEMs and provides convenient sampling methods. Among the sampling algorithms for dynamic models, we adopted the ELW filter suggested by Rios and Lopes (2013). We argue that our method can be an alternative to existing inference methods when one needs a fast and reasonable result. This argument is supported by the example in subsection 5.2. Section 4 guarantees the convergence of the approximate posterior to the true posterior. However, the theoretical results in this paper do not consider the additional error from the SMC sampling. The proposed method may be improved if a better SMC algorithm is developed.

Appendix

The following lemma shows that each $x_{i}$ given $x_{i-1},\theta,u^{2}$ converges to $g(x_{i-1},t_{i-1};\theta)$ in probability as $u^{2}\to 0$.

Lemma 7.1

Consider model (5). Then, for $i=1,\ldots,n$ , $x_{i}$ given $x_{i-1},\theta$ and $u^{2}$ converges to $g(x_{i-1},t_{i-1};\theta)$ in probability as $u^{2}\to 0$ .

Proof of Lemma 7.1

Note that $r^{T}x_{i}|x_{i-1},\theta,u^{2}\sim N(r^{T}g(x_{i-1},t_{i-1};\theta),u^{2}\|r\|^{2})$ for all $r\in\mathbb{R}^{p}$, $i=1,\ldots,n$. If we denote by $\phi_{[Z]}$ the moment generating function (mgf) of a random variable $Z$, then for any $r\in\mathbb{R}^{p}$,

[TABLE]

as $u^{2}\to 0$, for $i=1,\ldots,n$. Note that (24) is the mgf of $[r^{T}g(x_{i-1},t_{i-1};\theta)|x_{i-1},\theta]$. Since convergence of mgfs implies convergence in distribution, it implies

[TABLE]

for any $r\in\mathbb{R}^{p}$. Hence, by the Cramér-Wold theorem (Billingsley, 1995), $[x_{i}|x_{i-1},\theta]$ converges to $g(x_{i-1},t_{i-1};\theta)$ in distribution as $u^{2}\to 0$. Note that, given $x_{i-1}$ and $\theta$, $g(x_{i-1},t_{i-1};\theta)$ is a constant. Thus, by the Portmanteau theorem (Dudley, 2002), convergence in probability follows. $\square$
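The limit used in this proof can be written out explicitly from the standard Gaussian mgf. The symbol $s$ for the mgf argument is notation introduced here, and the display is offered as a worked step rather than as a reproduction of the paper's equation. Since $r^{T}x_{i}|x_{i-1},\theta,u^{2}$ is normal with mean $r^{T}g(x_{i-1},t_{i-1};\theta)$ and variance $u^{2}\|r\|^{2}$,

$$\phi_{[r^{T}x_{i}|x_{i-1},\theta,u^{2}]}(s)=\exp\Big\{s\,r^{T}g(x_{i-1},t_{i-1};\theta)+\tfrac{1}{2}s^{2}u^{2}\|r\|^{2}\Big\}\;\longrightarrow\;\exp\big\{s\,r^{T}g(x_{i-1},t_{i-1};\theta)\big\}\quad\text{as }u^{2}\to 0,$$

for every real $s$. The limit is the mgf of a point mass at $r^{T}g(x_{i-1},t_{i-1};\theta)$, which is why the limiting distribution is degenerate and convergence in distribution upgrades to convergence in probability.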

With the continuity of $f(x,t;\theta)$ in $x$, Lemma 7.1 can be extended to joint convergence in probability by mathematical induction. Lemma 7.2 states the result.

Lemma 7.2

Consider model (5). Suppose $f(x,t;\theta)$ is continuous in $x$ . Then, $[x_{1},\ldots,x_{n}\mid x_{0},\theta,u^{2}]$ converges to $(g(x_{0},t_{0};\theta),\ldots,g^{n}(x_{0},t_{n-1};\theta))$ in probability as $u^{2}\to 0$ .

Proof of Lemma 7.2

Let $X=(x_{1},\ldots,x_{n})$ and $\bar{X}=(g(x_{0},t_{0};\theta),\ldots,g^{n}(x_{0},t_{n-1};\theta))$ where

[TABLE]

by the relation (3) where $g^{i}(x_{0},t_{i};\theta)=g(g^{i-1}(x_{0},t_{i-1};\theta),t_{i};\theta)$ is defined recursively. We want to show

[TABLE]

for given $\epsilon>0$ . It suffices to prove

[TABLE]

for given $\epsilon>0$ and $i=1,\ldots,n$. We use mathematical induction.

When $i=1$ , we can check

[TABLE]

by Lemma 7.1. Suppose (26) holds for $i=k$ . Note

[TABLE]

By assumption, $g(x,t;\theta)$ is continuous in $x$. Thus, (28) converges to 0 as $u^{2}\to 0$ because (26) holds for $i=k$. Also note that (27) is

[TABLE]

Since $P(\|x_{k+1}-g(x_{k},t_{k};\theta)\|\geq\epsilon/(2n)|x_{k},\theta,u^{2})\leq 1$, Lemma 7.1 and the bounded convergence theorem imply that (27) converges to 0 as $u^{2}\to 0$. $\square$

Proof of Theorem 4.1

Note that we need to prove

[TABLE]

as $u^{2}\to 0$ where $\Lambda=(x_{1},\ldots,x_{n},\theta,\lambda)$ .

To show (29), we only need to prove

[TABLE]

as $u^{2}\to 0$ . Since $L(\Lambda)=\lambda^{{np}/{2}}\exp({-\frac{\lambda}{2}\sum_{i=1}^{n}\|y_{i}-x_{i}\|^{2}})$ , it suffices to prove

[TABLE]

By Lemma 7.2, we have

[TABLE]

as $u^{2}\to 0$ . Note that the right hand side of (31) is the expectation of $\exp({-{\lambda}/{2}\cdot\sum_{i=1}^{n}\|y_{i}-x_{i}\|^{2}})$ with respect to $[g(x_{0},t_{1};\theta),\ldots,g^{n-1}(x_{0},t_{n-1};\theta)|x_{0},\theta]$ . Also note that $\exp({-{\lambda}/{2}\cdot\sum_{i=1}^{n}\|y_{i}-x_{i}\|^{2}})$ is bounded by 1 and is continuous in $x_{1},\ldots,x_{n}$ . Thus, the Portmanteau theorem implies (29).

Since we have proved (29), it suffices for (30) to show that $\int L(\Lambda)\pi(dx_{2},\ldots,dx_{n}|x_{0},\theta,u^{2})$ is dominated by an integrable random variable. It is easy to check because

[TABLE]

and $(\lambda)^{{np}/{2}}$ is integrable with respect to $\pi(x_{0},\theta,\lambda)$ . The dominated convergence theorem gives the desired result. $\blacksquare$

Proof of Theorem 4.2

Denote the likelihood of approximated $x$ with the number of segments $m$ as $L_{m}(x_{0},\theta,\lambda)$ , and let $L_{\text{true}}(x_{0},\theta,\lambda)$ be the likelihood of true $x$ . We should prove that

[TABLE]

converges to

[TABLE]

for any $x_{0},\theta$ and $\lambda$. It is well known that if $f(x,t;\theta)$ satisfies a Lipschitz condition in $x$, then the Runge-Kutta method converges to the true solution, i.e.

[TABLE]

See Cartwright and Piro (1992) for the proof. The convergence (32) implies that $L_{m}(x_{0},\theta,\lambda)$ converges to $L_{\text{true}}(x_{0},\theta,\lambda)$ for all $x_{0},\theta$ and $\lambda$ because the exponential function is continuous. This implies the convergence of the numerator.

For the denominator part, recall that

[TABLE]

and $(\lambda)^{{np}/{2}}$ is integrable with respect to $\pi(x_{0},\theta,\lambda)$ . Again, the dominated convergence theorem gives the desired result. $\blacksquare$

Proof of Theorem 4.3

First, we show that under A1-A3, $|ng_{n}(x_{0})-ng_{n}^{m}(x_{0})|=O(n(h/m)^{K})$ for sufficiently large $n$. Since we assume the Lipschitz continuity of $f$, the ODE has a unique solution with initial condition $x(t_{1})=x_{0}$. Assumptions A1 and A3 imply

[TABLE]

for some constant $B>0$. The local errors of the $K$th-order numerical method are given by

[TABLE]

for some $B^{\prime}>0$, which depends only on $\sup_{t}\|d^{K}f(x,t;\theta)/(dt^{K})\|\leq B$ (Palais and Palais, 2009). Thus, the local errors are uniformly bounded. This implies that the global errors are uniformly bounded by

[TABLE]

for some constant $C>0$ . Thus,

[TABLE]

where $\sup_{t\in[T_{0},T_{1}]}\|y(t)\|<C_{y}<\infty$ , $\sup_{t\in[T_{0},T_{1}]}\|x(t)\|<C_{x}<\infty$ for sufficiently large $n$ .

By the above inequality, for fixed $x_{0}\in\mathbb{R}^{p},\lambda>0$ ,

[TABLE]

because $e^{x}=1+O(x)$ for sufficiently small $x$ . It implies

[TABLE]

for sufficiently large $n$. If $\alpha\geq(1+R)/K$, then $n(h/m)^{K}=O(n^{1-\alpha K})=O(n^{-R})$. $\blacksquare$

Bibliography


[1] K.T. Alligood, T.D. Sauer, and J.A. Yorke. Chaos: An Introduction to Dynamical Systems. Springer, 1997.
[2] Andrea Arnold, Daniela Calvetti, and Erkki Somersalo. Linear multistep methods, particle filtering and sequential Monte Carlo. Inverse Problems, 29(8):085007, 2013.
[3] Yonathan Bard. Nonlinear parameter estimation. Academic Press [A subsidiary of Harcourt Brace Jovanovich, Publishers], New York-London, 1974.
[4] Alexandros Beskos, Ajay Jasra, Kody Law, Raul Tempone, and Yan Zhou. Multilevel sequential Monte Carlo samplers. Stochastic Processes and their Applications, 2016.
[5] P. Billingsley. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 1995.
[6] D.A. Campbell. Bayesian Collocation Tempering and Generalized Profiling for Estimation of Parameters from Differential Equation Models. Canadian theses. McGill University (Canada), 2007.
[7] Julyan H. E. Cartwright and Oreste Piro. The dynamics of Runge-Kutta methods. Int. J. Bifurcation and Chaos, 2:427–49, 1992.
[8] Carlos M. Carvalho, Michael Johannes, Hedibert F. Lopes, and Nicholas Polson. Particle learning and smoothing. Statistical Science, pages 88–106, 2010.