Analysis of a nonlinear importance sampling scheme for Bayesian   parameter estimation in state-space models

Joaquín Míguez; Inés P. Mariño; Manuel A. Vázquez

arXiv:1702.03146·stat.CO·February 13, 2017·Signal Process.

Open Access

TL;DR

This paper provides a rigorous convergence analysis of a nonlinear importance sampling scheme for Bayesian parameter estimation in state-space models, demonstrating optimal convergence rates even with approximate importance weights.

Contribution

It proves convergence of the nonlinear population Monte Carlo method when the importance weights can only be approximated, including an explicit almost-sure error rate and the exact approximation property.

Findings

1. Convergence of the NPMC scheme is almost sure with rate $M^{-1/2}$.

2. The scheme achieves optimal Monte Carlo convergence despite constant mean error in importance weights.

3. Simulation confirms theoretical convergence properties in a target tracking model.

Equations

$$\pi_{n,\theta}(A)=\int_{A}\pi_{n,\theta}({\sf d}{\bf x})$$

$$\xi_{n,\theta}(A)=\int_{A}\xi_{n,\theta}({\sf d}{\bf x})$$

$$\pi_{0,\theta}^{N}({\sf d}{\bf x}_{0})=\frac{1}{N}\sum_{i=1}^{N}\delta_{{\bf x}_{0}^{i}}({\sf d}{\bf x}_{0}),$$

$$u_{n}^{i}=\frac{\tilde{u}_{n}^{i}}{\sum_{j=1}^{N}\tilde{u}_{n}^{j}},\quad i=1,...,N.$$

$$(f,\pi)=\int f({\bf x})\pi({\sf d}{\bf x})$$

$$\lim_{N\rightarrow\infty}(f,\pi_{n,\theta}^{N})=(f,\pi_{n,\theta})$$

$$||(f,\pi_{n,\theta}^{N})-(f,\pi_{n,\theta})||_{p}\leq\frac{C_{n}||f||_{\infty}}{\sqrt{N}}$$

$$(f,\pi_{n,\theta}^{N})=\int f({\bf x}_{n})\pi_{n,\theta}^{N}({\sf d}{\bf x}_{n})=\frac{1}{N}\sum_{i=1}^{N}f({\bf x}_{n}^{i}).$$

$$\xi_{n,\theta}^{N}({\sf d}{\bf x}_{n})=\frac{1}{N}\sum_{i=1}^{N}\delta_{\tilde{\bf x}_{n}^{i}}({\sf d}{\bf x}_{n}).$$

$$\ell({\bf y}|\theta)=\prod_{k=1}^{n}(l_{k,\theta}({\bf y}_{k}|\cdot),\xi_{k,\theta}),$$

$$(l_{k,\theta}({\bf y}_{k}|\cdot),\xi_{k,\theta})=\int_{\mathcal{X}}l_{k,\theta}({\bf y}_{k}|{\bf x}_{k})\xi_{k,\theta}({\sf d}{\bf x}_{k}).$$

$$\ell^{N}({\bf y}|\theta)=\prod_{k=1}^{n}(l_{k,\theta}({\bf y}_{k}|\cdot),\xi_{k,\theta}^{N})$$

$$p(\theta|{\bf y})\propto\ell({\bf y}|\theta)p_{0}(\theta)$$

$$\mu_{k}=\sum_{i=1}^{M}{\sf w}_{k-1}^{i}\theta_{k-1}^{i}\quad\mbox{and}\quad\Sigma_{k}=\sum_{i=1}^{M}{\sf w}_{k-1}^{i}(\theta_{k-1}^{i}-\mu_{k})(\theta_{k-1}^{i}-\mu_{k})^{\top}.$$

$$\hat{\sf w}_{k}^{j}={\mathcal{T}}_{M}\left(j,\{\tilde{\sf w}_{k}^{l}\}_{l=1}^{M}\right)=\left\{\begin{array}{ll}\tilde{\sf w}_{k}^{i_{M_{c}}},&\mbox{if }\tilde{\sf w}_{k}^{j}\geq\tilde{\sf w}_{k}^{i_{M_{c}}},\\ \tilde{\sf w}_{k}^{j},&\mbox{if }\tilde{\sf w}_{k}^{j}<\tilde{\sf w}_{k}^{i_{M_{c}}}.\end{array}\right.$$

$$\hat{\theta}_{*}=\int_{\sf S}\theta\mu_{\bf y}({\sf d}\theta),$$

$$\mbox{MSE}(\hat{\theta})=\int_{\sf S}\|\theta-\hat{\theta}\|^{2}\mu_{\bf y}({\sf d}\theta).$$

$$\mu_{{\bf y},k}^{M}({\sf d}\theta)=\sum_{i=1}^{M}{\sf w}_{k}^{i}\delta_{\theta_{k}^{i}}({\sf d}\theta),$$

$$\mbox{MSE}(\hat{\theta}_{k}^{M})=\sum_{i=1}^{M}{\sf w}_{k}^{i}\|\theta_{k}^{i}-\hat{\theta}_{k}^{M}\|^{2}.$$

$$\ell(\theta)\triangleq\ell({\bf y}|\theta)\quad\mbox{and}\quad\ell^{N}(\theta)\triangleq\ell^{N}({\bf y}|\theta).$$

$$\tilde{\sf w}^{i}=g^{N}(\theta^{i})\triangleq\frac{\ell^{N}(\theta^{i})p_{0}(\theta^{i})}{q(\theta^{i})},$$

$$\hat{\sf w}^{i}=[{\mathcal{T}}^{M}\circ g^{N}](\theta^{i}),$$

$$\|l\|_{\infty}=\sup_{n\geq 1,\,{\bf x}_{n}\in{\mathcal{X}},\,\theta\in{\sf S}}l_{n,\theta}({\bf y}_{n}|{\bf x}_{n})<\infty.$$

$$\left\|\frac{p_{0}}{q}\right\|_{\infty}=\sup_{\theta\in{\sf S}}\frac{p_{0}(\theta)}{q(\theta)}<\infty.$$

$$\max\{\ell(\theta),\ell^{N}(\theta)\}\leq\|l\|_{\infty}^{R}<\infty\quad\mbox{and}\quad E[\ell^{N}(\theta)]=\ell(\theta)$$

$$(f,\mu)\triangleq\int_{\sf S}f(\theta)\mu({\sf d}\theta),$$

$$(f,\mu)\approx(f,\mu^{M})=\sum_{i=1}^{M}f(\theta^{i}){\sf w}^{i},$$

$$|(f,\mu^{M})-(f,\mu)|\leq\frac{V_{f,\epsilon}}{M^{\frac{1}{2}-\epsilon}}.$$

$$(f,\mu)=\frac{(fg,q)}{(g,q)}$$

$$(f,\mu^{M})=\frac{(f[{\mathcal{T}}^{M}\circ g^{N}],q^{M})}{({\mathcal{T}}^{M}\circ g^{N},q^{M})}$$

Taxonomy

Topics: Target Tracking and Data Fusion in Sensor Networks · Distributed Sensor Networks and Detection Algorithms · Fault Detection and Control Systems

Full text

Analysis of a nonlinear importance sampling scheme for Bayesian parameter estimation in state-space models

Joaquín Míguez†

Inés P. Mariño⋆

Manuel A. Vázquez†

†Department of Signal Theory & Communications, Universidad Carlos III de Madrid. Avenida de la Universidad 30, 28911 Leganés, Madrid, Spain.

⋆Department of Biology and Geology, Physics and Inorganic Chemistry, Universidad Rey Juan Carlos. C/ Tulipán s/n, 28933 Móstoles, Madrid, Spain.

Abstract

The Bayesian estimation of the unknown parameters of state-space (dynamical) systems has received considerable attention over the past decade, with a handful of powerful algorithms being introduced. In this paper we tackle the theoretical analysis of the recently proposed nonlinear population Monte Carlo (NPMC). This is an iterative importance sampling scheme whose key features, compared to conventional importance samplers, are (i) the approximate computation of the importance weights (IWs) assigned to the Monte Carlo samples and (ii) the nonlinear transformation of these IWs in order to prevent the degeneracy problem that flaws the performance of conventional importance samplers. The contribution of the present paper is a rigorous proof of convergence of the nonlinear IS (NIS) scheme as the number of Monte Carlo samples, $M$ , increases. Our analysis reveals that the NIS approximation errors converge to 0 almost surely and with the optimal Monte Carlo rate of $M^{-\frac{1}{2}}$ . Moreover, we prove that this is achieved even when the mean estimation error of the IWs remains constant, a property that has been termed exact approximation in the Markov chain Monte Carlo literature. We illustrate these theoretical results by means of a computer simulation example involving the estimation of the parameters of a state-space model typically used for target tracking.

keywords:

Importance sampling; population Monte Carlo; state space models; Bayesian inference; adaptive importance sampling; parameter estimation.

1 Introduction

The estimation of the static unknown parameters of state-space dynamic models is a classical problem in statistical signal processing [1, 2, 3, 4, 5, 6] which has also received considerable attention, very recently, from the computational statistics community [7, 8, 9] (see also [10] for a recent survey) partly because of the ubiquity of the problem in science and engineering and partly because of the availability of more powerful computational resources to address it.

The particle Markov chain Monte Carlo (pMCMC) method originally proposed in [7] has been rapidly adopted by researchers in signal processing [11, 12, 6, 13, 14]. This is a Markov chain Monte Carlo (MCMC) algorithm [15] where the target probability density function (pdf) is the posterior density of the unknown parameters conditional on the available observations. This pdf is analytically intractable and, hence, it is approximated (for each element of the chain) via particle filtering [16, 17, 18, 19, 20]. The most popular MCMC schemes (including Metropolis and Metropolis-Hastings algorithms) admit a pMCMC implementation. A key feature of these methods is that they have the so-called exact approximation property. This means that, even if the acceptance test of the MCMC algorithm is only approximate (since the true target pdf is intractable), the stationary distribution of the Markov chain is still the actual posterior density of the parameters. While popular, pMCMC procedures suffer from the same limitations as regular MCMC schemes [15, 21]:

1. Convergence of the chain is purely asymptotic (no convergence rates are known) and potentially very slow (a problem made worse by the particle approximation).

2. The Monte Carlo samples in the chain are correlated, which reduces the accuracy of estimators compared to methods that produce independent samples.

3. If the target pdf is multimodal, MCMC algorithms may get trapped in local maxima of the function.

An alternative to pMCMC methods is to employ schemes based on importance sampling (IS) [21]. This class of techniques includes population Monte Carlo (PMC) [22], the sequential Monte Carlo square (SMC$^{2}$) of [23] or the nested particle filter of [9]. PMC is an iterative IS scheme in which the proposal functions used to generate Monte Carlo samples (and, hence, to approximate the posterior probability distribution of the unknown parameters) are improved across the iterations of the algorithm. See [24, 25, 26, 27] for recent applications, and new developments, of this methodology in statistical signal processing. SMC$^{2}$ is a generalisation of the iterative batch importance sampling (IBIS) algorithm of [28]. It mimics the standard particle filter, but the Monte Carlo samples are drawn from the space of the (static) parameters and they are sequentially updated using a pMCMC kernel. All these methods, including SMC$^{2}$, are batch, meaning that the whole record of observations is typically processed many times. A purely recursive version of the SMC$^{2}$ algorithm has been proposed in [9]. The reduction in computational complexity, however, is obtained at the expense of a reduction in the convergence rate of the algorithm. It is worth mentioning that all these techniques (including pMCMC) can be fit within the theoretical framework of sequential Monte Carlo samplers introduced in [29].

The key feature of IS-based methods is that the Monte Carlo samples (used to approximate the target distribution) are generated from almost-arbitrary proposal functions and then assigned importance weights (IWs). While this is a very flexible approach, it suffers from the well-known problem of degeneracy of IWs [30, 18, 21, 8]: when the target pdf is concentrated in a very small region of the space of the unknowns, the largest IW tends to be orders of magnitude greater than all other IWs. As a result the IS-based scheme practically yields a degenerate one-sample approximation.
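The degeneracy effect described above can be reproduced numerically. The following is a minimal sketch under assumptions that are not part of the paper: a scalar zero-mean Gaussian target, a zero-mean Gaussian proposal with standard deviation 5, and the hypothetical helper name `max_normalised_weight`. It reports the largest normalised IW for a wide and for a very narrow target.

```python
import numpy as np


def max_normalised_weight(target_std, proposal_std=5.0, num_samples=1000, seed=0):
    """Largest normalised IW when a N(0, target_std^2) target is sampled
    from a N(0, proposal_std^2) proposal (illustrative setup only)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, proposal_std, size=num_samples)
    # log(target / proposal) up to additive constants, which cancel on normalisation
    log_w = -0.5 * (theta / target_std) ** 2 + 0.5 * (theta / proposal_std) ** 2
    log_w -= log_w.max()              # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return w.max()


wide = max_normalised_weight(target_std=1.0)
narrow = max_normalised_weight(target_std=1e-3)
print(f"largest normalised IW, wide target:   {wide:.4f}")
print(f"largest normalised IW, narrow target: {narrow:.4f}")
```

When the target is concentrated in a tiny region, a single sample is expected to carry most of the total weight, which is the one-sample approximation mentioned in the text.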

In this paper we address the analysis of the nonlinear population Monte Carlo (NPMC) algorithm proposed in [8]. In the latter scheme, the IWs undergo a nonlinear transformation to control their variance and, in this way, mitigate the degeneracy problem. In [8] it was proved that the approximation of the target distribution produced at each iteration of the NPMC method converges asymptotically, with the number of Monte Carlo samples $M$, and almost surely (a.s.). Therefore, the weight transformation preserves asymptotic convergence, while it has been shown through numerical examples that performance for finite $M$ is consistently improved compared to conventional PMC procedures. The analysis in [8], however,

1. relies on the exact computation of the IWs, which is not feasible for general state-space models,

2. and does not provide explicit convergence rates (error rates are found in [8] for convergence in probability, not for almost sure convergence, when the IWs are computed exactly).

In this paper we analyse the performance of NPMC methods for the Bayesian estimation of the unknown parameters of state space models. Based on some unbiasedness properties of particle filters, we prove that IS with nonlinearly-transformed IWs also yields asymptotic convergence when the weights are approximate, i.e., computed via a particle filter with a fixed computational budget that introduces non-vanishing errors. In other words, we prove that the nonlinear importance sampler enjoys the same exact approximation property as pMCMC and SMC$^{2}$ algorithms. Moreover, the analysis of this paper also extends considerably the results of [8] by obtaining an explicit (and almost sure) estimation error rate of order $M^{-\frac{1}{2}+\epsilon}$, where $\epsilon>0$ is an arbitrarily small constant. This result holds for approximate weights and under mild assumptions typical of classic IS analyses. It is worth mentioning that the analytical approach developed in this paper can be applied, in a rather natural way, to the study of recently proposed PMC-like algorithms [25, 31] when the target distribution is the posterior density of the parameters of a state space model.

The rest of the paper is organised as follows. The necessary background material, including notation, state-space models and particle filters, is presented in Section 2. The nonlinear IS scheme and its iterative implementation (the NPMC algorithm) are detailed in Section 3 for the case in which the target probability distribution is the posterior distribution of the unknown parameters of a state-space model. In Section 4 we introduce the new analytical results on the convergence of nonlinear importance samplers, which is the main contribution of the paper. We illustrate the exact approximation property, and numerically compare the NPMC algorithm with a pMCMC scheme through computer simulations for a target tracking model in Section 5. Finally, some brief concluding remarks are made in Section 6.

2 Background and problem statement

2.1 State-space model

A Markov state-space model consists of two sequences of random variables (r.v.’s), $\{{\bf x}_{n}\}_{n\geq 0}$ and $\{{\bf y}_{n}\}_{n\geq 1}$ . The first sequence, $\{{\bf x}_{n}\}$ , is termed the system state. We assume it takes values on some space ${\mathcal{X}}\subseteq\mathbb{R}^{d_{x}}$ , hence ${\bf x}_{n}$ is a random $d_{x}\times 1$ vector. The state dynamics are described by a prior probability measure ${\mathcal{K}}_{0}({\sf d}{\bf x}_{0})$ and a sequence of Markov kernels ${\mathcal{K}}_{n,\theta}({\sf d}{\bf x}_{n}|{\bf x}_{n-1})$ that depend on a parameter vector $\theta\in{\sf S}\subset\mathbb{R}^{d_{\theta}}$ . In this paper, $\theta$ is assumed unknown and modelled as a random vector, with prior pdf $p_{0}(\theta)$ with respect to (w.r.t.) the Lebesgue measure. The support set of the parameter vector, ${\sf S}$ , is assumed to be compact.

The state ${\bf x}_{n}$ cannot be observed directly. Instead, some noisy observations ${\bf y}_{n}\in\mathcal{Y}\subseteq\mathbb{R}^{d_{y}}$ , $n=1,2,\ldots$ , are collected. We note that ${\bf y}_{n}$ is a $d_{y}\times 1$ vector, with $d_{y}\neq d_{x}$ in general.

We assume that the observations are conditionally independent given the system states and the parameter vector $\theta$ , with a conditional pdf w.r.t. the Lebesgue measure, denoted $l_{n,\theta}({\bf y}_{n}|{\bf x}_{n})>0$ , which depends on the parameter vector $\theta$ as well.

2.2 The optimal filter and its Monte Carlo approximation

Let ${\bf y}_{1:n}=\{{\bf y}_{1},\ldots,{\bf y}_{n}\}$ denote the sequence of observations collected up to time $n$. The posterior probability measure of the state ${\bf x}_{n}$ conditional on the observations ${\bf y}_{1:n}$ and the parameter vector $\theta$ is denoted $\pi_{n,\theta}$, i.e., for any Borel set $A\subset{\mathcal{X}}$,

$$\pi_{n,\theta}(A)=\int_{A}\pi_{n,\theta}({\sf d}{\bf x})$$

is the posterior probability of the event “ ${\bf x}_{n}\in A$ ”, given $\theta$ and ${\bf y}_{1:n}$ .

Similarly, $\xi_{n,\theta}$ denotes the posterior probability measure of ${\bf x}_{n}$ conditional on $\theta$ and ${\bf y}_{1:n-1}$ (i.e., not including ${\bf y}_{n}$ ). This is often referred to as the one-step-ahead predictive measure ([32], Chapter 10). For a Borel set $A\subset{\mathcal{X}}$ ,

$$\xi_{n,\theta}(A)=\int_{A}\xi_{n,\theta}({\sf d}{\bf x})$$

is the posterior probability of the event “ ${\bf x}_{n}\in A$ ”, given $\theta$ and ${\bf y}_{1:n-1}$ .

We refer to $\pi_{n,\theta}$ as the optimal filter conditional on the parameter vector $\theta$. It is not possible, in general, to obtain either $\pi_{n,\theta}$ or $\xi_{n,\theta}$ in closed-form (with the notable exception of linear-Gaussian state space models, for which $\pi_{n,\theta}$ and $\xi_{n,\theta}$ are computed recursively and exactly using the Kalman filter [33]) and, therefore, numerical approximation algorithms are needed. One of the most popular schemes is the standard particle filter, also known as bootstrap filter (BF) [16, 34, 18].

The BF with $N$ particles (i.e., Monte Carlo samples on the state space ${\mathcal{X}}$ ) conditional on a given parameter vector $\theta$ can be briefly outlined as follows.

1. Initialisation. Draw $N$ samples ${\bf x}_{0}^{1},\ldots,{\bf x}_{0}^{N}$ from the prior distribution ${\mathcal{K}}_{0}({\sf d}{\bf x}_{0})$. The particle approximation of $\pi_{0,\theta}({\sf d}{\bf x}_{0})\equiv{\mathcal{K}}_{0}({\sf d}{\bf x}_{0})$ is

$$\pi_{0,\theta}^{N}({\sf d}{\bf x}_{0})=\frac{1}{N}\sum_{i=1}^{N}\delta_{{\bf x}_{0}^{i}}({\sf d}{\bf x}_{0}),$$

where $\delta_{{\bf x}_{0}^{i}}$ denotes the Dirac delta measure centred at ${\bf x}_{0}^{i}\in{\mathcal{X}}$.

2. Recursive step. Given the approximation $\pi_{n-1,\theta}^{N}({\sf d}{\bf x}_{n-1})=\frac{1}{N}\sum_{i=1}^{N}\delta_{{\bf x}_{n-1}^{i}}({\sf d}{\bf x}_{n-1})$, take the following steps:

(a) Randomly propagate each particle using the Markov kernel in the model, i.e., draw $\tilde{\bf x}_{n}^{i}$ from ${\mathcal{K}}_{n,\theta}({\sf d}{\bf x}_{n}|{\bf x}_{n-1}^{i})$, $i=1,...,N$.

(b) Compute IWs, $\tilde{u}_{n}^{i}=l_{n,\theta}({\bf y}_{n}|\tilde{\bf x}_{n}^{i})$, for $i=1,...,N$, and

(c) normalise them as

$$u_{n}^{i}=\frac{\tilde{u}_{n}^{i}}{\sum_{j=1}^{N}\tilde{u}_{n}^{j}},\quad i=1,...,N.$$

(d) Resample: draw $N$ times independently from the discrete distribution $\tilde{\pi}_{n,\theta}^{N}({\sf d}{\bf x}_{n})=\sum_{i=1}^{N}u_{n}^{i}\delta_{\tilde{\bf x}_{n}^{i}}({\sf d}{\bf x}_{n})$ and denote the resulting samples as $\{{\bf x}_{n}^{i}\}_{i=1}^{N}$. Construct the unweighted approximation $\pi_{n,\theta}^{N}({\sf d}{\bf x}_{n})=\frac{1}{N}\sum_{i=1}^{N}\delta_{{\bf x}_{n}^{i}}({\sf d}{\bf x}_{n})$.
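The recursive step of the BF can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `bootstrap_step`, the callables `propagate` and `likelihood`, and the scalar linear-Gaussian toy model used to exercise it are all assumptions introduced here.

```python
import numpy as np


def bootstrap_step(particles, y_n, propagate, likelihood, rng):
    """One recursive step, (a) to (d), of the bootstrap filter with multinomial resampling."""
    n_particles = particles.shape[0]
    proposed = propagate(particles, rng)                   # (a) draw from the Markov kernel
    u_tilde = likelihood(y_n, proposed)                    # (b) non-normalised IWs
    u = u_tilde / u_tilde.sum()                            # (c) normalised IWs
    idx = rng.choice(n_particles, size=n_particles, p=u)   # (d) multinomial resampling
    return proposed[idx], proposed, u


# Hypothetical toy model (not from the paper):
#   x_n = 0.9 x_{n-1} + N(0, 1),   y_n = x_n + N(0, 0.5^2).
def propagate(x, rng):
    return 0.9 * x + rng.normal(0.0, 1.0, size=x.shape)


def likelihood(y_n, x):
    return np.exp(-0.5 * ((y_n - x) / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(0)
x_prev = rng.normal(0.0, 1.0, size=500)    # particles approximating the previous filter
x_new, x_pred, u = bootstrap_step(x_prev, 0.3, propagate, likelihood, rng)
print("filter mean estimate:", x_new.mean())
```

The function also returns the propagated particles, since these are the samples that form the approximation of the predictive measure.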

The resampling step (d) above can be implemented in a number of different ways (see, e.g., [35, 32] or [20] for a brief survey of methods). Here, for simplicity, we have adopted a scheme which is often referred to as multinomial resampling [18, 35] but most asymptotic convergence results hold true for several other schemes as well [36, 32]. The measure-valued r.v. $\pi_{n,\theta}^{N}$ is an approximation of the optimal filter $\pi_{n,\theta}$ (conditional on $\theta$ ). Let us use the shorthand

$$(f,\pi)=\int f({\bf x})\pi({\sf d}{\bf x})$$

for the integral of a real function $f:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ w.r.t. a measure $\pi$ . Under very mild assumptions it can be shown that

$$\lim_{N\rightarrow\infty}(f,\pi_{n,\theta}^{N})=(f,\pi_{n,\theta})$$

almost surely (a.s.) for any bounded function $f:{\mathcal{X}}\rightarrow{\mathbb{R}}$ [36, 32]. Moreover, if $||f||_{\infty}=\sup_{{\bf x}\in{\mathcal{X}}}|f({\bf x})|$ denotes the supremum norm of $f$, $E[Z]$ denotes the expected value of a r.v. $Z$ and $||Z||_{p}=(E[|Z|^{p}])^{\frac{1}{p}}$ is its $L_{p}$ norm ($p\geq 1$), then it can be proved [37] that

$$||(f,\pi_{n,\theta}^{N})-(f,\pi_{n,\theta})||_{p}\leq\frac{C_{n}||f||_{\infty}}{\sqrt{N}}$$

where $C_{n}$ is a constant independent of $N$ and

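$$(f,\pi_{n,\theta}^{N})=\int f({\bf x}_{n})\pi_{n,\theta}^{N}({\sf d}{\bf x}_{n})=\frac{1}{N}\sum_{i=1}^{N}f({\bf x}_{n}^{i}).$$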

The algorithm also produces a Monte Carlo approximation of the predictive measure $\xi_{n,\theta}$ , namely

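$$\xi_{n,\theta}^{N}({\sf d}{\bf x}_{n})=\frac{1}{N}\sum_{i=1}^{N}\delta_{\tilde{\bf x}_{n}^{i}}({\sf d}{\bf x}_{n}).$$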

If we write ${\bf y}={\bf y}_{1:n}$ for the complete sequence of observations up to time $n$, it turns out that the conditional pdf of ${\bf y}$ given the parameter vector $\theta$, denoted $\ell({\bf y}|\theta)$, can be written in terms of integrals w.r.t. the predictive measures $\xi_{k,\theta}$, $k=1,\ldots,n$. To be specific,

$$\ell({\bf y}|\theta)=\prod_{k=1}^{n}(l_{k,\theta}({\bf y}_{k}|\cdot),\xi_{k,\theta}),$$

where

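$$(l_{k,\theta}({\bf y}_{k}|\cdot),\xi_{k,\theta})=\int_{\mathcal{X}}l_{k,\theta}({\bf y}_{k}|{\bf x}_{k})\xi_{k,\theta}({\sf d}{\bf x}_{k}).$$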

The conditional pdf $\ell({\bf y}|\theta)$ is the likelihood of the parameter vector $\theta$ given the available data ${\bf y}$ and the BF yields the straightforward estimator

$$\ell^{N}({\bf y}|\theta)=\prod_{k=1}^{n}(l_{k,\theta}({\bf y}_{k}|\cdot),\xi_{k,\theta}^{N})\tag{12}$$

which can be shown to be unbiased (i.e., $E[\ell^{N}({\bf y}|\theta)]=\ell({\bf y}|\theta)$ ) under very mild assumptions ([36], Theorem 7.4.2).
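As an illustration of the estimator $\ell^{N}({\bf y}|\theta)$, the sketch below runs a BF and accumulates the logarithm of the product of the averaged non-normalised weights. Everything specific in it is an assumption made for the example and is not taken from the paper: the function name `bf_log_likelihood`, the scalar autoregressive toy model, the noise levels and the synthetic data. The computation is carried out in the log domain for numerical stability.

```python
import numpy as np


def bf_log_likelihood(y, theta, n_particles, rng):
    """Log of the bootstrap-filter likelihood estimate for a hypothetical toy model:
       x_0 ~ N(0, 1),  x_n = theta * x_{n-1} + N(0, 1),  y_n = x_n + N(0, 0.5^2)."""
    obs_std = 0.5
    x = rng.normal(0.0, 1.0, size=n_particles)
    log_lik = 0.0
    for y_n in y:
        # Propagated particles: these form the approximation of the predictive measure.
        x_pred = theta * x + rng.normal(0.0, 1.0, size=n_particles)
        log_u = (-0.5 * ((y_n - x_pred) / obs_std) ** 2
                 - np.log(obs_std * np.sqrt(2.0 * np.pi)))
        shift = log_u.max()                      # avoid underflow in exp
        u = np.exp(log_u - shift)
        log_lik += shift + np.log(u.mean())      # log of the k-th factor of the product
        # Multinomial resampling with the normalised weights.
        x = x_pred[rng.choice(n_particles, size=n_particles, p=u / u.sum())]
    return log_lik


# Synthetic data from the same toy model with theta = 0.8 (illustrative only).
data_rng = np.random.default_rng(0)
x_true, y = data_rng.normal(), []
for _ in range(30):
    x_true = 0.8 * x_true + data_rng.normal()
    y.append(x_true + 0.5 * data_rng.normal())
y = np.array(y)

log_l = bf_log_likelihood(y, 0.8, n_particles=200, rng=np.random.default_rng(1))
print("log-likelihood estimate at theta = 0.8:", log_l)
```

Note that unbiasedness refers to $\ell^{N}({\bf y}|\theta)$ itself; its logarithm, which the sketch returns for numerical convenience, is not an unbiased estimator of $\log\ell({\bf y}|\theta)$.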

2.3 Problem statement

Let ${\bf y}=\{{\bf y}_{1},\ldots,{\bf y}_{R}\}$ be the available data set, with $R<\infty$. Our goal is to approximate the probability measure associated to the posterior pdf of the parameter vector, $\theta$, given the data, ${\bf y}$. We denote this pdf as $p(\theta|{\bf y})$ and it is straightforward to show, using Bayes’ theorem, that

$$p(\theta|{\bf y})\propto\ell({\bf y}|\theta)p_{0}(\theta)$$

where, we recall, $p_{0}(\theta)$ is the prior pdf of $\theta$ .

In the next section, we describe an iterative importance sampling algorithm, originally introduced in [8], for the approximation of $p(\theta|{\bf y}){\sf d}\theta$ .

3 Algorithm

The NPMC algorithm of [8] is an iterative importance sampling (IS) scheme that seeks to approximate a target probability distribution, in our case given by the posterior pdf $p(\theta|{\bf y})$ , using weighted Monte Carlo samples. It generates a sequence of proposal pdf’s $q_{k}(\theta)$ , $k=1,\ldots,K$ , from which samples can be drawn and importance weights (IWs) can be computed. This sequence of proposals is expected to yield increasingly better approximations of the target as the algorithm converges. The key feature of the NPMC method, which departs from the classical PMC technique of [22], is to compute a set of transformed importance weights (TIWs) by applying a nonlinear function to the standard IWs. The aim of this transformation is to mitigate the well-known problem of the degeneracy of the IWs (common to many IS methods, see [18, 8]) by controlling the weight variability.

For the case of general state space models, an additional difficulty encountered when trying to estimate the unknown model parameters (denoted $\theta$ in our setup) is that the likelihood $\ell({\bf y}|\theta)$ is intractable. In the last few years, though, it has become a common approach to approximate this likelihood via particle filtering (PF) (see, e.g., [8, 7, 38, 23]). To be specific, we let $\ell^{N}({\bf y}|\theta)$ stand for the approximation of $\ell({\bf y}|\theta)$ computed using a standard bootstrap filter (BF) [16, 39] with $N$ particles (see equation (12) in Section 2.2). One key feature of this approach, that we exploit for our analysis in Section 4, is that $\ell^{N}({\bf y}|\theta)$ can be proved to be an unbiased estimator of $\ell({\bf y}|\theta)$ [36, 40].

The NPMC algorithm applied to a state space model, with $K$ iterations, $M$ Monte Carlo samples per iteration, plain Gaussian proposals $\{q_{k}\}_{k\geq 1}$ , and approximate likelihoods is outlined below.

Initialisation. Draw $M$ i.i.d. samples $\theta_{0}^{1},\theta_{0}^{2},\ldots,\theta_{0}^{M}$ from the prior pdf $p_{0}(\theta)$. Then,

1. compute non-normalised IWs $\tilde{\sf w}_{0}^{i}\propto\ell^{N}({\bf y}|\theta_{0}^{i})$, $i=1,...,M$,

2. compute TIWs as $\hat{\sf w}_{0}^{i}={\mathcal{T}}_{M}\left(i,\{\tilde{\sf w}_{0}^{j}\}_{j=1}^{M}\right)$, where ${\mathcal{T}}_{M}:\{1,\ldots,M\}\times\{\tilde{\sf w}_{0}^{j}\}_{j=1}^{M}\rightarrow[0,+\infty)$ is a nonlinear transformation, and

3. normalise the TIWs, ${\sf w}_{0}^{i}=\frac{\hat{\sf w}_{0}^{i}}{\sum_{j=1}^{M}\hat{\sf w}_{0}^{j}}$, $i=1,...,M$.

Iteration. For $k=1,\ldots,K$, take the following steps:

1. Let $q_{k}(\theta)={\cal N}(\theta|\mu_{k},\Sigma_{k})$ be a multivariate Gaussian pdf with mean vector and covariance matrix obtained, respectively, as

$$\mu_{k}=\sum_{i=1}^{M}{\sf w}_{k-1}^{i}\theta_{k-1}^{i}\quad\mbox{and}\quad\Sigma_{k}=\sum_{i=1}^{M}{\sf w}_{k-1}^{i}(\theta_{k-1}^{i}-\mu_{k})(\theta_{k-1}^{i}-\mu_{k})^{\top}.$$

Note that the random variates $\theta_{k-1}^{i}$, $i=1,...,M$, are $d_{\theta}\times 1$ vectors. The superscript ⊤ denotes transposition.

2. Draw i.i.d. samples $\theta_{k}^{i}$, $i=1,...,M$, from $q_{k}(\theta)$.

3. Compute IWs, $\tilde{\sf w}_{k}^{i}=\frac{\ell^{N}({\bf y}|\theta_{k}^{i})p_{0}(\theta_{k}^{i})}{q_{k}(\theta_{k}^{i})}$, $i=1,...,M$.

4. Compute TIWs, $\hat{\sf w}_{k}^{i}={\mathcal{T}}_{M}\left(i,\{\tilde{\sf w}_{k}^{j}\}_{j=1}^{M}\right)$, $i=1,...,M$, using the same nonlinear map as for $k=0$.

5. Normalise the TIWs, ${\sf w}_{k}^{i}=\frac{\hat{\sf w}_{k}^{i}}{\sum_{j=1}^{M}\hat{\sf w}_{k}^{j}}$, $i=1,...,M$.

Following [8], the nonlinear map ${\mathcal{T}}_{M}$ of choice is a “clipping” transformation. In particular, let $i_{1},i_{2},...,i_{M}$ be a permutation of the indices $1,2,...,M$ such that the IWs become ordered, namely $\tilde{\sf w}_{k}^{i_{1}}\geq\tilde{\sf w}_{k}^{i_{2}}\geq\cdots\geq\tilde{\sf w}_{k}^{i_{M}}$ . The clipping transformation ${\mathcal{T}}_{M}$ , with parameter $1\leq M_{c}\leq\sqrt{M}$ , flattens the $M_{c}$ largest IWs and makes them equal to the $M_{c}$ -th non-normalised IW, $\tilde{\sf w}_{k}^{i_{M_{c}}}$ . Specifically, for each $j=1,...,M$ , we obtain

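$$\hat{\sf w}_{k}^{j}={\mathcal{T}}_{M}\left(j,\{\tilde{\sf w}_{k}^{l}\}_{l=1}^{M}\right)=\left\{\begin{array}{ll}\tilde{\sf w}_{k}^{i_{M_{c}}},&\mbox{if }\tilde{\sf w}_{k}^{j}\geq\tilde{\sf w}_{k}^{i_{M_{c}}},\\ \tilde{\sf w}_{k}^{j},&\mbox{if }\tilde{\sf w}_{k}^{j}<\tilde{\sf w}_{k}^{i_{M_{c}}}.\end{array}\right.\tag{15}$$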
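A minimal Python sketch of the clipping transformation may help to fix ideas. The function name `clip_weights` is hypothetical, and the sketch only checks that $M_{c}$ is a valid index; the constraint $M_{c}\leq\sqrt{M}$ stated in the text is left to the caller.

```python
import numpy as np


def clip_weights(w_tilde, m_c):
    """Flatten the m_c largest non-normalised IWs to the value of the m_c-th largest one."""
    w_tilde = np.asarray(w_tilde, dtype=float)
    if not 1 <= m_c <= w_tilde.size:
        raise ValueError("m_c must lie between 1 and the number of weights")
    threshold = np.sort(w_tilde)[::-1][m_c - 1]   # the m_c-th largest IW
    # Weights at or above the threshold become the threshold; the rest are unchanged.
    return np.minimum(w_tilde, threshold)


w_hat = clip_weights([5.0, 1.0, 3.0, 0.5], m_c=2)
w = w_hat / w_hat.sum()                            # normalised TIWs
print(w_hat)
print(w)
```

In this toy example the two largest weights, 5 and 3, are both mapped to 3, so the largest normalised weight drops from about 0.53 (without clipping) to 0.4.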

Other choices of ${\mathcal{T}}_{M}$ are possible (e.g., tempering schemes) but clipping has been found particularly effective in practice [8]. The choice of Gaussian proposals (in step 1 of the Iteration) is made merely for simplicity. Other (more efficient) possibilities exist, but we stick to this formulation as it is sufficient for the purpose of this paper.
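The NPMC iteration with Gaussian proposals and clipped weights can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `npmc`, `normalised_tiws` and `gaussian_logpdf` are hypothetical, weights are handled in the log domain, a small diagonal jitter is added to the covariance as a numerical safeguard, and the toy problem replaces the particle-filter estimate $\ell^{N}$ by an exact Gaussian likelihood so that the example stays short.

```python
import numpy as np


def gaussian_logpdf(theta, mu, sigma):
    """Log-density of N(mu, sigma) evaluated at each row of theta."""
    diff = theta - mu
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (maha + logdet + theta.shape[1] * np.log(2.0 * np.pi))


def normalised_tiws(log_w, m_c):
    """Clip the m_c largest IWs and normalise, starting from log-weights."""
    w = np.exp(log_w - log_w.max())
    w = np.minimum(w, np.sort(w)[::-1][m_c - 1])
    return w / w.sum()


def npmc(log_lik_hat, log_prior, sample_prior, n_iter, n_samples, m_c, rng):
    theta = sample_prior(n_samples, rng)                        # initialisation
    w = normalised_tiws(log_lik_hat(theta, rng), m_c)
    for _ in range(n_iter):
        mu = w @ theta                                          # step 1: proposal mean
        centred = theta - mu
        sigma = (w[:, None] * centred).T @ centred              # step 1: proposal covariance
        sigma += 1e-9 * np.eye(theta.shape[1])                  # jitter (assumption)
        theta = rng.multivariate_normal(mu, sigma, size=n_samples)   # step 2
        log_w = (log_lik_hat(theta, rng) + log_prior(theta)
                 - gaussian_logpdf(theta, mu, sigma))           # step 3
        w = normalised_tiws(log_w, m_c)                         # steps 4 and 5
    return theta, w


# Toy problem (assumption): 50 observations y_j ~ N(theta, I) in two dimensions.
rng = np.random.default_rng(1)
y = rng.normal([1.0, -2.0], 1.0, size=(50, 2))


def log_lik_hat(theta, rng):
    # Exact log-likelihood up to a constant; stand-in for the log of the BF estimate.
    return -0.5 * ((y[None, :, :] - theta[:, None, :]) ** 2).sum(axis=(1, 2))


def log_prior(theta):
    # Uniform prior on the box [-5, 5]^2, up to an additive constant.
    return np.where(np.all(np.abs(theta) <= 5.0, axis=1), 0.0, -np.inf)


def sample_prior(n, rng):
    return rng.uniform(-5.0, 5.0, size=(n, 2))


theta, w = npmc(log_lik_hat, log_prior, sample_prior,
                n_iter=5, n_samples=500, m_c=22, rng=rng)
posterior_mean = w @ theta
print("weighted posterior-mean estimate:", posterior_mean)
```

With a flat prior the posterior mean of this toy problem is close to the sample mean of the data, which gives a simple sanity check. In a state-space setting, `log_lik_hat` would instead return the logarithm of a bootstrap-filter estimate computed with a fixed number of particles.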

Given $A\subseteq{\sf S}$, where ${\sf S}$ is the support set of the parameter vector $\theta$ described in Section 2, let $\mu_{\bf y}(A)=\int_{A}p(\theta|{\bf y}){\sf d}\theta$ denote the posterior probability measure (conditional on the observed data ${\bf y}$) associated to the parameter vector $\theta$. This measure yields the full probabilistic description of $\theta$ given the available observations. If $\mu_{\bf y}$ is available, then we can compute various types of estimators and assess the associated errors. For example, the posterior-mean estimator is

$$\hat{\theta}_{*}=\int_{\sf S}\theta\mu_{\bf y}({\sf d}\theta),$$

and it minimises the mean square error (MSE). For an arbitrary estimator $\hat{\theta}$ , the MSE can also be written as an integral w.r.t. $\mu_{\bf y}({\sf d}\theta)$ , namely,

$$\mbox{MSE}(\hat{\theta})=\int_{\sf S}\|\theta-\hat{\theta}\|^{2}\mu_{\bf y}({\sf d}\theta).$$

The proposed NPMC algorithm yields a sequence of importance sampling (i.e., weighted Monte Carlo) approximations of $\mu_{\bf y}({\sf d}\theta)$ . To be specific, at each iteration $k$ we obtain the random probability measure

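$$\mu_{{\bf y},k}^{M}({\sf d}\theta)=\sum_{i=1}^{M}{\sf w}_{k}^{i}\delta_{\theta_{k}^{i}}({\sf d}\theta),$$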

where $\delta_{\theta_{k}^{i}}$ denotes the Dirac delta measure centred at $\theta_{k}^{i}$ . Using $\mu_{{\bf y},k}^{M}({\sf d}\theta)$ we can approximate any parameter estimator. For instance, $\hat{\theta}^{M}_{k}=\sum_{i=1}^{M}{\sf w}_{k}^{i}\theta_{k}^{i}$ is the approximation of the posterior mean $\hat{\theta}_{*}$ . The corresponding minimum MSE can also be approximately computed as

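$$\mbox{MSE}(\hat{\theta}_{k}^{M})=\sum_{i=1}^{M}{\sf w}_{k}^{i}\|\theta_{k}^{i}-\hat{\theta}_{k}^{M}\|^{2}.$$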
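Both quantities are plain weighted sums over the samples. A short sketch, in which the function name `weighted_estimates` and the two-sample example are hypothetical:

```python
import numpy as np


def weighted_estimates(theta, w):
    """Posterior-mean approximation and approximate minimum MSE
    from samples theta (one per row) and normalised weights w."""
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = w @ theta                                        # sum_i w^i theta^i
    mse = np.sum(w * np.sum((theta - mean) ** 2, axis=1))   # sum_i w^i ||theta^i - mean||^2
    return mean, mse


mean, mse = weighted_estimates([[0.0, 0.0], [2.0, 0.0]], [0.5, 0.5])
print(mean, mse)
```

For two equally weighted samples at $(0,0)$ and $(2,0)$ the estimate is $(1,0)$ and the approximate MSE equals 1.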

In the next section we analyse the convergence of the approximate measure $\mu_{{\bf y},k}^{M}$ as $M\rightarrow\infty$ in a single iteration (i.e., for a given $k$ ) when the number of particles $N$ used to approximate the likelihood via the BF (i.e., the estimate $\ell^{N}({\bf y}|\theta)$ of $\ell({\bf y}|\theta)$ ) is kept constant and finite.

4 Analysis

Consider a single iteration $k$ in the NPMC algorithm, with a fixed importance density $q_{k}\equiv q$. We refer to the random measure $\mu_{{\bf y},k}^{M}({\sf d}\theta)=\sum_{i=1}^{M}{\sf w}_{k}^{i}\delta_{\theta_{k}^{i}}({\sf d}\theta)$ computed via the TIWs ${\sf w}_{k}^{i}$, $i=1,...,M$, as a nonlinear importance sampling (NIS) approximation of $\mu_{\bf y}({\sf d}\theta)$. Our aim in this section is to assess whether $\mu_{{\bf y},k}^{M}({\sf d}\theta)$ converges towards the true measure $\mu_{\bf y}({\sf d}\theta)$ or not as $M\rightarrow\infty$. To do this, there are two issues that need to be handled and that make the analysis more difficult compared to a conventional IS method (which relies on the standard IWs, rather than the TIWs). These issues are:

(i)

the distortion in the Monte Carlo approximation due to the clipping of the weights, which introduces additional bias (compared to the use of standard IWs); and

(ii)

the impossibility of computing the IWs, and hence the TIWs, exactly, since the likelihood $\ell({\bf y}|\theta)$ is intractable and we work with the particle approximation $\ell^{N}({\bf y}|\theta)$ instead.

In [8] it was proved that, when the IWs can be computed exactly, the NIS approximation converges almost surely (a.s.) towards the target probability measure as $M\rightarrow\infty$, which accounts for (i) above (the analysis of [8] does not provide an error rate, though; such a rate is explicitly derived in this paper). The problem of the approximate computation of the weights was partially addressed in [41], for a relatively simple case where the errors in the IWs were assumed deterministic and bounded. However, the estimation problem studied in [41] (parameter estimation for $\alpha$-stable distributions using iid data) did not involve any dynamics and the convergence analysis only showed an upper bound for the approximation errors that included a deterministic constant, namely a non-vanishing term proportional to the approximation error of the IWs.

Here, we show stronger analytical results that ensure the almost sure convergence of the NIS approximation when $M\rightarrow\infty$ and the likelihood function can only be estimated as $\ell^{N}({\bf y}|\theta)$, i.e., using a BF with a finite and fixed number of particles $N$. Under assumptions which are standard in classical IS theory, we prove that integrals of the form $\int f(\theta)\mu_{{\bf y},k}^{M}({\sf d}\theta)$ converge towards $\int f(\theta)\mu_{\bf y}({\sf d}\theta)$ a.s. as $M\rightarrow\infty$ and provide explicit error rates.

4.1 Notation

Since we focus our attention on the NIS scheme alone, i.e., a single iteration of the proposed algorithm, in the remainder of this section we drop the iteration index $k$. Hence, we assume a fixed importance density $q(\theta)$, from which $M$ independent Monte Carlo samples, $\theta^{1},\theta^{2},\ldots,\theta^{M}$, are drawn. Since the observations ${\bf y}$ are assumed arbitrary but fixed, we drop them from the likelihood notation and write

[TABLE]

Similarly, we simplify the notation for the posterior pdf and write $p(\theta)=p(\theta|{\bf y})$ and $\mu({\sf d}\theta)=\mu_{\bf y}({\sf d}\theta)$ . Then, the non-normalised IWs are approximated as

[TABLE]

where we have introduced the weight function $g^{N}\triangleq\ell^{N}p_{0}/q$ as a shorthand. This weight function is a random approximation of the deterministic function $g=\ell p_{0}/q$ . The support of $g$ is the same as the support of $q$ , $\ell$ and $p_{0}$ , denoted ${\sf S}\subseteq\mathbb{R}^{d_{\theta}}$ . We assume that $g(\theta)>0$ for every $\theta\in{\sf S}$ as well (a standard assumption in classical IS). It is also apparent that $p\propto gq$ , where $p$ is the posterior pdf, and the proportionality constant is independent of $\theta$ .

The non-normalised TIWs computed via the clipping function (15) are denoted

[TABLE]

where $\circ$ represents function composition and we omit the index argument of (15) for conciseness (its value is clear from the notation in any case). The normalised TIWs are ${\sf w}^{i}=\frac{\hat{\sf w}^{i}}{\sum_{j=1}^{M}\hat{\sf w}^{j}}$ , and they are used to compute the approximate measure $\mu^{M}({\sf d}\theta)=\sum_{i=1}^{M}\delta_{\theta^{i}}({\sf d}\theta){\sf w}^{i}$ .
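As an illustration of the notation above, the following Python sketch computes clipped, normalised TIWs and the resulting estimate $\sum_{i=1}^{M}{\sf w}^{i}f(\theta^{i})$. It assumes that the clipping function (15) replaces the $M_{c}$ largest non-normalised weights by the value of the $M_{c}$-th largest one, as in the scheme of [8]; the function names `clip_weights` and `nis_estimate` are hypothetical and are not taken from the paper.

```python
import numpy as np

def clip_weights(g_hat, m_c):
    """Clip the m_c largest non-normalised weights to the m_c-th largest value.

    Assumed form of the clipping function; Eq. (15) is not reproduced here.
    """
    g_hat = np.asarray(g_hat, dtype=float)
    threshold = np.sort(g_hat)[-m_c]      # m_c-th largest weight
    return np.minimum(g_hat, threshold)

def nis_estimate(f_values, g_hat, m_c):
    """Self-normalised estimate sum_i w^i f(theta^i) built from clipped weights."""
    w_hat = clip_weights(g_hat, m_c)      # non-normalised TIWs
    w = w_hat / w_hat.sum()               # normalised TIWs
    return float(np.dot(w, f_values)), w

# Four samples, one of them with a dominant weight that gets clipped.
estimate, weights = nis_estimate([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 5.0, 0.3], m_c=2)
```

In this small example the dominant weight $5.0$ is reduced to $0.3$ before normalisation, so no single sample carries most of the mass of $\mu^{M}$.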

4.2 Assumptions and a preliminary result

Let the state sequence $\{{\bf x}_{n}\}_{n\geq 0}$ take values on ${\mathcal{X}}\subseteq\mathbb{R}^{d_{x}}$ . We make the following classical assumptions on the conditional pdf of the observations ${\bf y}_{n}$ , $n=1,2,\ldots,R$ , the prior density of the parameters, $p_{0}(\theta)$ , and the importance function $q(\theta)$ .

Assumption 1

The observation sequence ${\bf y}_{1:R}$ is arbitrary but fixed. The functions $l_{n}({\bf y}_{n}|\cdot):{\mathcal{X}}\rightarrow(0,\infty)$ , $n=1,2,...,R$ , are uniformly bounded, i.e., there exists a finite and positive constant $\|l\|_{\infty}$ such that

[TABLE]

Assumption 2

The ratio of pdf’s $\frac{p_{0}(\theta)}{q(\theta)}$ is bounded on ${\sf S}$ , i.e., there exists a positive and finite constant $\left\|\frac{p_{0}}{q}\right\|_{\infty}$ such that

[TABLE]

Remark 1

If the parameter support set ${\sf S}$ is compact, then A.1 and A.2 hold naturally for most models of practical interest.

The following lemma plays a key role in the asymptotic convergence analysis of the approximation $\mu^{M}({\sf d}\theta)$ . It states that $\ell^{N}(\theta)$ is an unbiased estimator of the likelihood $\ell(\theta)$ and enables us to show that the NIS scheme converges when $M\rightarrow\infty$ , even if the number of particles $N$ in the approximation $\ell^{N}(\theta)$ remains finite and constant.

Lemma 1

If Assumption 1 holds then

[TABLE]

independently of $N$ .

Proof. From the definition of $\ell(\theta)$ in Eq. (10) and its estimator $\ell^{N}(\theta)$ in Eq. (12), it is clear that both $\ell(\theta)\leq\|l\|_{\infty}^{R}$ and $\ell^{N}(\theta)\leq\|l\|_{\infty}^{R}$, where $R$ is the number of available observations. The fact that $\ell^{N}(\theta)$ is unbiased is a consequence of [36, Theorem 7.4.2] (see also [40, Lemma 2] for an alternative proof that does not rely on the Feynman-Kac framework). \qed
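The two properties used in the proof can be checked numerically on a small example. The sketch below is a bootstrap filter for a toy scalar linear-Gaussian model with unit noise variances, chosen here only for illustration (it is not the model of Section 5, and `bootstrap_likelihood` is a hypothetical name). The likelihood estimate is the product of the average particle weights, so it can never exceed $\|l\|_{\infty}^{R}=(2\pi)^{-R/2}$ in this toy model.

```python
import numpy as np

def bootstrap_likelihood(y, theta, n_particles, rng):
    """Bootstrap-filter estimate of the likelihood for a toy scalar model.

    Toy model (an assumption for illustration, not the paper's model):
        x_0 ~ N(0, 1),  x_n = theta * x_{n-1} + u_n,  u_n ~ N(0, 1),
        y_n = x_n + e_n,  e_n ~ N(0, 1).
    """
    x = rng.standard_normal(n_particles)                         # draws from the prior
    estimate = 1.0
    for y_n in y:
        x = theta * x + rng.standard_normal(n_particles)        # propagate particles
        w = np.exp(-0.5 * (y_n - x) ** 2) / np.sqrt(2 * np.pi)   # l_n(y_n | x)
        estimate *= w.mean()                                     # running product of averages
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        x = x[idx]                                               # multinomial resampling
    return estimate

rng = np.random.default_rng(0)
y = [0.3, -0.1, 0.5, 0.0, 0.2]
estimate = bootstrap_likelihood(y, theta=0.5, n_particles=50, rng=rng)
bound = (1.0 / np.sqrt(2 * np.pi)) ** len(y)                     # ||l||_inf^R for this toy model
```

For a single observation the exact likelihood of this toy model is Gaussian with variance $\theta^{2}+2$, so averaging many independent runs with a small, fixed $N$ gives a simple empirical check of unbiasedness.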

4.3 Asymptotic convergence, error rates and exact approximation

In the sequel we look into the approximation of integrals of the form

[TABLE]

where $f$ is a bounded real function on the parameter space ${\sf S}$ . We use $\|f\|_{\infty}\triangleq\sup_{\theta\in{\sf S}}|f(\theta)|<\infty$ to denote the supremum norm of a bounded function, while the set of bounded functions on ${\sf S}$ is denoted $B({\sf S})$ . The approximations of interest are

[TABLE]

for any $f\in B({\sf S})$ .

The following theorem yields an explicit upper bound for the (random) approximation error $|(f,\mu^{M})-(f,\mu)|$ . The bound is proportional to $M^{-\frac{1}{2}+\epsilon}$ (for an arbitrarily small $\epsilon>0$ ) and, therefore, it vanishes as $M\rightarrow\infty$ , independently of the number of particles $N$ used in the approximate likelihoods $\ell^{N}(\theta^{i})$ .

Theorem 1

Assume that A.1 and A.2 hold, $M_{c}\leq\sqrt{M}$ and $\int_{\sf S}\ell(\theta)p_{0}(\theta){\sf d}\theta=(\ell,p_{0})>0$ . Then, for every $\epsilon\in\left(0,\frac{1}{2}\right)$ (arbitrarily small) and every $f\in B({\sf S})$ there exists a positive and a.s. finite r.v. $V_{f,\epsilon}$ , independent of $M$ and $M_{c}$ , such that

[TABLE]

In particular, $\lim_{M\rightarrow\infty}|(f,\mu^{M})-(f,\mu)|=0$ a.s.

Proof. Recall the intractable weight function $g=\ell p_{0}/q$ and its random estimator $g^{N}=\ell^{N}p_{0}/q$ . The integral of any $f\in B({\sf S})$ w.r.t. the posterior measure $\mu({\sf d}\theta)\propto\ell(\theta)p_{0}(\theta){\sf d}\theta$ can be written as

[TABLE]

by simply noting that $g(\theta)q(\theta)=\ell(\theta)p_{0}(\theta)$ . Similarly, for the random measure $\mu^{M}({\sf d}\theta)$ we can write

[TABLE]

where $q^{M}({\sf d}\theta)=\frac{1}{M}\sum_{i=1}^{M}\delta_{\theta^{i}}({\sf d}\theta)$ is the Monte Carlo approximation of the proposal distribution (with pdf $q(\theta)$ ) and $\circ$ denotes composition of functions, hence $[{\mathcal{T}}^{M}\circ g^{N}](\theta^{i})={\mathcal{T}}^{M}(g^{N}(\theta^{i}))$ is the transformed weight associated to $\theta^{i}$ .

Given equations (29) and (30) it is straightforward to show that

[TABLE]

Since $(f,\mu^{M})\leq\|f\|_{\infty}<\infty$ and $(g,q)=(\ell,p_{0})$ , where $(\ell,p_{0})>0$ by assumption, Eq. (31) readily yields

[TABLE]

and, therefore, the problem of calculating bounds for $|(f,\mu^{M})-(f,\mu)|$ reduces to the problem of computing bounds for errors of the form

[TABLE]

for $b\in B({\sf S})$ .

Choose any $b\in B({\sf S})$ . A simple triangle inequality yields

[TABLE]

It is straightforward to obtain an upper bound for the first term on the right hand side of the inequality (34). Indeed, by construction of ${\mathcal{T}}^{M}$ (see Eq. (15)) we readily obtain

[TABLE]

where the inequality follows from the bound $g^{N}\leq\|l\|_{\infty}^{R}\left\|\frac{p_{0}}{q}\right\|_{\infty}$ , which is a straightforward consequence of assumptions A.1 and A.2 and the definition of the estimate $\ell^{N}$ produced by the BF (see Eq. (12)).

Finding a suitable bound for the second term on the right hand side of the inequality (34) takes some more effort. Choose, again, any $b\in B({\sf S})$ . A simple triangle inequality yields

[TABLE]

Since $q^{M}=\frac{1}{M}\sum_{i=1}^{M}\delta_{\theta^{i}}$ , for the second term on the right hand side of (36) we can write

[TABLE]

where the r.v.’s

[TABLE]

are independent, with zero mean (recall the $\theta^{i}$’s are i.i.d. draws from $q$) and bounded, because $b$ is bounded and A.1 and A.2 imply that $g<\|l\|_{\infty}^{R}\times\left\|\frac{p_{0}}{q}\right\|_{\infty}<\infty$. Therefore, it is an exercise in combinatorics to show that

[TABLE]

where $\tilde{c}$ is a constant independent of $M$ and $q$ . Combining (38) with (37) readily yields

[TABLE]

The inequality (39) implies that there exists an a.s. finite r.v. $\tilde{U}_{b,\epsilon}>0$ such that

[TABLE]

where $0<\epsilon<\frac{1}{2}$ is an arbitrarily small constant independent of $M$ (see [42, Lemma 4.1]).

If we expand the first term on the right hand side of (36) we arrive at

[TABLE]

where the r.v.’s $Z_{N}^{i}=\frac{b(\theta^{i})p_{0}(\theta^{i})}{q(\theta^{i})}\left(\ell^{N}(\theta^{i})-\ell(\theta^{i})\right)$, $i=1,2,...,M$, are independent (because the samples $\theta^{1},\ldots,\theta^{M}$ are independent) and zero mean, as a result of Lemma 1. (Note that $E\left[Z_{N}^{i}|\theta^{i}\right]=\frac{b(\theta^{i})p_{0}(\theta^{i})}{q(\theta^{i})}E\left[\ell^{N}(\theta^{i})-\ell(\theta^{i})\right]=0$, because $\ell^{N}(\theta^{i})$ is an unbiased estimator of $\ell(\theta^{i})$, hence $E\left[Z_{N}^{i}\right]=E\left[E\left[Z_{N}^{i}|\theta^{i}\right]\right]=0$.) Since they are also bounded, namely $|Z_{N}^{i}|\leq\|b\|_{\infty}\|l\|_{\infty}^{R}\left\|\frac{p_{0}}{q}\right\|_{\infty}$ as a consequence of A.1 and A.2, it is again an exercise to show that (41) implies

[TABLE]

in the same manner as we obtained the inequality (38). Resorting again to [42, Lemma 4.1], from (42) we deduce that there exists an a.s. finite r.v. $\bar{U}_{b,\epsilon}>0$ , independent of $M$ , such that

[TABLE]

where $0<\epsilon<\frac{1}{2}$ is an arbitrarily small constant independent of $M$ .

Taking together (36), (40) and (43) we arrive at

[TABLE]

where $U_{b,\epsilon}=\tilde{U}_{b,\epsilon}+\bar{U}_{b,\epsilon}\geq 0$ is an a.s. finite r.v. independent of $M$ , and $\epsilon\in\left(0,\frac{1}{2}\right)$ can be chosen to be arbitrarily small.

Substituting the inequalities (35) and (44) back into the relation (34) we arrive at the bound

[TABLE]

where the second inequality follows from the assumption $M_{c}\leq\sqrt{M}$ and choosing $\tilde{V}_{b,\epsilon}=2\|l\|_{\infty}^{R}\left\|\frac{p_{0}}{q}\right\|_{\infty}\|b\|_{\infty}+U_{b,\epsilon}$. Since the r.v. $U_{b,\epsilon}$ is a.s. finite, $\tilde{V}_{b,\epsilon}<\infty$ a.s. as well.

To conclude the proof, we substitute the inequality (45) twice into the relation (32). To be precise, we choose $b=f$ first and use (45) to obtain a bound for the first term on the right hand side of (32). Then, we choose $b=1$ and apply (45) again to find a bound for the second term on the right hand side of (32). As a result, we arrive at

[TABLE]

Since $(\ell,p_{0})>0$ by assumption of Theorem 1, taking

[TABLE]

leads to the desired result and concludes the proof. \qed

Theorem 1 is a general result regarding nonlinear importance sampling. It holds true for any problem involving the approximation of the posterior probability distribution of the unknown parameters of a state space model as long as Assumptions 1 and 2 hold. These assumptions, in turn, are very mild and amount to the classical assumptions in the analysis of standard IS algorithms.

Remark 2

We draw attention to the fact that the error $|(f,\mu^{M})-(f,\mu)|$ vanishes a.s. when $M\rightarrow\infty$ even if the number of particles $N$ in the BF remains fixed and, hence, $\ell^{N}$ does not converge to $\ell$ . This property has been coined “exact approximation” in the MCMC literature (see [7]).
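The exact approximation property can be observed on a one-dimensional toy problem. The sketch below is an illustration under assumptions made here, not the experiment of Section 5: the prior is ${\mathcal{N}}(0,1)$, the likelihood is $\exp(-\theta^{2}/2)$, the proposal is ${\mathcal{N}}(0,4)$ (so that $p_{0}/q\leq 2$), and the likelihood is only available through the unbiased, bounded estimate $\ell(\theta)U$ with $U\sim{\mathcal{U}}(0,2)$, whose accuracy never improves. The weights are clipped with $M_{c}=\lfloor\sqrt{M}\rfloor$, assuming the clipping rule that truncates the $M_{c}$ largest weights to the $M_{c}$-th largest value.

```python
import numpy as np

def nis_noisy_demo(m, rng):
    """Estimate E[cos(theta)] under the posterior with noisy, clipped weights.

    Toy setting (assumed for illustration only):
      prior p0 = N(0, 1), likelihood l(theta) = exp(-theta^2 / 2),
      proposal q = N(0, 4), hence the posterior is N(0, 1/2).
      The likelihood estimate is l_hat = l * U with U ~ Uniform(0, 2),
      which is unbiased and bounded but does not become more accurate.
    """
    theta = 2.0 * rng.standard_normal(m)                       # draws from q
    lik_hat = np.exp(-0.5 * theta ** 2) * rng.uniform(0.0, 2.0, size=m)
    prior_over_q = 2.0 * np.exp(-0.375 * theta ** 2)           # p0/q, bounded by 2
    g_hat = lik_hat * prior_over_q                             # approximate IWs
    m_c = max(1, int(np.sqrt(m)))                              # M_c <= sqrt(M)
    w_hat = np.minimum(g_hat, np.sort(g_hat)[-m_c])            # clipped TIWs
    return float(np.sum(w_hat * np.cos(theta)) / np.sum(w_hat))

rng = np.random.default_rng(1)
exact = np.exp(-0.25)                  # E[cos(theta)] for theta ~ N(0, 1/2)
for m in (10 ** 2, 10 ** 4, 10 ** 6):
    print(m, abs(nis_noisy_demo(m, rng) - exact))
```

The test function $\cos\theta$ is bounded, as Theorem 1 requires, and its posterior mean $e^{-1/4}$ is available in closed form, so the absolute error can be printed for increasing $M$ while the noise in the weights is held fixed.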

5 Computer simulations

5.1 State-space models

In order to illustrate the performance of the NPMC algorithm and the exact approximation property granted by Theorem 1 we have carried out computer simulations for the estimation of the unknown parameters in a problem consisting of the tracking of a target moving over a region monitored by a network of sensors.

5.1.1 Target dynamics

The target moves over a closed rectangular region ${\mathcal{R}}=[-20,+20]\times[-10,+10]$. When it hits the border of ${\mathcal{R}}$, the target bounces back according to the law of reflection [43]. The state of the system at time $n$ is ${\bf x}_{n}=\left[\begin{array}{c}{\bf r}_{n}\\ {\bf v}_{n}\\ \end{array}\right]\in\mathbb{R}^{4},$ where ${\bf r}_{n}\in{\mathcal{R}}$ is the target position and ${\bf v}_{n}$ its velocity. At time $n=0$, we assume a uniform prior on ${\mathcal{R}}$ for the position and a zero-mean Gaussian distribution for the velocity. To be specific, the prior probability measure is defined as

[TABLE]

where ${\bf I}_{2}$ is the $2\times 2$ identity matrix, ${\mathcal{U}}({\mathcal{R}})$ is the uniform distribution on ${\mathcal{R}}$ and ${\mathcal{N}}({\bf m},{\bf C})$ denotes the Gaussian distribution with mean $\bf m$ and covariance matrix $\bf C$ .

At time $n>0$ , the state vector ${\bf x}_{n}$ evolves according to a linear-Gaussian equation if the target position remains within the bounded region ${\mathcal{R}}$ but it “reflects” back in when the target reaches a border of ${\mathcal{R}}$ . Specifically, let

[TABLE]

where ${\bf u}_{n}\sim{\mathcal{N}}({\bf 0},{\bf C})$ is a zero-mean Gaussian noise term with covariance matrix

[TABLE]

$\kappa$ is a time-discretisation step (we assume $\kappa=1$ in our simulations), $\sigma_{u}^{2}$ is a velocity variance parameter, and $\sigma_{z}^{2}$ is a position variance parameter. The latter are assumed known and identical, $\sigma_{u}^{2}=\sigma_{z}^{2}=10^{-2}$. If $\tilde{\bf x}_{n}$ generated in this way is inside ${\mathcal{R}}$, $\tilde{\bf x}_{n}\in{\mathcal{R}}$, then ${\bf x}_{n}=\tilde{\bf x}_{n}$, otherwise ${\bf x}_{n}=f({\bf x}_{n-1})$, where $f$ is the reflection function detailed in Appendix A. Note that we do not provide an expression for the kernel ${\mathcal{K}}_{n}({\sf d}{\bf x}_{n}|{\bf x}_{n-1})$ but have just described how to draw samples from it instead. This is enough for the implementation of the bootstrap filter and the NPMC algorithm.
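A sampler of this kind can be sketched in a few lines of Python. The code below is not the paper's Algorithm 1: it assumes a constant-velocity update (position advanced by $\kappa$ times the velocity), independent noise components with variances $\sigma_{z}^{2}$ (position) and $\sigma_{u}^{2}$ (velocity), and a mirror reflection applied to the tentative state, with the corresponding velocity component flipped. The names `reflect` and `propagate` are hypothetical.

```python
import numpy as np

X_LIM, Y_LIM = (-20.0, 20.0), (-10.0, 10.0)   # region R from the text

def reflect(pos, vel, lim):
    """Mirror one coordinate back into [lim[0], lim[1]], flipping its velocity."""
    lo, hi = lim
    while pos < lo or pos > hi:
        pos = 2 * lo - pos if pos < lo else 2 * hi - pos
        vel = -vel
    return pos, vel

def propagate(x_prev, rng, kappa=1.0, sigma_z2=1e-2, sigma_u2=1e-2):
    """One step of an assumed constant-velocity model with wall reflection.

    x_prev = [r_x, r_y, v_x, v_y]; the noise covariance is assumed diagonal.
    """
    r, v = x_prev[:2], x_prev[2:]
    r_new = r + kappa * v + np.sqrt(sigma_z2) * rng.standard_normal(2)
    v_new = v + np.sqrt(sigma_u2) * rng.standard_normal(2)
    rx, vx = reflect(r_new[0], v_new[0], X_LIM)
    ry, vy = reflect(r_new[1], v_new[1], Y_LIM)
    return np.array([rx, ry, vx, vy])

rng = np.random.default_rng(0)
x = np.array([19.9, 0.0, 1.0, 0.0])    # close to the right wall, moving right
x_next = propagate(x, rng)
```

Only the ability to draw from the transition kernel is needed by the bootstrap filter, which is why a sampler like this one is sufficient even though the kernel has no simple closed form.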

For illustration, Fig. 1 depicts the region ${\mathcal{R}}$ and a sample trajectory (i.e., a sequence of positions ${\bf r}_{0},{\bf r}_{1},\ldots$ ) which hits the borders of ${\mathcal{R}}$ and is reflected back in at four different times. In the figure, the starting target position is represented by a red diamond, the direction of motion is indicated by arrows and the blue squares represent the position of the sensors used to monitor the target motion.

5.1.2 Observations

There are $J$ sensors deployed in ${\mathcal{R}}$ and, at time $n$ , each sensor collects a measurement of the power of the radio signal transmitted by the target. To be specific, the observation recorded by sensor $j$ at time $n$ has the form

[TABLE]

where $P_{t}$ is the power of the transmitted radio signal, ${\bf s}_{j}$ is the location of the $j$ th sensor, $||{\bf r}_{n}-{\bf s}_{j}||$ is the distance at time $n$ between the target and the sensor, $\nu>0$ is the path loss exponent, $\rho$ is the sensitivity of the sensor, i.e., the minimum power it can measure (note that $y_{j,n}\rightarrow 10\log(\rho)+\epsilon_{j,n}$ when $||{\bf r}_{n}-{\bf s}_{j}||\rightarrow\infty$ ) and $\epsilon_{j,n}\sim{\mathcal{N}}(0,\sigma_{\epsilon}^{2})$ is a Gaussian term accounting for observational errors. We assume $\sigma_{\epsilon}^{2}=1$ is a known parameter.
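For illustration, the measurement model can be simulated as follows. The sketch assumes that the noiseless reading of sensor $j$ is $10\log_{10}\left(P_{t}/\|{\bf r}_{n}-{\bf s}_{j}\|^{\nu}+\rho\right)$, a form chosen here because it reproduces the limit $10\log(\rho)$ noted above when the distance grows; the base-10 logarithm and the function name `observe` are assumptions.

```python
import numpy as np

def observe(r, sensors, p_t, nu, rho, sigma_eps2, rng):
    """Simulate the J sensor readings for a target at position r.

    Assumed measurement form: 10 * log10(P_t / d^nu + rho) + Gaussian noise,
    where d is the target-sensor distance (d > 0 is required).
    """
    d = np.linalg.norm(np.asarray(sensors, dtype=float) - np.asarray(r, dtype=float), axis=1)
    mean = 10.0 * np.log10(p_t / d ** nu + rho)
    return mean + np.sqrt(sigma_eps2) * rng.standard_normal(len(d))

rng = np.random.default_rng(0)
sensors = [[-10.0, 5.0], [0.0, -5.0], [10.0, 5.0]]   # hypothetical sensor layout
y_n = observe([1.0, 2.0], sensors, p_t=0.8, nu=3.0, rho=1e-5, sigma_eps2=1.0, rng=rng)
```

With the sensitivity term $\rho$ inside the logarithm, a very distant target produces readings that fluctuate around a floor instead of diverging to $-\infty$.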

At each time instant $n$ , a vector of $J$ observations ${\bf y}_{n}=[y_{1,n},y_{2,n},\ldots,y_{J,n}]^{T}\in\mathbb{R}^{J}$ is collected. The target is observed over $m$ time instants, and hence the available dataset is ${\bf y}={\bf y}_{1:m}$ . We set $m=50$ for our computer simulations.

5.1.3 Problem statement

Given the state space model described in Sections 5.1.1 and 5.1.2 above, we aim at estimating the unknown parameters $P_{t}$ , $\nu$ and $\rho$ . All other parameters (namely the discretisation period $\kappa$ and the relevant variances) are assumed known. For all computer simulations we have set ground truth values $P_{t}=0.8$ , $\nu=3$ and $\rho=10^{-5}$ for the parameters to be estimated.

Since $P_{t}>0$ and $\rho>0$ , we apply the NPMC algorithm (together with competing algorithms to be described below) to approximate the posterior probability measure $\mu_{{\bf y}}({\sf d}\theta)$ of the vector of unknowns $\theta=[\log P_{t},\nu,\log\rho]^{T}\in\mathbb{R}^{3}$ . We assume prior distributions of the form $\log P_{t}\sim{\mathcal{N}}(-0.11,0.22)$ , $\nu\sim{\mathcal{N}}(0,4)$ and $\log\rho\sim{\mathcal{N}}(-11.02,0.4)$ . Note that, in natural units, the prior mean and variance of $P_{t}$ are $1$ and $0.25$ , respectively, while for $\rho$ the prior mean and variance are $2\times 10^{-5}$ and $2\times 10^{-10}$ .
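The figures quoted in natural units follow from the standard moments of the log-normal distribution: if $Z\sim{\mathcal{N}}(m,s^{2})$ then $E[e^{Z}]=e^{m+s^{2}/2}$ and $\mathrm{Var}[e^{Z}]=(e^{s^{2}}-1)e^{2m+s^{2}}$. The short Python check below assumes natural logarithms and that the second argument of ${\mathcal{N}}(\cdot,\cdot)$ is a variance; the helper name `lognormal_moments` is hypothetical.

```python
import math

def lognormal_moments(mu, var):
    """Mean and variance of exp(Z) when Z ~ N(mu, var)."""
    mean = math.exp(mu + 0.5 * var)
    variance = (math.exp(var) - 1.0) * mean ** 2
    return mean, variance

mean_pt, var_pt = lognormal_moments(-0.11, 0.22)     # prior on P_t
mean_rho, var_rho = lognormal_moments(-11.02, 0.4)   # prior on rho
print(mean_pt, var_pt)
print(mean_rho, var_rho)
```

The computed values agree with the rounded figures in the text: the prior mean of $P_{t}$ is $1$ with variance close to $0.25$, and the prior mean and variance of $\rho$ are close to $2\times 10^{-5}$ and $2\times 10^{-10}$.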

The likelihood $\ell({\bf y}|\theta)$ for the model does not have a closed form and, therefore, it is estimated using a BF, for the state space model described in Sections 5.1.1 and 5.1.2, to yield the approximation $\ell^{N}({\bf y}|\theta)$ detailed in Section 2.2.

5.2 Competing methods

We have applied to this problem the NPMC method described in Section 3, a standard PMC procedure and a particle Metropolis-Hastings (pMH) algorithm. The PMC scheme we have used is identical to the NPMC algorithm of Section 3 except that TIWs are not computed, hence all approximations rely on the conventional IWs.

The pMH is a representative of the class of particle MCMC methods [7] that have become popular in the past two years. It generates a Markov chain on the space of the unknown parameter vector $\theta$ according to the following procedure:

1. Draw $\theta_{0}\sim p_{0}(\theta)$ from the prior distribution of the parameters.

2. At the $r$-th iteration, and given the previous element $\theta_{r-1}$:

(a) Draw a tentative new element $\tilde{\theta}_{r}\sim{\mathcal{N}}(\theta_{r-1},\frac{2}{10}{\bf C})$, where both ${\bf C}=\text{diag}\left(\left[0.22,4,0.4\right]\right)$ and the scale factor $\frac{2}{10}$ have been empirically chosen to optimise the performance of the algorithm.

(b) Compute the (approximate) likelihood $\ell^{N}({\bf y}|\tilde{\theta}_{r})$ and prior density $p_{0}(\tilde{\theta}_{r})$. The acceptance probability for $\tilde{\theta}_{r}$ is

[TABLE]

(c) Draw $u_{r}\sim{\mathcal{U}}(0,1)$. If $u_{r}<\alpha_{r}$ then $\theta_{r}=\tilde{\theta}_{r}$, else $\theta_{r}=\theta_{r-1}$.

When we generate a chain of length $L$ using the procedure above we set a burn-in period of $L/2$, hence estimates are computed from the samples $\theta_{\lfloor L/2\rfloor+1},\ldots,\theta_{L}$ in the chain.
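The procedure above can be sketched in Python as follows. This is a minimal illustration, not the code used for the simulations: it assumes the usual acceptance probability for a symmetric random-walk proposal, $\alpha_{r}=\min\left\{1,\frac{\ell^{N}({\bf y}|\tilde{\theta}_{r})p_{0}(\tilde{\theta}_{r})}{\ell^{N}({\bf y}|\theta_{r-1})p_{0}(\theta_{r-1})}\right\}$, and it stores and reuses the likelihood estimate of the current element, as is standard in particle MCMC [7]. The function `particle_mh` and its arguments are hypothetical; `log_lik_hat` stands for any routine that returns the logarithm of a likelihood estimate such as $\ell^{N}({\bf y}|\theta)$.

```python
import numpy as np

def particle_mh(log_lik_hat, log_prior, theta0, prop_cov, n_iter, rng):
    """Random-walk Metropolis-Hastings driven by a (possibly noisy) likelihood estimate.

    log_lik_hat(theta, rng) returns the log of a likelihood estimate; the value
    attached to the current element is stored and reused, not recomputed.
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_post = log_lik_hat(theta, rng) + log_prior(theta)
    chol = np.linalg.cholesky(np.atleast_2d(prop_cov))
    chain = np.empty((n_iter, theta.size))
    for r in range(n_iter):
        cand = theta + chol @ rng.standard_normal(theta.size)      # step (a)
        log_post_cand = log_lik_hat(cand, rng) + log_prior(cand)   # step (b)
        if np.log(rng.uniform()) < log_post_cand - log_post:       # step (c)
            theta, log_post = cand, log_post_cand
        chain[r] = theta
    return chain[n_iter // 2:]        # discard the first half as burn-in

# Toy check with an exact Gaussian "likelihood" so that the posterior is N(0, 1/2).
rng = np.random.default_rng(3)
samples = particle_mh(lambda t, g: -0.5 * float(t @ t),
                      lambda t: -0.5 * float(t @ t),
                      theta0=[0.0], prop_cov=[[1.0]], n_iter=20000, rng=rng)
```

Working with logarithms avoids numerical underflow when the likelihood is a product over many observations, and comparing $\log u_{r}$ with the log-ratio is equivalent to the test $u_{r}<\alpha_{r}$ in step (c).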

To compare the pMH and PMC-like algorithms on a fair basis, we let $L=M\times K$ , where $K$ is the number of iterations of the NPMC and PMC algorithms and $M$ is the number of samples generated per iteration.

All three methods (PMC, NPMC, pMH) rely on a BF with $N$ particles for the computation of $\ell^{N}({\bf y}|\theta)$ . The value of $N$ is fixed for all algorithms as $N=400$ unless explicitly stated otherwise.

5.3 Results

Figure 2 shows the evolution of the MSE of the estimators of $\theta$ produced by the PMC, NPMC and pMH algorithms as the number of samples is increased.

The error for the NPMC algorithm is at least one order of magnitude below the errors of the conventional PMC and the pMH algorithms for every tested value of $M$ . For $M=200$ samples, for example, the MSE attained by the NPMC is $\approx 1.19\times 10^{-2}$ , while for the standard PMC and pMH algorithms the errors are $\approx 2.49\times 10^{-1}$ and $\approx 5.01$ , respectively.

Next, we aim at finding out the length of the chain, $L$ , required for the pMH algorithm to attain the same performance, in terms of MSE, as the NPMC algorithm. Figure 3 shows the MSE of the pMH method for different chain lengths (equivalently, number of generated samples).

For comparison, the performance of the NPMC algorithm for $M=500$ samples and $K=10$ iterations ($500\times 10=5,000$ Monte Carlo samples overall) is also indicated in the plot. It can be seen that, in the pMH algorithm, chains that are around $500,000$ samples long are required to attain the same MSE as the NPMC algorithm (a 100-fold increase of the computational cost). While the parameters of the pMH scheme may be further tuned to improve this performance, the gap between the algorithms is large enough to conclude that the NPMC method is more efficient in this example.

Finally, we examine the exact approximation property of the NPMC scheme stated by Theorem 1. Figure 4 shows the MSE of the NPMC algorithm versus the number of Monte Carlo samples, $M$ , for different values of $N$ (the number of particles used by the BF to approximate the IWs). While Theorem 1 guarantees that the approximation errors vanish as $M\rightarrow\infty$ , even if $N$ is fixed, it is reasonable to expect that for a fixed $M<\infty$ , greater values of $N$ lead to better performance. This is shown, indeed, by Fig. 4. Note, however, that the difference in performance is very small. For $M=1,000$ , the gap between the MSE of the NPMC scheme with $N=400$ and the NPMC scheme with $N=50$ is $\approx 6\times 10^{-3}$ .

6 Conclusion

We have rigorously proved, under mild assumptions, that nonlinear importance samplers with clipped IWs converge a.s. with optimal Monte Carlo error rates even when the weights can only be estimated (and have a positive, non-vanishing variance) as long as these estimates are unbiased. Therefore, nonlinear importance samplers can perform exact approximation in the same manner as, e.g., particle MCMC schemes. Besides the theoretical contribution, we have numerically shown that the proposed algorithm can be more efficient than a particle Metropolis-Hastings algorithm of the same complexity for inference on a target tracking model.

Acknowledgments

This research has been partially supported by the Spanish Ministry of Economy and Competitiveness (projects TEC2015-69868-C2-1-R ADVENTURE and FIS2013-40653-P), the Spanish Ministry of Education, Culture and Sport (mobility award PRX15/00378) and the Office of Naval Research (ONR) Global (Grant Award no. N62909-15-1-2011).

Appendix A Definition of function $f(\cdot)$

Let us denote the upper right, upper left, lower left and lower right vertices of the monitored region by, respectively, ${\bf c}_{0}$ , ${\bf c}_{1}$ , ${\bf c}_{2}$ and ${\bf c}_{3}$ . The sides of the rectangle, obtained by joining adjacent vertices, are denoted ${\bf l}_{0}=\overline{{\bf c}_{1}{\bf c}_{0}}$ (top), ${\bf l}_{1}=\overline{{\bf c}_{1}{\bf c}_{2}}$ (left), ${\bf l}_{2}=\overline{{\bf c}_{2}{\bf c}_{3}}$ (bottom) and ${\bf l}_{3}=\overline{{\bf c}_{3}{\bf c}_{0}}$ (right). With this notation, Algorithm 1 can be used at time $n$ to generate a sample ${\bf x}_{n}=[{\bf r}_{n}^{\top},{\bf v}_{n}^{\top}]^{\top}$ from ${\bf x}_{n-1}=[{\bf r}_{n-1}^{\top},{\bf v}_{n-1}^{\top}]^{\top}$ . It accounts for the scenario in which the target hits one of the walls and deals with it by means of the law of reflection [43].

We are implicitly assuming that ${\bf r}_{n}\in{\mathcal{R}}$ in step 5 above. If this is not the case, i.e., ${\bf r}_{n}\notin{\mathcal{R}}$, then steps 3–5 can be run again to implement a second reflection.

References

[1]

M. Jansson, B. Wahlberg, A linear regression approach to state-space subspace system identification, Signal Processing 52 (2) (1996) 103–129.

[2]

G. Storvik, Particle filters for state-space models with the presence of unknown static parameters, IEEE Transactions Signal Processing 50 (2) (2002) 281–289.

[3]

C. Andrieu, A. Doucet, Online expectation-maximization type algorithms for parameter estimation in general state space models, in: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 6, IEEE, 2003, pp. VI–69.

[4]

J. Ding, Y. Shi, H. Wang, F. Ding, A modified stochastic gradient based parameter estimation algorithm for dual-rate sampled-data systems, Digital Signal Processing 20 (4) (2010) 1238–1247.

[5]

F. Ding, Y. Gu, Performance analysis of the auxiliary model-based stochastic gradient parameter estimation algorithm for state-space systems with one-step state delay, Circuits, Systems, and Signal Processing 32 (2) (2013) 585–599.

[6]

J. Kokkala, S. Särkkä, Combining particle MCMC with Rao-Blackwellized Monte Carlo data association for parameter estimation in multiple target tracking, Digital Signal Processing 47 (2015) 84–95.

[7]

C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods, Journal of the Royal Statistical Society B 72 (2010) 269–342.

[8]

E. Koblents, J. Míguez, A population Monte Carlo scheme with transformed weights and its application to stochastic kinetic models, Statistics and Computing 25 (2) (2015) 407–425.

[9]

D. Crisan, J. Miguez, Nested particle filters for online parameter estimation in discrete-time state-space Markov models, arXiv:1308.1883v3 [stat.CO].

[10]

N. Kantas, A. Doucet, S. S. Singh, J. M. Maciejowski, N. Chopin, On particle methods for parameter estimation in state-space models, Statistical Science 30 (2015) 328–351.

[11]

J. Olsson, T. Ryden, Rao-Blackwellization of particle Markov chain Monte Carlo methods using forward filtering backward sampling, IEEE Transactions on Signal Processing 59 (10) (2011) 4606–4619.

[12]

T. Vu, B.-N. Vo, R. Evans, A particle marginal Metropolis-Hastings multi-target tracker, IEEE Transactions on Signal Processing 62 (15) (2014) 3953–3964.

[13]

J. Kwon, R. Dragon, L. Van Gool, Joint tracking and ground plane estimation, IEEE Signal Processing Letters 23 (11) (2016) 1514–1517.

[14]

J. Ala-Luhtala, N. Whiteley, K. Heine, R. Piché, An introduction to twisted particle filters and parameter estimation in non-linear state-space models, IEEE Transactions on Signal Processing 64 (18) (2016) 4875–4890.

[15]

W. J. Fitzgerald, Markov chain Monte Carlo methods with applications to signal processing, Signal Processing 81 (1) (2001) 3–18.

[16]

N. Gordon, D. Salmond, A. F. M. Smith, Novel approach to nonlinear and non-Gaussian Bayesian state estimation, IEE Proceedings-F 140 (2) (1993) 107–113.

[17]

A. Doucet, N. de Freitas, N. Gordon (Eds.), Sequential Monte Carlo Methods in Practice, Springer, New York (USA), 2001.

[18]

A. Doucet, S. Godsill, C. Andrieu, On sequential Monte Carlo Sampling methods for Bayesian filtering, Statistics and Computing 10 (3) (2000) 197–208.

[19]

P. M. Djurić, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo, J. Míguez, Particle filtering, IEEE Signal Processing Magazine 20 (5) (2003) 19–38.

[20]

O. Cappé, S. J. Godsill, E. Moulines, An overview of existing methods and recent advances in sequential Monte Carlo, Proceedings of the IEEE 95 (5) (2007) 899–924.

[21]

C. P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, 2004.

[22]

O. Cappé, A. Guillin, J. M. Marin, C. P. Robert, Population Monte Carlo, Journal of Computational and Graphical Statistics 13 (4) (2004) 907–929.

[23]

N. Chopin, P. E. Jacob, O. Papaspiliopoulos, SMC2: an efficient algorithm for sequential analysis of state space models, Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[24]

M. Hong, M. F. Bugallo, P. M. Djuric, Joint model selection and parameter estimation by population Monte Carlo simulation, IEEE Journal of Selected Topics in Signal Processing 4 (3) (2010) 526–539.

[25]

L. Martino, V. Elvira, D. Luengo, J. Corander, An adaptive population importance sampler: Learning from uncertainty, IEEE Transactions on Signal Processing 63 (16) (2015) 4422–4437.

[26]

M. F. Bugallo, L. Martino, J. Corander, Adaptive importance sampling in signal processing, Digital Signal Processing 47 (2015) 36–49.

[27]

V. Elvira, L. Martino, D. Luengo, M. F. Bugallo, Improving population Monte Carlo: Alternative weighting and resampling schemes, Signal Processing 131 (2017) 77–91.

[28]

N. Chopin, A sequential particle filter method for static models, Biometrika 89 (3) (2002) 539–552.

[29]

P. Del Moral, A. Doucet, A. Jasra, Sequential Monte Carlo samplers, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (3) (2006) 411–436.

[30]

A. Kong, J. S. Liu, W. H. Wong, Sequential imputations and Bayesian missing data problems, Journal of the American Statistical Association 89 (1994) 278–288.

[31]

V. Elvira, L. Martino, D. Luengo, M. F. Bugallo, Efficient multiple importance sampling estimators, IEEE Signal Processing Letters 22 (10) (2015) 1757–1761.

[32]

A. Bain, D. Crisan, Fundamentals of Stochastic Filtering, Springer, 2008.

[33]

B. D. O. Anderson, J. B. Moore, Optimal Filtering, Englewood Cliffs, 1979.

[34]

G. Kitagawa, Monte Carlo filter and smoother for non-Gaussian nonlinear state-space models, J. Comput. Graph. Statist. 1 (1996) 1–25.

[35]

R. Douc, O. Cappé, E. Moulines, Comparison of resampling schemes for particle filtering, in: Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005, pp. 64–69.

[36]

P. Del Moral, Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications, Springer, 2004.

[37]

J. Míguez, D. Crisan, P. M. Djurić, On the convergence of two sequential Monte Carlo methods for maximum a posteriori sequence estimation and stochastic global optimization, Statistics and Computing 23 (1) (2013) 91–107.

[38]

C. Andrieu, G. Roberts, The pseudo-marginal approach for efficient Monte Carlo computations, Annals of Statistics 37 (2009) 697–725.

[39]

A. Doucet, N. de Freitas, N. Gordon, An introduction to sequential Monte Carlo methods, in: A. Doucet, N. de Freitas, N. Gordon (Eds.), Sequential Monte Carlo Methods in Practice, Springer, 2001, Ch. 1, pp. 4–14.

[40]

D. Crisan, J. Miguez, G. Ríos, A simple scheme for the parallelisation of particle filters and its application to the tracking of complex stochastic systems, arXiv:1407.8071v2 [stat.CO].

[41]

E. Koblents, J. Miguez, M. A. Rodriguez, A. M. Schmidt, A nonlinear population Monte Carlo scheme for the Bayesian estimation of parameters of $\alpha$ -stable distributions, Computational Statistics and Data Analysis 95 (2016) 57–74.

[42]

D. Crisan, J. Miguez, Particle-kernel estimation of the filter density in state-space models, Bernoulli 20 (4) (2014) 1879–1929.

[43]

G. Farin, D. Hansford, Practical linear algebra: A geometry toolbox, CRC Press, 2013.
