Moment-Based Variational Inference for Markov Jump Processes

Christian Wildner; Heinz Koeppl

arXiv:1905.05451·cs.LG·May 15, 2019

Moment-Based Variational Inference for Markov Jump Processes

Christian Wildner, Heinz Koeppl

PDF

TL;DR

This paper introduces a flexible moment-based variational inference framework for approximate smoothing in latent Markov jump processes, enabling efficient inference and parameter estimation in complex stochastic models.

Contribution

It presents a novel variational inference method that partitions transitions to express divergence in terms of moments, applicable to common jump processes and extended to parameter inference.

Findings

01

Effective approximation of smoothing in Markov jump processes.

02

Extension to parameter inference demonstrated on multiple examples.

03

Flexible framework adaptable to various jump process classes.

Abstract

We propose moment-based variational inference as a flexible framework for approximate smoothing of latent Markov jump processes. The main ingredient of our approach is to partition the set of all transitions of the latent process into classes. This allows to express the Kullback-Leibler divergence between the approximate and the exact posterior process in terms of a set of moment functions that arise naturally from the chosen partition. To illustrate possible choices of the partition, we consider special classes of jump processes that frequently occur in applications. We then extend the results to parameter inference and demonstrate the method on several examples.

Equations212

P (X (t + Δ t) = y ∣ X (t) = x) = Q (x, y, t) Δ t + o (Δ t)

P (X (t + Δ t) = y ∣ X (t) = x) = Q (x, y, t) Δ t + o (Δ t)

\frac{d}{d t} p (x, t) = y \neq = x \sum Q (y, x, t) p (y, t) - y \neq = x \sum Q (x, y, t) p (x, t) .

\frac{d}{d t} p (x, t) = y \neq = x \sum Q (y, x, t) p (y, t) - y \neq = x \sum Q (x, y, t) p (x, t) .

\frac{d}{d t} E [g (X (t)] = y \sum E [(g (y) - g (X (t)) Q (X (t), y, t)],

\frac{d}{d t} E [g (X (t)] = y \sum E [(g (y) - g (X (t)) Q (X (t), y, t)],

D [μ ∣∣ ν] = {\int lo g (\frac{d μ}{d ν}) d μ \infty μ ≪ ν, otherwise,

D [μ ∣∣ ν] = {\int lo g (\frac{d μ}{d ν}) d μ \infty μ ≪ ν, otherwise,

D [P^{X} ∣∣ P^{Z}] = \int_{0}^{T} x \sum p^{X} (x, t) y \neq = x \sum [Q^{Z} (x, y, t) - Q^{X} (x, y, t) - Q^{X} (x, y, t) lo g (\frac{Q ^{X} ( x , y , t )}{Q ^{Z} ( x , y , t )})] d t,

D [P^{X} ∣∣ P^{Z}] = \int_{0}^{T} x \sum p^{X} (x, t) y \neq = x \sum [Q^{Z} (x, y, t) - Q^{X} (x, y, t) - Q^{X} (x, y, t) lo g (\frac{Q ^{X} ( x , y , t )}{Q ^{Z} ( x , y , t )})] d t,

\tilde{Q} (x, y, t) = \frac{σ ( y , t )}{σ ( x , t )} Q (x, y) .

\tilde{Q} (x, y, t) = \frac{σ ( y , t )}{σ ( x , t )} Q (x, y) .

\frac{d}{d t} σ (x, t) = - y \neq = x \sum Q (x, y) (σ (y, t) - σ (x, t))

\frac{d}{d t} σ (x, t) = - y \neq = x \sum Q (x, y) (σ (y, t) - σ (x, t))

t ↗ t_{k} lim σ (x, t) = σ (x, t_{k}) p (y_{k} ∣ x)

t ↗ t_{k} lim σ (x, t) = σ (x, t_{k}) p (y_{k} ∣ x)

Z^{*} = ar g Z \in U min D [P^{Z} ∣∣ P^{\tilde{X}}]

Z^{*} = ar g Z \in U min D [P^{Z} ∣∣ P^{\tilde{X}}]

D [P^{Z} ∣∣ P^{\tilde{X}}] = D [P^{Z} ∣∣ P^{X}] - k = 1 \sum n E [lo g p (y_{k} ∣ Z (t_{k})] + const

D [P^{Z} ∣∣ P^{\tilde{X}}] = D [P^{Z} ∣∣ P^{X}] - k = 1 \sum n E [lo g p (y_{k} ∣ Z (t_{k})] + const

Q^{Z} (x, y, t) = λ (x, y, t) Q (x, y),

Q^{Z} (x, y, t) = λ (x, y, t) Q (x, y),

λ (x, y) = λ_{i} (t) for (x, y) \in Π_{i},

λ (x, y) = λ_{i} (t) for (x, y) \in Π_{i},

φ_{i} (t) = (x, y) \in Π_{i} \sum Q (x, y) p^{Z} (x, t),

φ_{i} (t) = (x, y) \in Π_{i} \sum Q (x, y) p^{Z} (x, t),

\tilde{Q}_{i} (x) = y \sum 1_{Π_{i}} (x, y) Q (x, y)

\tilde{Q}_{i} (x) = y \sum 1_{Π_{i}} (x, y) Q (x, y)

φ_{i} (t) = E [\tilde{Q}_{i} (Z (t))] .

φ_{i} (t) = E [\tilde{Q}_{i} (Z (t))] .

D [P^{Z} ∣∣ P^{X}] = L [φ, λ]

D [P^{Z} ∣∣ P^{X}] = L [φ, λ]

L [φ, λ] = i = 1 \sum r \int_{0}^{T} φ_{i} (t) (1 - λ_{i} (t) + λ_{i} (t) lo g λ_{i} (t)) d t .

L [φ, λ] = i = 1 \sum r \int_{0}^{T} φ_{i} (t) (1 - λ_{i} (t) + λ_{i} (t) lo g λ_{i} (t)) d t .

\frac{d}{d t} φ_{i} (t) = j = 1 \sum r λ_{j} (t) E [y \sum (\tilde{Q}_{i} (y) - \tilde{Q}_{i} (Z (t))) 1_{Π_{j}} (Z (t), y) Q (Z (t), y)] .

\frac{d}{d t} φ_{i} (t) = j = 1 \sum r λ_{j} (t) E [y \sum (\tilde{Q}_{i} (y) - \tilde{Q}_{i} (Z (t))) 1_{Π_{j}} (Z (t), y) Q (Z (t), y)] .

\frac{d}{d t} φ (t) = A (λ (t)) φ (t) + B (λ (t)) \tilde{φ} (t)

\frac{d}{d t} φ (t) = A (λ (t)) φ (t) + B (λ (t)) \tilde{φ} (t)

\frac{d}{d t} φ (t) = f (λ (t), φ (t)),

\frac{d}{d t} φ (t) = f (λ (t), φ (t)),

minimize subject to L [λ, φ] - F [φ] \frac{d}{d t} φ (t) = f (λ (t), φ (t)) .

minimize subject to L [λ, φ] - F [φ] \frac{d}{d t} φ (t) = f (λ (t), φ (t)) .

F [φ] = k = 1 \sum n E [lo g p (y_{k} ∣ Z (t_{k})]

F [φ] = k = 1 \sum n E [lo g p (y_{k} ∣ Z (t_{k})]

S (t) = j, for X (t) = x, x \in Π_{j} .

S (t) = j, for X (t) = x, x \in Π_{j} .

Π_{ij} = {(x, y) \in {1, \dots, s}^{2} : x \in Π_{i}, y \in Π_{j}} .

Π_{ij} = {(x, y) \in {1, \dots, s}^{2} : x \in Π_{i}, y \in Π_{j}} .

φ_{ij} (t) = x, y \sum 1_{Π_{i}} (x) 1_{Π_{j}} (y) Q (x, y) q (x, t) .

φ_{ij} (t) = x, y \sum 1_{Π_{i}} (x) 1_{Π_{j}} (y) Q (x, y) q (x, t) .

\tilde{Q}_{ij} (x) = 1_{Π_{i}} (x) y \sum 1_{Π_{j}} (y) Q (x, y)

\tilde{Q}_{ij} (x) = 1_{Π_{i}} (x) y \sum 1_{Π_{j}} (y) Q (x, y)

\tilde{Q}_{ij} (x) = {\tilde{Q}_{ij} 0 x \in Π_{i} otherwise,

\tilde{Q}_{ij} (x) = {\tilde{Q}_{ij} 0 x \in Π_{i} otherwise,

\frac{d}{d t} φ_{ij} = \tilde{Q}_{ij} l \neq = i \sum λ_{l i} (t) φ_{i l} (t) - l \neq = i \sum λ_{i l} (t) φ_{i l} (t)

\frac{d}{d t} φ_{ij} = \tilde{Q}_{ij} l \neq = i \sum λ_{l i} (t) φ_{i l} (t) - l \neq = i \sum λ_{i l} (t) φ_{i l} (t)

Π_{z, z^{'}} = {(x, x^{'}) \in Ψ : T (x) = z, T (x^{'}) = z^{'}}

Π_{z, z^{'}} = {(x, x^{'}) \in Ψ : T (x) = z, T (x^{'}) = z^{'}}

Q (x, y) = {Q_{j k}^{i} (x) 0 x_{i} = j, y_{i} = k, x_{l} = y_{l} for l \neq = i otherwise

Q (x, y) = {Q_{j k}^{i} (x) 0 x_{i} = j, y_{i} = k, x_{l} = y_{l} for l \neq = i otherwise

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Moment-Based Variational Inference for Markov Jump Processes

Christian Wildner

Heinz Koeppl

Abstract

We propose moment-based variational inference as a flexible framework for approximate smoothing of latent Markov jump processes. The main ingredient of our approach is to partition the set of all transitions of the latent process into classes. This allows to express the Kullback-Leibler divergence between the approximate and the exact posterior process in terms of a set of moment functions that arise naturally from the chosen partition. To illustrate possible choices of the partition, we consider special classes of jump processes that frequently occur in applications. We then extend the results to parameter inference and demonstrate the method on several examples.

Machine Learning, ICML

1 Introduction

Markov jump processes (MJPs) are popular modelling tools in a number of domains including physics, biology, mathematical finance and chemistry (Anderson, 1991). In many cases, these models are used for planning and prediction. To achieve this, the model parameters have to be calibrated from sparse and noisy measurements. Mathematically, this leads to the problem of inference for stochastic processes. Unfortunately, inference in such scenarios is notoriously difficult because parameter inference requires estimation of the latent Markov chain given the observations as an intermediate step. This is especially true in scenarios with unbounded state-spaces such as population models. In recent years, population models have become popular in systems and synthetic biology to describe the intrinsic stochasticity of bio-chemical reactions in a cellular environment (Anderson & Kurtz, 2015).

Algorithms for latent state estimation are often based on the sequential Monte Carlo framework (Doucet & Johansen, 2011) or specialized MCMC approaches (Rao & Teh, 2013). Alternatively, if the state space is not too large, one can also apply a continuous version of the classical forward-backward algorithm for hidden Markov models. In the context of parameter learning, many approaches are likelihood-based and apply a form of the EM-algorithm (Bladt & Sørensen, 2005; Metzner et al., 2007; Liu et al., 2015). Bayesian approaches to joint estimation of latent states and parameters usually focus on the particle Markov chain Monte Carlo framework which has inspired many applications (Andrieu et al., 2010; Golightly & Wilkinson, 2011; Frigola et al., 2013).

While variational inference is ubiquituos in machine learning, most applications focus on probabilistic graphical models (Blei et al., 2017). An important contribution in the domain of continuous time dynamic models is the work of Archambeau et al. (2007) who developed a Gaussian approximation for stochastic differential equations. Opper & Sanguinetti (2008) proposed the first variational method for MJPs on the process level. The core idea of their approach is to approximate a coupled multi-component process by a product of independent component processes. This idea was adapted by Cohn et al. (2010) to continuous time Bayesian networks. More recently, Zhang et al. (2017) combined uniformization with the classical variational method for graphical models.

In this work, we propose a general procedure to derive variational inference algorithms for arbitrary MJPs on a countable state space. While based on the same divergence functional as the approach in (Opper & Sanguinetti, 2008), our approximation is conceptually different. The key step in our approach is to partition the state of all transitions into a number of predefined classes. We then express the divergence function in terms of natural moment functions that are induced by the partition. This allows to approximate the inference problem by an optimal control problem with a complexity controlled by the chosen partition for the transitions. Another difference to the mean-field approach is that we construct our variational family as a modification of the prior process, thus overcoming issues with absolute continuity that can arise from a product approximation.

The remaining part of the paper is organized as follows. Section 2 summarizes background material on MJPs. In Section 3 we construct our variational smoothing algorithm and discuss a few special process classes. Section 4 discusses an extension to latent parameter inference. Finally, in Section 5 we demonstrate the methods on several examples.

2 Background

2.1 Markov Jump Processes

We consider a Markov jump process $X(t)$ on a countable state space $\mathcal{X}$ over a finite interval $[0,T]$ . As is well-known, such a process is fully specified by an initial measure $p_{0}:\mathcal{X}\rightarrow[0,1]$ and a transition function $Q:\mathcal{X}\times\mathcal{X}\times[0,T]\rightarrow[0,\infty)$ . The transition function defines the infinitesimal transition probabilities in the sense that for $y\neq x$ we have

[TABLE]

For any $t>0$ the marginal probability distribution $p(x,t)=P(X(t)=x)$ satisfies the master equation

[TABLE]

Let $g:\mathcal{X}\rightarrow\mathbb{R}$ be a function with finite expectation $\mathsf{E}[g(X(t)]<\infty$ for $t\in[0,T]$ . Then the expectation obeys a differential equation of the form

[TABLE]

which can usually not be expressed in terms of $\mathsf{E}[g(X(t)]$ alone (Ethier & Kurtz, 2005).

2.2 Path Divergence

For two probability measures $\mu$ and $\nu$ on a common probability space, the Kullback-Leibler (KL) divergence from $\mu$ to $\nu$ is defined as

[TABLE]

where $\frac{d\mu}{d\nu}$ is the Radon-Nikodym derivative of measures and $\mu\ll\nu$ refers to $\mu$ being absolutely continuous with respect to $\nu$ .

Consider two MJPs $X$ and $Z$ on $\mathcal{X}$ over $[0,T]$ with possibly time-dependent transition functions $Q^{X},Q^{Z}$ that share the same initial distribution. Let $P^{X}$ , $P^{Z}$ denote the measures over the sample path space induced by $X$ and $Z$ . Then the divergence on the level of sample paths is given by

[TABLE]

where and $p^{X}(\cdot,t)$ refers to the marginal distribution of $X(t)$ . For this expression to be finite it is required that $Q^{X}(x,y,t)$ is zero whenever $Q^{Z}(x,y,t)$ is zero. A rigorous proof of (2) can be obtained using Girsanov’s theorem for counting processes (Kipnis & Landim, 1998). Alternative derivations are given in (Cohn et al., 2010) based on explicit integration over waiting times or in (Opper & Sanguinetti, 2008) using a limit of discrete-time Markov chains for vanishing time steps.

2.3 Conditional Processes

From here on, we focus on a process $X$ with time-independent transition function $Q$ . Often the true state $X(t)$ of the system is only accessible through a $p$ -dimensional observation process $Y.$ We focus on the common scenario where observations occur at fixed discrete times $0\leq t_{1},\ldots,t_{n}\leq T$ that are not necessarily equidistant. Such an observation process may be written as a single matrix-valued random variable $Y=(Y(t_{1}),\ldots,Y(t_{n}))$ . We assume that the observation at time $t_{k}$ only depends on the latent state at time $t_{k}$ and denote the conditional density of the observations as $p(y_{k}\mid X(t_{k}))$ .

From a Bayesian perspective, we are interested in the conditional probability distribution $P(X(t)=x\mid y)$ for a particular realization $y\in\mathbb{R}^{p\times n}$ . For this type of observation, it can be shown that the smoothed process $\tilde{X}$ associated with $P(X(t)=x\mid y)$ is a non-homogeneous MJP (Huang et al., 2016). The corresponding transition function $\tilde{Q}$ is now time dependent and given by

[TABLE]

Here $Q$ is the transition function of the prior process $X$ and $\sigma(x,t)=p(y_{k},\ldots,y_{n}\mid X(t)=x)$ with $k=\min_{k\in\mathbb{N}}\{t_{k}>t\}$ . It can be shown that $\sigma$ obeys the backward equation

[TABLE]

that has to be solved for the terminal condition $\sigma(x,T)=1$ for all $x\in\mathcal{X}$ and jump conditions

[TABLE]

for all $x\in\mathcal{X}$ at the observation times $t_{k}$ .

2.4 Variational Smoothing

Since the smoothed process $\tilde{X}$ as defined in Subsection 2.3 is typically intractable, we aim to find the best process level approximation $Z$ within a simpler variational class $\mathcal{U}$ . Following the usual variational inference framework (Jordan et al., 1999; Blei et al., 2017), we formalize this problem by choosing $Z^{*}\in\mathcal{U}$ such that

[TABLE]

minimizes the path divergence to the true posterior process. The functional optimization problem (5) still depends on the true posterior process $\tilde{X}$ . As shown in the supplement, we can rewrite the objective function in (5) as

[TABLE]

which is independent of the exact functional form of the posterior intensities.

3 Moment-Based Variational Smoothing

3.1 Variational Process Family

In our scenario, the smoothed process $\tilde{X}$ is an MJP with the modified intensities (3). Intuitively, the modified transition rate accounts for deviations of the posterior from the prior process. If the observations suggest that transition $x\rightarrow y$ has occurred more often than expected up to a certain time point, the posterior process will scale up the intensity to match the observation. The core idea of our approximation is to mimic this behavior with a process that is simpler than the full posterior. Thus, we define the variational transition function $Q^{Z}$ as

[TABLE]

where we have introduced the variational scaling factor $\lambda$ . Note that (7) corresponds to an overparametrization of the variational problem. Solving (5) with the variational family induced by (7) recovers the true posterior transition function (3). In order to achieve a reduction in complexity, we group transitions together such that all transition within the group share a single scaling factor.

To make this more precise, let $\Psi=\{(x,y)\in\mathcal{X}\times X:x\neq y,Q(x,y)>0\}$ be the set of all possible transitions of the prior process. Consider a partition $\Pi=\{\Pi_{i}\subset\Psi:i=1,\ldots,r\}$ such that $\Pi_{i}\cap\Pi_{j}=\emptyset$ for $i\neq j$ and $\bigcup_{i=1}^{r}\Pi_{i}=\Psi$ . The partition $\Pi$ induces a variational family by restricting the variational scaling factor $\lambda$ to be of the form

[TABLE]

where the $\lambda_{i}:[0,T]\rightarrow[0,\infty)$ are from a suitably regular class of functions. Systematic ways to choose such a transition space partition will be discussed in Section 3.3 and Section 4.

3.2 Moment-Based Divergence

Define the natural moment functions associated with the partition $\Pi$ as

[TABLE]

where $p^{Z}$ refers to the marginal distribution of the variational process $Z(t)$ . It is also convenient to introduce the function $\tilde{Q}_{i}$ as

[TABLE]

with the indicator function $\mathbf{1}_{\Pi_{i}}(x,y)=1$ iff the transition $(x,y)$ is contained in $\Pi_{i}$ . Intuitively, $\tilde{Q}_{i}(x)$ corresponds to the total exit rate from $x$ to states within transition class $i$ . We can now express the natural moment functions as

[TABLE]

Consider the divergence from the variational process to the prior process $D[P^{Z}\,||\,P^{X}]$ in (6). Exploiting (7), (8) and (9) we obtain

[TABLE]

where $\varphi$ and $\lambda$ are vector-valued functions with components $\lambda_{i}$ and $\varphi_{i}$ , respectively. The functional $L$ is defined as

[TABLE]

Since the variational process $Z$ is a non-homogeneous MJP, the moment functions $\varphi$ and the scaling factors $\lambda$ are not independent, but obey a differential equation of the form (1)

[TABLE]

Our goal is to express the r.h.s. of (10) in terms of $\varphi(t)$ , $\lambda(t)$ to obtain a closed-form description of the variational problem in terms these quantities. In general, this is not possible and we obtain an equation of the form

[TABLE]

where $A$ and $B$ are suitable matrix valued functions and $\tilde{\varphi}$ is a collection of higher order moment functions that cannot be reduced to $\varphi$ . In this case, we can apply an additional approximation in the form of moment closure as often used in the analysis of non-linear stochastic dynamical systems (Kuehn, 2016). Generally speaking, a moment closure scheme is a function $\psi$ such that we may write $\tilde{\varphi}(t)=\psi(\varphi(t))$ . Thus, we arrive at an equation of the form

[TABLE]

where $f$ depends on the process and the applied closure scheme. We can now recast the variational inference problem (5) into a non-linear optimal control problem of the form

[TABLE]

Here, $F$ refers to the observation contribution in (6), i.e.

[TABLE]

which has to be understood as a functional of the natural moments $\varphi$ . Before turning to special classes of MJPs, we end this section with a remark on moment closure. When the moment equations (11) are naturally closed (i.e. $B=0$ ), than the control problem (13) corresponds exactly to the process level formulation (5) and the minimum of $L$ corresponds to the usual evidence lower bound. When we use moment closure, there is no global process equivalent to (13) and the lower bound property is lost. This effect has also been observed for other variational approximations that go beyond product mean-field, for example cluster variational methods (Yedidia et al., 2000). In our case, we can recover a valid process level result using the solution $\lambda^{*}$ in (7). Generating sample paths from the so defined approximate posterior gives a means to analyze the accuracy of the moment closure approximation.

3.3 Special Classes

In this section, we consider special classes of structured MJPs that give rise to a natural partition of the transition space.

State Space Lumping.

Consider an MJP $X$ with countable state space $\mathcal{X}$ and let $\{\Pi_{1},\ldots,\Pi_{s}\}$ be a partition of the state space (in contrast to the partition of the transition space). The lumped process can now be defined as

[TABLE]

A partition of the state space naturally induces a partition of the transition space given by

[TABLE]

This partition subsumes all transitions of the lumped process. Accordingly, the natural moment functions become

[TABLE]

It is also interesting to observe that

[TABLE]

which is the total transition rate from $x$ in $\Pi_{i}$ to $\Pi_{j}$ . In the special case that the MJP is lumpable with respect to the partition $\Pi$ (Rubino & Sericola, 1993), we obtain

[TABLE]

where $\tilde{Q}_{ij}=\sum_{y\in\Pi_{j}}Q(x,y)$ which is identical for all $x\in\Pi_{i}$ . Inserting this into the dynamic equation yields

[TABLE]

Dividing both sides by $\tilde{Q}_{ij}$ , we observe that (14) is equivalent to the master equation of the lumped process $S$ . Accordingly, our variational approach will recover the exact posterior of the lumped process. While most examples of practical interest are not lumpable, (14) provides a way to exploit known approximate lumping schemes for inference purposes by understanding the approximate lumping as a moment closure scheme.

Data Driven Partitioning.

In many cases, the observations $Y$ do not depend on the full latent state $X(t)$ but rather on a summary statistic $S(t)=T(X(t))$ for $T:\mathcal{X}\rightarrow\mathcal{S}$ . In general, $S(t)$ will be a non-Markovian jump process on $\mathcal{S}$ . However, it induces a natural partition of the form

[TABLE]

meaning that we group all transitions of $X$ that lead to the same state change of $S$ in $\mathcal{S}$ .

Agent-Based MJPs.

An agent-based MJP $X$ on $\mathcal{X}=\mathbb{\{}1,\ldots,m\}^{d}$ describes the behavior of $d$ coupled agents where each agent may be in one of $m$ states. In addition, the probability of two agents changing the state at the same time is assumed to be zero. In terms of the transition function, we obtain

[TABLE]

where $Q^{i}_{jk}$ is the transition rate of agent $i$ from state $j$ to state $k$ as a function of the state of the other agents. This is the class of processes targeted by the existing mean-field approaches for jump processes (Opper & Sanguinetti, 2008; Cohn et al., 2010). Since our variational family is constructed by modifying the prior process, we cannot recover the product-type approximation used in the classical approaches. We can, however, mimic its behavior by subsuming all transitions that produce the same change of the same agent, i.e.

[TABLE]

Using (15), we can evaluate the corresponding moment functions as

[TABLE]

which is quite intuitive, since it is the transitions rate of agent $i$ to go from $j$ to $k$ averaged over all configurations of the remaining agents.

Population-Based MJPs.

A population-based MJP $X$ on $\mathcal{X}=\mathbb{N}_{0}^{d}$ describes the stochastic evolution of the abundances of $d$ species over time. While they seem similar to agent models from the previous section, population models usually have an unbounded state space. In addition, one event may affect several species at the same time. This fact is formalized by a set of change vectors $\{v_{i}\in\mathbb{Z}_{0}^{d}:i=1,\ldots,r\}$ and a transition function of the form

[TABLE]

If there are change vectors that affect several species, the product mean-field approach is ill-suited for this type of system, since a product of independent processes will not be absolutely continuous with respect to the prior process. In our framework, a natural choice of partition for this type of model is to combine all transitions that correspond to the same change vector

[TABLE]

The corresponding natural moment functions become

[TABLE]

The time-evolution of $\varphi$ is given by

[TABLE]

For systems where $h_{i}$ are linear functions of the state, this equation is closed in $\varphi_{i}$ . An example of this class will be discussed later.

3.4 Minimizing the Divergence.

We now introduce an algorithm to find approximately optimal variational scaling functions $\lambda^{*}$ . We do this following the indirect approach from optimal control (Bonnard & Caillau, 2006). The idea here is to transform the constrained problem (13) into an unconstraint problem by introducing the Lagrangian

[TABLE]

where $\eta$ is a vector-valued Lagrange multiplier function. By taking the functional derivative with respect to $\varphi_{i}$ we obtain

[TABLE]

valid in between the observations. At the point of the observations, the functional $F$ will induce jump conditions for $\eta$ given by

[TABLE]

Similarly, differentiation with respect to the scaling factor $\lambda_{i}$ yields an algebraic constraint of the form

[TABLE]

In the classical mean-field algorithm, the scaling functions $\lambda$ can be expressed in terms of $\eta$ by solving (A.12). The control problem is then solved using a form of the forward-backward sweep method (Mcasey et al., 2012). In our framework, it is generally not possible to eliminate the scaling factors. A variation of the forward-backward sweep explicitly including the scaling factors turned out to be unstable in our experiments.

We therefore propose a gradient-based algorithm based on the following argument. The moment functions $\varphi$ are fully determined by the scaling factors $\lambda$ and an initial condition. Hence, we may understand $L$ as a functional of $\varphi$ only. In this case, the r.h.s. of (A.12) corresponds to the gradient of $L$ with respect to $\varphi$ . Since the raw gradient updates obtained this way do not work too well, we take advantage of the probabilistic nature of the objective function to derive an analogue to the natural gradient (Amari, 1998). We do this by performing an expansion of the path divergence up to second order in $\lambda$ . From this, we obtain updates of the form

[TABLE]

with step size $h$ and the and the natural gradient

[TABLE]

The updates (A.14), (20) ensure that we take small steps on the manifold defined by the variational family and improve convergence significantly. The resulting algorithm is summarized in Algorithm 1. A more detailed discussion including a derivation of the natural gradient can be found in the supplement Section 3.

4 Parameter Inference

Suppose now that the transitions function $Q$ of the prior process depends on a collection of parameters $\theta$ . Using the usual framework of the variational EM algorithm, we can find an approximate maximum likelihood estimate by understanding the objective function $L[\lambda,\theta]$ as a function of the scaling factors and the parameters and then minimizing iteratively with respect to both arguments. In general, not much can be said about the structure of the resulting optimization problem. In the following, we focus on a special case that allows for closed form parameter updates.

Let $\Pi$ be a partition of the transition space such that the prior process has parametrized transition function of the form

[TABLE]

where $c_{i}$ and $h$ are known functions with $c_{i}$ differentiable with respect to $\theta$ . We proceed by defining our variational family on $\Pi$ by setting

[TABLE]

Note that (21) slightly differs from the ansatz introduced in Section 3.1 in the fact that it only contains a part of the prior intensity function. As a consequence, the variational family remains independent of the parameters $\theta$ . Setting

[TABLE]

we can reuse the formalism introduced in section 3.2 to obtain stationarity conditions for $\lambda$ . In addition, for fixed $\lambda$ , $\varphi$ we obtain optimality conditions for the parameters of the form

[TABLE]

where we have introduced the summary statistics

[TABLE]

Intuitively, $H_{i}$ corresponds to the expected number of transitions in class $i$ during $[0,T]$ . In the case where we have a single parameter per transition class, i.e. $c_{i}(\theta)=\theta_{i}$ , we get

[TABLE]

As an alternative to the EM approach, we may also follow a Bayesian framework, i.e. by assuming a prior $p(\theta)$ . Choosing a variational ansatz that factorizes over parameters and latent trajectories, we get an exponential family variational parameter distribution of the form

[TABLE]

If the prior is a gamma distribution and the we have a single parameter per transitions class, (22) will also have the form of a gamma distribution.

5 Examples

In this section, we apply our method to three examples. We focus on models of the population type for which inference is notoriously difficult due to the unbounded state space. First we consider a linear birth death process. This simple example allows analytic treatment and provides some intuition. Next, we numerically study a stochastic gene expression model of the type that is frequently used in systems biology. As a third example, we consider a stochastic predator prey model to demonstrate our method in combination with moment closure.

5.1 Birth Death Process

Let $\mathcal{X}=\mathbb{N}_{0}$ , $X_{0}\in\mathcal{X}$ and $c_{1},c_{2}>0$ . For the transition vectors $v_{1}=1$ , $v_{2}=-1$ together with the intensity

[TABLE]

the resulting process $\{X(t),t\geq 0\}$ is known as the linear birth death process and can be seen as the simplest case of a population-type model. We therefore following the approach of choosing the partition $\Pi=\{\Pi_{1},\Pi_{2}\}$ with

[TABLE]

which induce the moment functions

[TABLE]

Since, $\varphi_{1}$ is constant, the moment dynamics are reduced to a single closed equation of the form

[TABLE]

Now suppose we consider the conditional process with endpoint condition $X(T)=0$ . For simplicity, let us also assume that $X_{0}=0$ . In this case, the backward equation can be integrated analytically. This allows to compute the variational scaling factors as

[TABLE]

Inserting this into (23) we obtain

[TABLE]

For this example, the true posterior intensities can be calculated analytically (Huang et al., 2016) leading to the same result. Hence, the variational approximation coincides with the exact smoothed process in this simple special case. A graph of the first moment and the corresponding standard deviation can be found in Fig. 1.

Now consider the same example where instead of a terminal condition we have obtained a single observation $Y_{T}=y$ with zero mean additive Gaussian noise of variance $\sigma^{2}$ . In this case, the objective function becomes more complicated, since the second order moment has to be included in order to describe the observations. Consequently, there seems to be no simple closed form expressions to characterize the variational posterior. We ran our algorithm for different values of $\sigma^{2}$ and verified empirically that the solution converges to the analytic expression for the exact terminal constraint case with $\sigma\rightarrow 0$ . Fig. 1 shows the mean of the prior process and the noise-free conditional process compared two two observations with different noise levels. As expected, the variational method seems to interpolate between the observation and the prior process.

5.2 Gene Expression System

Consider the following simple gene expression model consisting of a single gene that can switch between an inactive state $G^{0}$ and an active state $G_{1}$ with rates $c_{1}$ and $c_{2}$ . While the gene is active, mRNA is produced at rate $c_{3}$ (transcription). The mRNA is then processed by ribosomes to produce the target protein at rate $c_{5}$ (translation). Both mRNA and protein molecules undergo degradation reactions with rates $c_{4}$ and $c_{6}$ respectively.

The stochastic behavior of this system is modeled by a population MJP $X$ with state space $\mathcal{X}\subset\mathbb{N}_{0}^{3}$ where the components correspond to the numbers of active genes $X_{1}$ , mRNA molecules $X_{2}$ and protein molecules $X_{3}$ . The nonzero elements of the transition function $Q$ are given by

[TABLE]

In typical applications, the produced protein molecule is a fluorescent reporter. Measurements of the reporter concentration can be obtained from live-cell microscopy by integrating fluorescence intensity over the cross-section of the cell. Thus, these measurements are a scaled and noisy observation of the process component corresponding to the protein abundance. For simplicity, we assume that the noisy observations $Y_{1},\ldots,Y_{n}$ are conditionally Gaussian given the protein abundance. To construct our variational approximation, we choose the population-type partitioning and obtain closed form moment equations that can be found in Section 5 of the supplement.

The results for variational posterior mean and variance are shown in Fig. 2. The example demonstrates that the posterior mean concentrates on the true realization of the latent process. Thus, the method allows to infer latent state dynamics, in particular, the unobserved activation patterns of the gene are recovered quite well. To investigate this further, we used the posterior mean of the gene state to devise a detector (threshold: 0.5) and evaluated the true positive fraction $\alpha=0.94$ and false positive fraction $\beta=0.15$ from $N=100$ pooled trajectories. We note that false detections are mainly caused by short events. For example, if the gene is briefly inactive during a longer period of activity, this may not be visible in the protein activity and even less so in the observations.

Fig 2 also reveals a drawback of our method. From an exact smoothing, we would expect the variance of the posterior to be smaller at times of observations than in between the observations. We do not observe this behavior in our approach. The most likely reason for this is that a single time-dependent factor per reaction channel does not provide enough degrees of freedom to capture this effect. In the end, our approach reduces the full smoothing problem to a set of nine ordinary differential equations.

5.3 Predator-Prey Dynamics

To demonstrate our approach on a non-linear model, we consider a stochastic form of the classical predator prey interaction model. Here, a prey species $X_{1}$ and a predator species $X_{2}$ interact with each other as defined by the transition function $Q$ with non-zero elements

[TABLE]

Note that this system is unstable in the sense that, in the long run, we observe either extinction of both species or extinction of the predator and explosion of the prey population. Using our population-type partitioning, we obtain a set of moment equations that is not closed. In Supplement Sec. 5, we demonstrate how moment closure can be employed to obtain a closed system.

An application of the variational smoother to synthetic data of the predator prey model is given in Fig. 3. Due to the low abundances, it is possible to perform a truncation of the state space and solve the smoothing problem by integrating (4) backward in time. Comparing Fig. 3 to exact smoothing results reveals that the variance of the approximate posterior is higher than the variance of the exact smoothed process.

6 Discussion

On the basis of a partitioning of the transition space, we proposed a flexible framework to construct variational inference procedures for Markov jump processes. By defining the variational family as a modification of the prior process, our approach circumvents problems with absolute continuity that have plagued earlier product-type mean-field approaches. The main advantage of our method is, however, that the complexity of the approximation depends on the chosen partition of the transition space and can thus be adapted to the problem at hand. To get some insight into the choice of partition, we discussed classes of structured MJPs that give rise to a natural partitioning. As other variational procedures, moment based variational inference can be used for approximate parameter inference. An interesting observation here is that a suitable choice of partitioning may naturally lead to a gamma variational distribution of the parameters. Finally, we demonstrated the method on synthetic examples of the population type.

Our work opens several lines of future research. First of all, we envision applications to real data, e.g. in the context of systems and synthetic biology. On the theoretical side, it may be interesting to assess the effect of different partition and moment closure choices on the approximation quality. Finally, a more systematic comparison with related methods is required.

Acknowledgements

The authors thank the anonymous reviewers for their useful comments and suggestions. This work was supported by the European Research Council (ERC) within the CONSYN project, grant agreement number 773196, and by the Hessian research priority programme LOEWE within the project CompuGene.

Appendix A Additional Derivations

A.1 Decomposition of the Divergence

Here, we show that the divergence between an approximate process and a posterior process can be decomposed into the divergence of the approximate process and the prior process plus a contribution of of the observations. More specifically, we prove (6).

Lemma 1.

The divergence in (6) can be written as

[TABLE]

and hence the variational problem is independent of the exact functional form of the posterior intensities.

To see this, let $Z(t)$ be a MJP with marginal distribution $p^{Z}$ and time-dependent transition function $\lambda$ . To simplify the derivation, it is useful to introduce operators $\mathcal{L},\mathcal{L}^{\dagger}$ defined by their action on a function $f$ of a suitable class

[TABLE]

In analogy to(1) we get for a function $g:\mathcal{X}\times[0,T]\rightarrow\mathbb{R}$ ,

[TABLE]

where the second expression comes from the explicit time dependence of the function $g$ . Note that (A.2) can be obtained by inserting the distribution function $p^{Z}$ , using the product rule of differentiation and inserting the master equation obeyed by $p^{Z}$ . Following the notation of the main text, we will denote the time independent transition function of the prior process by $Q$ . Starting from the divergence of two time dependent MJPs (2) and inserting the posterior intensities yields

[TABLE]

with

[TABLE]

Now observe that by (A.1) the expectation within $J_{4}$ can be written as

[TABLE]

By addition of a zero, we can rewrite the term $J_{1}$ as

[TABLE]

The term in the numerator within the first expectation corresponds to the left hand side of the backward equation obeyed by $\sigma$ (c.f. (4)). Inserting this and using $\frac{d}{dx}\log f(x)=\frac{1}{f(x)}\frac{d}{dx}f(x)$ leads to

[TABLE]

Inserting the relations (A.6) and (A.4) into (A.3) and exploiting (A.2) we get

[TABLE]

Due to the jump conditions of the backward equation, the integral on the right hand side evaluates to

[TABLE]

The terms in the first line correspond to the contributions of the terminal and the initial time and can be ignored. For the terminal time, this follows directly from the terminal constraint $\sigma(x,T)=1$ for all $x$ . The contribution of the initial term, we observe that by definition of $\sigma$ we have

[TABLE]

which is constant with respect to the variational transition function $\lambda$ since the initial distribution is fixed. Note that in the usual language of variational inference $\log Z$ corresponds to the marginal log likelihood of the data. Finally, exploiting the reset conditions of the backward equation, we get

[TABLE]

In summary, we get

[TABLE]

which is in line with the usual decomposition of the KL divergence into the evidence $\log Z$ and a free energy contribution $L=\sum_{k=1}^{n}\mathsf{E}[\log p(y_{k}\mid Z(t_{k}))]-D\left[P^{Z}\,||\,P^{X}\right]$ (Blei et al., 2017).

A.2 Gradient Descent for MBVI

Maximum Principle

Our goal is to solve the variational problem in the form of the control problem as given in (13). However, here we consider a slightly more general scenario in which we do not solve for the natural moments $\varphi$ directly. Instead, we choose a collection of moment function $\psi$ such that the natural moments can be represented by

[TABLE]

for a suitable map $g$ . Note that we can always find such a collection of moments by choosing $\psi=\varphi$ and $g=\mathrm{Id}$ . In this formulation, control problem (13) becomes

[TABLE]

where we assume the initial conditions as fixed and known. We follow the indirect approach known from optimal control and variational calculus by introducing the Langrange multiplier functions (or co-states) $\eta_{i}$ , $i=1,\ldots,r$ to enforce the ODE relation between $\psi$ and $\lambda$ . We obtain the Lagrangian functional

[TABLE]

where we have stacked the $\eta_{i}$ into a single vector $\eta$ . Since the functional $F$ only acts on the discrete observation times, we will ignore it for a moment. Computing the functional derivative with respect to $\psi$ and setting it to zero leads to the co-state equations

[TABLE]

valid in between the observations. At the point of the observations, the functional $F$ will induce jump conditions for $\eta$ given by

[TABLE]

It is therefore crucial that we express the expected log likelihood with respect to the variational process in terms of the moment functions $\psi$ . If this is not naturally possible, we may enlarge the space of moment functions suitably. Next, consider the functional derivatives with respect to $\lambda_{i}$ leading to

[TABLE]

The equations (A.10), (A.11), (A.12) together with the ODE for $\psi$ from a set of necessary conditions for optimal solutions known as Pontryagin’s maximum (minimum) principle. Note that since the function $f$ is typically linear in $\lambda$ , it may be possible to solve (A.12) for $\lambda$ and eliminate it in (A.10). In general, the benefit of this questionable because the resulting ODE for $\eta$ is highly non-linear.

Gradient Based Optimization

A simple approach to solve to obtain a numerical solution from the maximum principle is the forward backward sweep (Mcasey et al., 2012). Here, one starts with an initial guess for $\lambda$ and solves the forward equation. The forward solution is then used to solve (A.10) backward in time. We may then solve (A.12) for $\lambda$ to obtain an update given the forward and backward solution. Iterating this procedure may lead to a stationary point of the functional.

In our applications, this procedure turned out to be highly unstable, probably because the updates are too large in the geometry of the probabilistic manifold defined by $\lambda$ . We therefore modified the procedure as follows. We keep the forward an backward solution steps, but instead of solving (A.12) for $\lambda$ , we understand the r.h.s of (A.12) as the gradient of the functional $L$ when considered as a function of $\lambda$ alone. More explicitly, we use

[TABLE]

Using (A.13) allows to apply a gradient descent type algorithm in the scaling factors $\lambda$ by using update steps of the form

[TABLE]

where $h$ is the step size. An algorithmic description of this procedure in form of pseudo code is given in Alg. 2.

Natural Gradient

It is well-known that gradient-based algorithms can perform poorly on manifolds. One solution to this is to incorporate the local geometry of the manifold via its metric tensor $G$ . Now consider a family of probability distributions $p_{\theta}$ parametrized by $\theta$ . Then the set of all $\theta$ spans a manifold whose geometric structure is given by the Fisher information matrix (Amari, 1998). This defines the so called natural gradient

[TABLE]

Discrete update steps using the natural gradient corresponds to an approximate steepest descent with respect to the local geometry of $p_{\theta}$ . In order to transfer this setting, we exploit a connection between the KL divergence and the fisher information metric

[TABLE]

that is the metric $G$ arises from second order expansion of the KL divergence in the parameter of the second argument. While the Fisher information matrix requires a finite dimensional parameter $\theta$ , the second order expansion of the KL divergence can be transferred to the path space setting. Thus, consider the variational family corresponding to a partition $\Pi$ of the transition space and consider variational scaling factors $\lambda$ , $\lambda^{\prime}$ . Then the divergence of $Z^{\lambda}$ and $Z^{\lambda^{\prime}}$ as in Section 3.1 of the main text can be written as

[TABLE]

where the $\varphi_{i}$ depend on $\lambda$ as they are defined as expected values with respect to $Z^{\lambda}$ . Now rewrite the logarithmic term as

[TABLE]

If $\sup_{t\in[0,T]}|\lambda^{\prime}(t)-\lambda(t)|$ is small, we may use the standard approximation

[TABLE]

Applying (A.16) to (A.15) and inserting the result into (A.2) causes the linear terms to cancel and we are left with

[TABLE]

We can understand the above expression as the infinitesimal distance between the path distributions corresponding to the variational parameters $\lambda$ and $\lambda^{\prime}$ for a fixed partition. From the second order expansion (A.17), in combination with (A.12), we get the natural gradient

[TABLE]

where $\varphi=g(\psi)$ and $\eta$ are evaluated for the current value of $\lambda$ via the forward and backward equations. To perform the optimization, we simply have to replace the gradient evaluation in Alg. 2 by (A.18). Doing so not only increased speed and reliability of the optimization but also led to a visually smoother transition from the prior to the posterior in all considered examples.

A.3 Parameter Inference

Statistics for Expectation Maximization

By the modified definition of the $\varphi_{i}$ as

[TABLE]

The functional $L$ changes slightly to

[TABLE]

By breaking (A.19) down into individual terms, it becomes clear how the summary statistics

[TABLE]

arise. Considering $L$ as a function of $\theta$ for fixed $\lambda$ and $\varphi$ , we may write

[TABLE]

From the last expression, we obtain a stationarity condition by differentiating with respect to $\theta$ .

Bayesian Approach

Consider a general scenario where with a hierarchical model given by

[TABLE]

where we aim to approximate the joint posterior $p(\theta,x\mid y)$ by a product $q(\theta)q(x)$ . It is straightforward to show that the optimal parameter posterior $q^{*}(\theta)$ satisfies

[TABLE]

Transferred to our setting, the expression within the exponential becomes the functional $L$ and since we only care about the parts depending on $\theta$ , we may as well insert (A.21) leading to (22).

Appendix B Examples

Here we provide explicit forms of the variational functionals of the different examples and the corresponding stationarity conditions.

B.1 Gene Expression Model

For the gene expression model, the functions $\varphi_{i}$ can be expressed in terms of the first order expectations

[TABLE]

Due to the Gaussian observation model, we also require second order moments

[TABLE]

It is therefore convenient to parametrize the dynamical system in these terms. For the first order moments, we obtain the equations

[TABLE]

For three species, we get 6 additional second order equations

[TABLE]

B.2 Predator Prey Model

As before, we take into account the first and second order moments and choose a corresponding parametrization. Suppressing the explicit time arguments for the sake of readability, the resulting system is given by

[TABLE]

As is typical for systems with non-linear intensity functions, the moment equation up to order three are not closed but depend on the (non-central) third-order moments $m_{112}$ and $m_{122}$ . In order to obtain a finite dimensional system, we have to use a moment closure method that expresses the higher order moments in terms of lower order moments

[TABLE]

While many standard moment closure approaches use ad hoc choices for the closure functions $V_{1},V_{2}$ , we follow the recently proposed variational moment closure approach (Bronstein & Koeppl, 2018) that allows for a systematic derivation of closure functions based on a set of moment functions and a distributional ansatz. In particular we use a log-normal product Poisson mixture distribution and get

[TABLE]

and a similar expression for $\mathsf{E}[X_{1}X_{2}^{2}]$ .

Bibliography26

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Amari (1998) Amari, S.-I. Natural Gradient Works Efficiently in Learning. Neural Comput. , 10(2):251–276, 1998.
2Anderson & Kurtz (2015) Anderson, D. F. and Kurtz, T. G. Stochastic Analysis of Biochemical Systems . Stochastics in Biological Systems. Springer, New York, 2015.
3Anderson (1991) Anderson, W. J. Continuous-time Markov chains : an applications oriented approach , volume 7 of Springer series in statistics. Probability and its applications . New York [u.a.], 1991.
4Andrieu et al. (2010) Andrieu, C., Doucet, A., and Holenstein, R. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 72(3):269–342, 2010.
5Archambeau et al. (2007) Archambeau, C., Opper, M., Shen, Y., Cornford, D., and Shawe-Taylor, J. Variational Inference for Diffusion Processes. In Proceedings of the 20th International Conference on Neural Information Processing Systems , pp. 17–24. Curran Associates Inc., 2007.
6Bladt & Sørensen (2005) Bladt, M. and Sørensen, M. Statistical inference for discretely observed Markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 67(3):395–410, 2005.
7Blei et al. (2017) Blei, D. M., Kucukelbir, A., and Mc Auliffe, J. D. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association , 112(518):859–877, 2017.
8Bonnard & Caillau (2006) Bonnard, B. and Caillau, J.-B. Introduction to Nonlinear Optimal Control. In Loría, A., Lamnabhi-Lagarrigue, F., and Panteley, E. (eds.), Advanced Topics in Control Systems Theory: Lecture Notes from FAP 2005 , pp. 1–60. Springer London, 2006.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Moment-Based Variational Inference for Markov Jump Processes

Abstract

1 Introduction

2 Background

2.1 Markov Jump Processes

2.2 Path Divergence

2.3 Conditional Processes

2.4 Variational Smoothing

3 Moment-Based Variational Smoothing

3.1 Variational Process Family

3.2 Moment-Based Divergence

3.3 Special Classes

State Space Lumping.

Data Driven Partitioning.

Agent-Based MJPs.

Population-Based MJPs.

3.4 Minimizing the Divergence.

4 Parameter Inference

5 Examples

5.1 Birth Death Process

5.2 Gene Expression System

5.3 Predator-Prey Dynamics

6 Discussion

Acknowledgements

Appendix A Additional Derivations

A.1 Decomposition of the Divergence

Lemma 1**.**

A.2 Gradient Descent for MBVI

Maximum Principle

Gradient Based Optimization

Natural Gradient

A.3 Parameter Inference

Statistics for Expectation Maximization

Bayesian Approach

Appendix B Examples

B.1 Gene Expression Model

B.2 Predator Prey Model

Lemma 1.