Numerical computation of rare events via large deviation theory

Tobias Grafke; Eric Vanden-Eijnden

arXiv:1812.00681·cond-mat.stat-mech·July 24, 2019

Numerical computation of rare events via large deviation theory

Tobias Grafke, Eric Vanden-Eijnden

PDF

TL;DR

This paper reviews numerical algorithms based on large deviation theory for computing rare event probabilities, discussing their implementation, challenges, and extensions across various stochastic systems.

Contribution

It introduces new generalizations and improvements to minimum action methods for rare event computation and explores their integration with importance sampling techniques.

Findings

01

Algorithms effectively compute rare event probabilities in diverse systems.

02

Handling degenerate or multiplicative forcing presents specific challenges.

03

Extensions to non-Gaussian noises and infinite-dimensional systems are feasible.

Abstract

An overview of rare events algorithms based on large deviation theory (LDT) is presented. It covers a range of numerical schemes to compute the large deviation minimizer in various setups, and discusses best practices, common pitfalls, and implementation trade-offs. Generalizations, extensions, and improvements of the minimum action methods are proposed. These algorithms are tested on example problems which illustrate several common difficulties which arise e.g. when the forcing is degenerate or multiplicative, or the systems are infinite-dimensional. Generalizations to processes driven by non-Gaussian noises or random initial data and parameters are also discussed, along with the connection between the LDT-based approach reviewed here and other methods, such as stochastic field theory and optimal control. Finally, the integration of this approach in importance sampling methods using…

Tables1

Table 1. Table 1: Relative error of the naive estimator and the cloning estimator for different values of ε 𝜀 \varepsilon . Each value is generated from 1000 experiments with 1000 copies each. The relative variance of the naive estimator diverges for ε → 0 → 𝜀 0 \varepsilon\to 0 .

$ε$	1.0	0.5	0.2	0.1	0.09	0.08	0.07	0.06	0.05	0.01
$e_{naive}$	0.0859	0.1399	0.4222	2.4839	3.3331	6.0061	9.4868	18.2391	31.6228	—
$e_{cloning}$	0.1487	0.1590	0.1685	0.1729	0.1849	0.1816	0.1836	0.1928	0.1969	0.2749

Equations314

d X_{t}^{ε} = b (X_{t}^{ε}) d t + ε σ d W_{t}, X_{0}^{ε} = x, t \geq 0,

d X_{t}^{ε} = b (X_{t}^{ε}) d t + ε σ d W_{t}, X_{0}^{ε} = x, t \geq 0,

P\big{\{}\sup_{t\in[0,T]}\|X_{t}^{\varepsilon}-\phi(t)\|<\delta\big{\}}\asymp\exp\left(-\varepsilon^{-1}S_{T}(\phi)\right)\,,

P\big{\{}\sup_{t\in[0,T]}\|X_{t}^{\varepsilon}-\phi(t)\|<\delta\big{\}}\asymp\exp\left(-\varepsilon^{-1}S_{T}(\phi)\right)\,,

S_{T} (ϕ) = {\int_{0}^{T} L (ϕ, \dot{ϕ}) d t \infty if the integral converges, otherwise.

S_{T} (ϕ) = {\int_{0}^{T} L (ϕ, \dot{ϕ}) d t \infty if the integral converges, otherwise.

L (ϕ, \dot{ϕ}) = \frac{1}{2} ∥ \dot{ϕ} - b (ϕ) ∥_{a}^{2}

L (ϕ, \dot{ϕ}) = \frac{1}{2} ∥ \dot{ϕ} - b (ϕ) ∥_{a}^{2}

ϕ^{*} (t) = ϕ \in C argmin S_{T} (ϕ)

ϕ^{*} (t) = ϕ \in C argmin S_{T} (ϕ)

\lim_{\varepsilon\to 0}P\big{\{}\sup_{t\in[0,T]}\|X_{t}^{\varepsilon}-\phi^{*}(t)\|<\delta\ \big{|}\ X_{T}^{\varepsilon}\in A\big{\}}=1\,.

\lim_{\varepsilon\to 0}P\big{\{}\sup_{t\in[0,T]}\|X_{t}^{\varepsilon}-\phi^{*}(t)\|<\delta\ \big{|}\ X_{T}^{\varepsilon}\in A\big{\}}=1\,.

V (x, y) = T > 0 in f ϕ \in C_{x, y} min S_{T} (ϕ),

V (x, y) = T > 0 in f ϕ \in C_{x, y} min S_{T} (ϕ),

ρ (x) ≍ exp (- ε^{- 1} V (\overset{x}{ˉ}, x)) .

ρ (x) ≍ exp (- ε^{- 1} V (\overset{x}{ˉ}, x)) .

τ_{i, j} = \mathds E in f {t > 0 ∣ X (0) = \overset{x}{ˉ}_{i}, ∥ X (t) - \overset{x}{ˉ}_{j} ∥ < δ}

τ_{i, j} = \mathds E in f {t > 0 ∣ X (0) = \overset{x}{ˉ}_{i}, ∥ X (t) - \overset{x}{ˉ}_{j} ∥ < δ}

τ_{i, j} ≍ exp (ε^{- 1} V (\overset{x}{ˉ}_{i}, \overset{x}{ˉ}_{j})) .

τ_{i, j} ≍ exp (ε^{- 1} V (\overset{x}{ˉ}_{i}, \overset{x}{ˉ}_{j})) .

\frac{\partial L}{\partial ϕ} - \frac{d}{d t} \frac{\partial L}{\partial ϕ ˙} = 0,

\frac{\partial L}{\partial ϕ} - \frac{d}{d t} \frac{\partial L}{\partial ϕ ˙} = 0,

a^{- 1} \ddot{ϕ} + (a^{- 1} \nabla b (ϕ) - \nabla b (ϕ)^{⊤} a^{- 1}) \dot{ϕ} + \nabla ⟨ b (ϕ), a^{- 1} b (ϕ)⟩ = 0 .

a^{- 1} \ddot{ϕ} + (a^{- 1} \nabla b (ϕ) - \nabla b (ϕ)^{⊤} a^{- 1}) \dot{ϕ} + \nabla ⟨ b (ϕ), a^{- 1} b (ϕ)⟩ = 0 .

θ = \frac{\partial L ( ϕ , ϕ ˙ )}{\partial ϕ ˙},

θ = \frac{\partial L ( ϕ , ϕ ˙ )}{\partial ϕ ˙},

H (ϕ, θ) = y sup (⟨ θ, y ⟩ - L (ϕ, y)),

H (ϕ, θ) = y sup (⟨ θ, y ⟩ - L (ϕ, y)),

L (ϕ, \dot{ϕ}) = θ sup (⟨ \dot{ϕ}, θ ⟩ - H (ϕ, θ)) .

L (ϕ, \dot{ϕ}) = θ sup (⟨ \dot{ϕ}, θ ⟩ - H (ϕ, θ)) .

{\dot{ϕ} = \nabla_{θ} H (ϕ, θ) \dot{θ} = - \nabla_{ϕ} H (ϕ, θ) .

{\dot{ϕ} = \nabla_{θ} H (ϕ, θ) \dot{θ} = - \nabla_{ϕ} H (ϕ, θ) .

H (ϕ, θ) = ⟨ b (ϕ), θ ⟩ + \frac{1}{2} ⟨ θ, a θ ⟩,

H (ϕ, θ) = ⟨ b (ϕ), θ ⟩ + \frac{1}{2} ⟨ θ, a θ ⟩,

{\dot{ϕ} = b (ϕ) + a θ \dot{θ} = - (\nabla b (ϕ))^{⊤} θ .

{\dot{ϕ} = b (ϕ) + a θ \dot{θ} = - (\nabla b (ϕ))^{⊤} θ .

d H / d t = ⟨ \nabla_{ϕ} H, \dot{ϕ} ⟩ + ⟨ \nabla_{θ} H, \dot{θ} ⟩ = 0 .

d H / d t = ⟨ \nabla_{ϕ} H, \dot{ϕ} ⟩ + ⟨ \nabla_{θ} H, \dot{θ} ⟩ = 0 .

S (ϕ) = \int_{0}^{\infty} L (ϕ, \dot{ϕ}) d t = \int_{0}^{\infty} (⟨ \dot{ϕ}, θ ⟩ - H (ϕ, θ)) d t = \int ⟨ θ, d ϕ ⟩ .

S (ϕ) = \int_{0}^{\infty} L (ϕ, \dot{ϕ}) d t = \int_{0}^{\infty} (⟨ \dot{ϕ}, θ ⟩ - H (ϕ, θ)) d t = \int ⟨ θ, d ϕ ⟩ .

d X_{t}^{ε} = - \nabla U (X_{t}^{ε}) d t + 2 ε d W_{t} .

d X_{t}^{ε} = - \nabla U (X_{t}^{ε}) d t + 2 ε d W_{t} .

S_{T} (ϕ)

S_{T} (ϕ)

= \int_{0}^{T} ∣ \dot{ϕ} - \nabla U (ϕ) ∣^{2} d t + \int_{0}^{T} ⟨ \dot{ϕ}, \nabla U (ϕ)⟩ d t

= \int_{0}^{T} ∣ \dot{ϕ} - \nabla U (ϕ) ∣^{2} d t + ϕ (T) - ϕ (0) .

ϕ^{*} (t) = ϕ \in C_{x, y} argmin S_{T} (ϕ),

ϕ^{*} (t) = ϕ \in C_{x, y} argmin S_{T} (ϕ),

V (x, y) = T > 0 in f ϕ \in C_{x, y} min S_{T} (ϕ),

V (x, y) = T > 0 in f ϕ \in C_{x, y} min S_{T} (ϕ),

S (ϕ) = \int_{x}^{y} ⟨ θ, d ϕ ⟩ .

S (ϕ) = \int_{x}^{y} ⟨ θ, d ϕ ⟩ .

ϕ^{*} = ϕ \in C_{x, y} argmin θ : H (ϕ, θ) = 0 sup E (ϕ, θ),

ϕ^{*} = ϕ \in C_{x, y} argmin θ : H (ϕ, θ) = 0 sup E (ϕ, θ),

E (ϕ, θ) = \int_{0}^{1} ⟨ θ, ϕ^{'} ⟩ d s .

E (ϕ, θ) = \int_{0}^{1} ⟨ θ, ϕ^{'} ⟩ d s .

E^{*} (ϕ) = θ : H (ϕ, θ) = 0 sup E (ϕ, θ)

E^{*} (ϕ) = θ : H (ϕ, θ) = 0 sup E (ϕ, θ)

ϕ^{'} = μ \nabla_{θ} H (ϕ, θ),

ϕ^{'} = μ \nabla_{θ} H (ϕ, θ),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Numerical computation of rare events via large deviation theory

Tobias Grafke

Mathematics Institute, University of Warwick, Coventry CV4 7AL, United Kingdom

[email protected]

Eric Vanden-Eijnden

Courant Institute, New York University, 251 Mercer Street, New York, NY 10012, USA

[email protected]

Abstract

An overview of rare events algorithms based on large deviation theory (LDT) is presented. It covers a range of numerical schemes to compute the large deviation minimizer in various setups, and discusses best practices, common pitfalls, and implementation trade-offs. Generalizations, extensions, and improvements of the minimum action methods are proposed. These algorithms are tested on example problems which illustrate several common difficulties which arise e.g. when the forcing is degenerate or multiplicative, or the systems are infinite-dimensional. Generalizations to processes driven by non-Gaussian noises or random initial data and parameters are also discussed, along with the connection between the LDT-based approach reviewed here and other methods, such as stochastic field theory and optimal control. Finally, the integration of this approach in importance sampling methods using e.g. genealogical algorithms is explored.

I Introduction
I.1 Freidlin-Wentzell theory of large deviations
I.2 Hamiltonian principle and connections to classical mechanics and field theory
I.3 Detailed balance and gradient flows
II Rare event algorithms for noise-induced transitions
II.1 A simplified geometric minimum action method
II.2 Example: Metastability in a simple climate model
II.3 Instantons for stochastic partial differential equations
II.4 Example: The stochastic Burgers-Huxley equation
III Rare event algorithms for expectations and extreme events
III.1 Instantons for expectations and extreme events
III.2 Example: Extreme concentration of prey in the Lotka-Volterra model
III.3 Example: Extreme amplitudes in the Korteweg-de Vries equation
III.4 Connections to optimal control
III.5 Improvements and implementation considerations
III.6 Geometric version of the Hamilton’s equations
III.7 Example: Extreme gradients of the stochastic Burgers equation
IV Generalizations to the non-Gaussian case
IV.1 Large deviation principles as WKB approximation
IV.2 Example: Genetic switch
V Systems with random parameters and extreme events
V.1 Large deviations for systems with random parameters
V.2 Example: Optimal excitation of the Fitzhugh-Nagumo model
VI Instantons as part of other rare event algorithms
VI.1 Instantons for importance sampling
VI.2 Instantons for cloning algorithms
VI.3 Example: Epidemiology and vaccination at birth
VII Conclusion

Rare events often have a drastic impact despite their low frequency of occurrence. Examples include hurricanes, financial crises, heat waves, or tsunamis, that are few and far between but have devastating consequences. Other important phenomena such as phase transitions, chemical reactions, or conformational changes of biomolecules also involve rare events. The accurate description of these events is complicated, since their low rate of occurrence makes them hard to observe both in experiments and in simulations. In many cases, when a rare event occurs, it does so in its least unlikely form, the instanton, rendering all other realizations of the same event negligible in comparison. Whenever such a situation holds, a large deviation principle (LDP) quantifies this concentration phenomenon. The LDP specifies a deterministic optimization problem to identify the instanton, and allows the estimation of its probability. In this review, we discuss numerical algorithms to solve the large deviation optimization problem, compare their associated trade-offs, and present best practices, pitfalls, improvements, and generalizations.

I Introduction

Rare but important events are by definition difficult to observe, both in experiments and in simulations. In order to design efficient schemes for the numerical computation of these events one therefore typically resorts to one of the following two strategies: either manipulation of the system in a controlled way that makes rare events more likely and can be corrected a posteriori; or computation of a single dominant event characterizing the possible ways the rare event happens. The first approach can be categorized as importance sampling; the second can be justified within large deviation theory (LDT) and leads to an action minimization problem to be solved. In this review, we focus mainly on algorithmic developments in the second class, and discuss the interplay between this LDT-based approach and importance sampling towards the end of our paper. Specifically, in section II we discuss rare event algorithms based on the global minimization of LDT action functionals, suitable for computing paths by which infrequent transitions between two prescribed states occur. Subsequently, in section III, we explain how to calculate large deviation minimizers in the context of the estimation of rare expectations dominated by tail statistics. In section IV, we generalize these two approaches to the non-Gaussian case. In section V, we demonstrate a generalization to arbitrary dynamical systems with random initial conditions and parameters. Section VI suggests possibilities to use the minimizing trajectories obtained by the earlier algorithms as input for importance sampling algorithms. Finally, some concluding remarks are presented in section VII.

In the reminder of the present section we review the aspects of LDT relevant to our purpose, with focus on Freidlin-Wentzell theory for dynamical systems subject to random noise of low amplitude.

I.1 Freidlin-Wentzell theory of large deviations

Consider a dynamical system with variables $X_{t}^{\varepsilon}$ in $\mathds{R}^{d}$ , subject to small random perturbations that are additive Gaussian and white in time. Assuming that the noise amplitude scales with the smallness-parameter $\varepsilon$ , the evolution of the stochastic variables $X_{t}^{\varepsilon}$ is described by the stochastic differential equation (SDE)

[TABLE]

with deterministic drift $b:\mathds{R}^{d}\to\mathds{R}^{d}$ , noise covariance $a=\sigma\sigma^{\top}$ with $\sigma\in\mathds{R}^{d\times d}$ , and where $W_{t}$ is a $d$ -dimensional Wiener process – for simplicity, we assume that $\sigma$ is independent of the system’s position (i.e. the noise is additive): the generalization to multiplicative noise is straightforward and will be discussed through examples. We are interested in situations where the stochastic process (1) realizes a certain event, for example when the trajectory ends at time $T$ in a given set $A\subset\mathds{R}^{d}$ , so that $X_{T}^{\varepsilon}\in A$ . Even if these events are impossible in the deterministic system ( $\varepsilon=0$ ), they will in general occur in the presence of noise ( $\varepsilon>0$ ) but they become rarer and rarer in the low noise limit, $\varepsilon\to 0$ .

Large deviation theory (LDT) gives a precise characterization of this decay of probability: The probability of observing any sample path close to a given function $\phi(t)$ can be estimated as

[TABLE]

for small enough $\delta>0$ , where $\asymp$ denotes log-asymptotic equivalence (i.e. for $\varepsilon\to 0$ , the ratio of the logarithms of both sides converges to 1). The functional $S_{T}(\phi)$ is called the rate function or action functional, and it is generally given by

[TABLE]

Here we defined the Lagrangian $L(\phi,\dot{\phi})$ , which for the concrete example of equation (1) is given by:

[TABLE]

via the $a$ -metric induced by the inner product $\|v\|_{a}^{2}=\langle v,a^{-1}v\rangle$ (assuming invertibility of $a$ for simplicity).

The probability of observing the event $X_{T}^{\varepsilon}\in A$ consists of contributions of the sample paths close to all the possible $\phi(t)\in\mathscr{C}=\{\phi(t)\in C([0,T],\mathds{R}^{d})\ |\ \phi(0)=x,\phi(T)\in A\}$ , and each of these contributions scales according to equation (2). Consequently, in the limit $\varepsilon\to 0$ , the only contribution that matters is that coming from the trajectory $\phi^{*}(t)$ with the smallest action $S_{T}(\phi^{*})$ . We call

[TABLE]

the maximum likelihood pathway (MLP) or instanton. It constitutes the least unlikely trajectory to realize the rare event, in the sense that almost surely all sample paths conditioned on the rare event will be arbitrarily close to $\phi^{*}(t)$ . More precisely, for all $\delta>0$ sufficiently small, we have

[TABLE]

The efficient numerical solution of the minimization problem (5) for different rare events (and therefore different sets of trajectories $\mathscr{C}$ to minimize over) lies at the core of this work.

If the action at the instanton, $S_{T}(\phi^{*})$ is zero, the corresponding trajectory fulfills $\dot{\phi}=b(\phi)$ and can be considered deterministic, i.e. the corresponding evolution is the one selected by the deterministic dynamics ( $\varepsilon=0$ ). If, on the other hand, the action at the instanton is finite, the probability of observing the corresponding event in a given time frame $T$ decays to zero as indicated by equation (2).

LDT additionally permits the analysis of the effect of infinitesimal perturbations over an infinite time interval, $T\to\infty$ , on which these rare events almost surely happen. The central object in this context is the quasipotential, defined as

[TABLE]

where $\mathscr{C}_{x,y}=\{\phi\in C([0,T],\mathds{R}^{d})\ \|\ \phi(0)=x,\phi(T)=y\}$ . The quasipotential characterizes the long time behavior of the system. For example, if the deterministic system $\dot{X}=b(X)$ possesses only one single stable fixed point $\bar{x}$ , with basin of attraction $\mathds{R}^{d}$ , then the density $\rho(x)$ associated with the invariant measure of equation (1) can be written in the limit $\varepsilon\to 0$ as

[TABLE]

Similarly, in situations where the deterministic system has multiple fixed points $\bar{x}_{i}$ , the mean first passage time $\tau_{i,j}$ between the basins of attraction of neighboring $\bar{x}_{i}$ and $\bar{x}_{j}$ ,

[TABLE]

with $\delta>0$ small enough, can be estimated in the small noise limit as

[TABLE]

This result also allows the investigation of the long time dynamics of the system by mapping it onto a Markov jump process whose states are the fixed points $\bar{x}_{i}$ , $\bar{x}_{j}$ , etc. and whose transition rates are $k_{i,j}\asymp\tau_{i,j}^{-1}$ .

All these examples show that it is useful to have access to the minimizing trajectory $\phi^{*}$ to describe rare events: First it gives the typical way a rare event is observed, enabling us to identify their mechanism. Second it allows the estimation of their probability of occurrence, and their expected recurrence time. Third it gives the relative probability of multiple typical (i.e. deterministically stable) states, and the most likely way by witch transitions between them occur.

I.2 Hamiltonian principle and connections to classical

mechanics and field theory

The minimization problem in equation (5) to find the instanton precisely corresponds to Hamilton’s principle from classical mechanics, $\delta S_{T}(\phi)/\delta\phi=0$ . As a consequence, the methods and ideas from classical mechanics are transferable to our situation. In particular, the variational problem can be solved by seeking solutions of the corresponding Euler-Lagrange equation,

[TABLE]

which, for a system of the type (1), gives

[TABLE]

Several algorithms presented below aim at the numerical solution of the second order equation (12).

Similarly inspired by classical mechanics, we can define a conjugate momentum

[TABLE]

and a Hamiltonian as Fenchel-Legendre transform of the Lagrangian,

[TABLE]

such that, assuming convexity of $L(\phi,\dot{\phi})$ in $\dot{\phi}$ ,

[TABLE]

The minimization (5) is then equivalent to solving Hamilton’s equations of motion, or instanton equations,

[TABLE]

For the system (1), the Hamiltonian is given by

[TABLE]

so that the instanton equations read

[TABLE]

Solving the instanton equations (16) constitutes another possible approach to solving the minimization problem (5), but care has to be taken to obtain the correct boundary conditions for (16), depending on the rare event under consideration—this point will be discussed at length below and we will see that these boundary conditions make working with (12) more appropriate in some cases and with (18) in others.

The Hamiltonian $H(\phi,\theta)$ is a conserved quantity along the minimizing trajectory, since

[TABLE]

Additional simplifications apply in the special case that the minimizing trajectory starts at rest at a fixed point $\bar{x}$ of the deterministic dynamics, in which case individually $b(\bar{x})=0$ and $\theta=0$ . This necessitates at the same time that the transition time $T$ diverges to $\infty$ , and furthermore that the Hamiltonian vanishes, $H(\phi,\theta)=0$ . This property can in turn be used to rewrite the action functional (3) as

[TABLE]

Writing the action in this form is known as the Maupertuis principle in mechanics, and it offers an approach at solving the double minimization problem to calculate the quasipotential in (7).

Interestingly, there is a parallel between the LDT discussed above and concepts from field theory applied to stochastic systems. Specifically the Janssen-de Dominicis formalism Janssen (1976); de Dominicis (1976) based on the Martin-Siggia-Rose path integral Martin, Siggia, and Rose (1973) considers computing expectations as path-integrals over all possible noise realizations, and performs a change of variables to the field variable itself. The constraint of the dynamics is embedded as Lagrange multiplier, which gives rise to an additional auxiliary field, corresponding to the conjugate momentum. Similarly, the minimization problem (5) then amounts to finding a semiclassical trajectory as saddle-point approximation of the action functional. It is this correspondence which is the root of the terms “action functional” and “instanton” for the rate function and its minimizer. Noteworthy in this context is also the Doi-Peliti formalism Doi (1976); Peliti (1986), which follows a similar route for dominant reaction pathways.

I.3 Detailed balance and gradient flows

A special case of interest is when the dynamics is a diffusion in a potential, with the drift given by the negative gradient of a potential $U:\mathds{R}^{d}\to\mathds{R}$ and $\sigma=\sqrt{2}\text{Id}$ , such that equation (1) becomes

[TABLE]

Suppose that we look at the calculation of the quasipotential (7) between two local minima of $U$ located at $x_{a}$ and $x_{b}$ and with adjacent basins of attraction. In this case, the minimum of (7) is approached by taking either $\dot{\phi}=-\nabla U(\phi)$ (in which case the action is zero), or we realize that

[TABLE]

Now, since the last terms depend only on the trajectory end-points, we are free to choose $\dot{\phi}=\nabla U(\phi)$ to make the first integral disappear. As a consequence, for a diffusion in a potential landscape of the form (21), to calculate the minimum involved in the quasipotential, we can patch together the solutions of $\dot{\phi}=\pm\nabla U(\phi)$ that connects $x_{a}$ and $x_{b}$ via the saddle point $x_{s}$ of minimum potential. We can interpret that to say that the minimum is achieved by following either the deterministic dynamics $\dot{\phi}=-\nabla U(\phi)$ , or its time-reversed version. This is nothing but a manifestation of time-reversal symmetry that is the consequence of the random process defined by equation (21) being in detailed balance.

This simple relationship between the tangential vector $\dot{\phi}$ and the deterministic drift $\nabla U$ simplifies the computation of the minimizers significantly. In particular, we realize that minimizers for gradient flows are heteroclinic orbits of the deterministic drift. As such, they are numerically accessible by the string method E, Ren, and Vanden-Eijnden (2002, 2007).

Similar simplifications as the above can be realized for any system in detailed balance, and as a results, its large deviation minimizers are always heteroclinic orbits of the associated generalized gradient flow (but not necessarily of a traditional gradient flow) Grafke (2018).

II Rare event algorithms for noise-induced transitions

In this section, we want to consider a particular sub-class of problems of the form discussed in section I.1: The computation of the optimal noise-induced transition trajectory from one basin of attraction of the deterministic dynamics to a neighboring one. In this form, the minimization problem (5) becomes

[TABLE]

where $\mathscr{C}_{x,y}=\{\phi\in C([0,T],\mathds{R}^{d})\ |\ \phi(0)=x,\phi(T)=y\}$ , and the instanton constitutes the maximum likelihood transition trajectory between the two deterministically stable fixed points $x$ and $y$ . By additionally minimizing over the transition time $T$ , the resulting instanton can be used to compute the quasipotential (7). This calculation can be performed by applying the minimum action method E, Ren, and Vanden-Eijnden (2004), that discretizes the action functional (3), and considers the discrete (finite dimensional) gradient as descent direction for numerical minimization algorithms, such as gradient descent or quasi-Newton methods. In section II.1, we present a simplified version of the minimum action method, and discuss its implementation details. In section II.2, this method is then illustrated by applying it to a simplified metastable climate model. Finally in section II.3 we discuss generalizations to stochastic partial differential equations, and consider the example of the stochastic Burgers-Huxley model in section II.4.

II.1 A simplified geometric minimum action method

One obvious disadvantage of a straightforward discretization of the Freidlin-Wentzell action functional is its inability to treat infinite transition times. In the context of the quasipotential, we are looking for transition trajectories of arbitrary transition time $T$ , which generally diverges, $T\to\infty$ , since the trajectory contains fixed points. The minimum of the outer minimization in the computation of the quasipotential,

[TABLE]

is simply not attained for any finite $T$ in these cases. This complication was successfully addressed with the geometric minimum action method Heymann and Vanden-Eijnden (2008), which instead considers a minimization over the space of arc-length parametrized curves that may remain finite even for diverging transition time. In this section, we want to introduce a simplified version of this geometric picture, allowing us to formulate an algorithm to compute of the geometric minimizer with a lower number of derivatives of the Hamiltonian.

Based on the Maupertuis principle (20), the minimizing trajectory $\phi$ between two fixed points $x$ and $y$ for arbitrary transition time $T$ fulfills $H(\phi,\theta)=0$ , and the corresponding action (20) is given by

[TABLE]

This form of the action makes it obvious that the action is invariant under reparametrization: The total action is a line-integral along the minimizer, and we are free to choose any parametrization to describe it. This enables us to treat infinite time-intervals with finitely many discretization points, for example by parametrizing (24) by normalized arc-length.

The minimization problem (5) can be rewritten as a nested optimization problem,

[TABLE]

with

[TABLE]

Here the prime denotes differentiation with respect to the parametrization $s$ we choose for $\phi$ and $\theta$ , and we impose $\|\phi^{\prime}\|_{\sim}=L=\text{const}$ , with $L$ the length of the curve. Note that the algorithm works independently of the choice of the norm, and we will discuss appropriate norms at the end of this section. Therefore, in the following, the norm $\|\cdot\|_{\sim}$ and corresponding inner product $\langle\cdot,\cdot\rangle_{\sim}$ are to be seen as a placeholder for our preferred choice.

Let

[TABLE]

and $\theta^{*}(\phi)$ be the solution of the inner optimization problem (27), such that $E(\phi,\theta^{*}(\phi))=E^{*}(\phi)$ . Then, equivalently, $\theta^{*}(\phi)$ fulfills the Euler-Lagrange equation for the constrained maximization problem (27). Using $\delta E(\phi,\theta)/\delta\theta=\phi^{\prime}$ , this Euler-Lagrange equation reads

[TABLE]

where $\mu(s)$ , $s\in[0,1]$ , is a Lagrange multiplier to enforce the constraint of a vanishing Hamiltonian. This Lagrange multiplier is explicitly computable by multiplying equation (28) by $\phi^{\prime}$ and solving for $\mu$ , i.e.

[TABLE]

Similarly, using $\delta E(\phi,\phi)/\delta\phi=-\theta^{\prime}$ the functional derivative of $E^{*}(\phi)$ with respect to $\phi$ can be expressed as

[TABLE]

where the last step makes use of

[TABLE]

which holds by definition due to $H(\phi,\theta^{*}(\phi))=0$ .

Note how in this formulation the reparametrization into arc-length emerges naturally as Lagrange multiplier $\mu$ to enforce the Hamiltonian constraint. In particular, comparing equation (28) with Hamilton’s equation with respect to physical time, $d\phi/dt=\nabla_{\theta}H(\phi,\theta)$ shows that the Lagrange multiplier $\mu$ is nothing but the change of parametrization, $\mu=dt/ds$ from physical time to arc-length parametrization.

Taking these equivalences, the nested optimization problem (25) can now be solved in an iterative manner. Starting from a the $k$ -th guess $\phi^{k}$ for the transition trajectory,

(i)

solve the inner constrained optimization problem

[TABLE] 2. (ii)

compute a descent direction for the outer optimization problem,

[TABLE] 3. (iii)

descent along the descent direction, for example by gradient descent, pre-conditioned with $\mu^{-1}$ , and step-length $\alpha$ ,

[TABLE]

to obtain the next guess $\phi^{k+1}$ , and finally 4. (iv)

iterate until convergence.

For the specific case of the small-noise Gaussian SDE, equation (1), this algorithm can be even more simplified. In particular, the inner constrained optimization problem to find $\theta^{*}(\phi)$ can be solved analytically, instead of relying on numerical optimization. Taking the Euler-Lagrange equation (28) for the inner optimization problem, together with the specific form of the Hamiltonian (17), yields

[TABLE]

so that

[TABLE]

On the other hand, the Lagrange multiplier $\mu$ is directly available without knowledge of $\theta^{*}$ : Since

[TABLE]

we conclude that

[TABLE]

(35) naturally leads to the choice $\|\cdot\|_{\sim}=\|\cdot\|_{a}$ . The descent direction is then immediately available as

[TABLE]

with $\theta^{k}$ and $\mu$ given by (33) and (35), respectively.

We want to make a few points about possible pitfalls and best practices.

•

Even though any parametrization $s(t)$ is permissible, as discussed above it is natural to choose arc-length, such that $\|\phi^{\prime}\|_{\sim}=\text{const}$ . This parametrization can be enforced, as in the improved string method E, Ren, and Vanden-Eijnden (2007) and the original geometric minimum action method Heymann and Vanden-Eijnden (2008), by interpolation along the trajectory. This avoids stiff terms enforcing the parametrization constraint.

•

Pre-conditioning is necessary to obtain good convergence. Pre-conditioning with $\mu^{-1}$ is necessary to ensure convergence around fixed points. Additionally, pre-conditioning with $\nabla_{\theta}\nabla_{\theta}H$ is often beneficial. This corresponds to the noise covariance $a$ in the SDE case. For general Hamiltonians, this comes at the cost of needing to compute higher derivatives of the Hamiltonian, which one might want to avoid. Details about these considerations are discussed in Grafke, Schäfer, and Vanden-Eijnden (2017).

•

The choice of norm has to be taken with care as well. For the additive Gaussian case, as pointed out above, it is natural to use $\|\cdot\|_{\sim}=\|\cdot\|_{a}$ . This generalizes to $\langle\cdot,(\nabla_{\theta}\nabla_{\theta}H)^{-1}\cdot\rangle$ , which is the choice of the traditional gMAM. For simplicity, the Euclidean norm might be preferred in the general case to avoid the computation of higher order derivatives of $H$ .

II.2 Example: Metastability in a simple climate model

We want to demonstrate the effectiveness of the algorithm introduced in section II.1 by applying it to a problem motivated by meta-stability in a simplified climate model introduced by Charney and DeVore Charney and DeVore (1979). Starting from the two-dimensional barotropic vorticity equation for the atmospheric flow, a projection on the 6 dominant Fourier modes is performed, resulting in an SDE system

[TABLE]

where, for $m\in\{1,2\}$ ,

[TABLE]

The original model is detailed in Charney and DeVore (1979), and was modified in (36) to add additive Gaussian noise to each degree of freedom. The model (36) allows for two metastable states, the so-called “zonal” state, and the “blocked” state, alluding to the atmospheric blocking phenomena observed in meteorology.

Application of the action minimization algorithm introduced in section II.1 allows us to compute the most likely transition trajectories in the small noise limit, $\varepsilon\to 0$ , and deduce the relative stability of the states. The results are shown in figure 1: Starting from the zonal state in the upper left corner, snapshots of the streamfunction $\psi(x,y)$ are shown along the transition trajectory in lexicographic order, arriving at the blocked state. For comparison, the right collection of plots in figure 1 shows the corresponding backward transition from blocked to zonal state. Note that the backward transition is not merely the time-reversed forward transition, implying (as expected) a breaking of time-reversal symmetry and thus demonstrating the non-equilibrium nature of the transition. The relative stability of the two configurations can be quantified via figure 2: The action to transition towards the blocked state is far larger than the action to transition towards the zonal state, meaning that the zonal state is exponentially preferred in the low-noise limit.

II.3 Instantons for stochastic partial differential equations

Many systems of interest in physical applications have continuous spatial variables, i.e. do not fit the framework of equation (1). Instead, they are stochastic partial differential equations (SPDEs). Applying the algorithm of section II.1 to stochastic processes in infinite-dimensional spaces is nevertheless largely done in practice. The mathematical foundation is less clear in this case, though, and a few comments are in order.

A stochastic partial differential equation, even in the simplest case of additive Gaussian noise, is possibly ill-posed. Consider for example

[TABLE]

where $U:[0,T]\times\mathds{R}^{d}\to\mathds{R}^{m}$ and $\eta$ denotes temporal white noise. If the noise is also not smooth in space, for example if it is white-in-space as well, $\mathds{E}\eta(x,t)\eta(x^{\prime},t^{\prime})=\delta(t-t^{\prime})\delta(x-x;)$ , it is a non-trivial undertaking to make sense of possible non-linear terms in the drift, $B(U)$ , especially if the spatial dimension is higher than one. Recent mathematical breakthroughs Hairer (2014) specify a rigorous renormalization procedure in specific cases. In regards to LDT, the main concern is whether this renormalization procedure subsists in the limit $\varepsilon\to 0$ . For example, in Hairer and Weber (2015), it was discussed for the stochastic Allen-Cahn equation (e.g. $B(U)=U-U^{3}+\kappa\partial_{x}^{2}U$ ) in 2 or 3 spatial dimensions, that indeed the rate function corresponds to the naively assumed one,

[TABLE]

where $\|\cdot\|_{L^{2}}$ denotes the $L^{2}$ -norm in the spatial components. In the following, we will consider SPDE examples, but no longer dwell upon the mathematical intricacies, instead assuming that (39) is valid.

If the rate function takes the form (39), all arguments put forward in the finite dimensional case can be transferred to the SPDE situation, and a corresponding algorithm can be constructed. In particular, a gradient descent of the form introduced in section II is still feasible, with gradients of vectors replaced by functional derivatives of the corresponding operators. Therefore, equation (31) to compute the descent direction for the SPDE (38) becomes

[TABLE]

where $A^{\top}$ is the $L^{2}$ -adjoint of the (differential) operator $A$ and $D_{\phi}$ the functional derivative. Consider for example Burgers equation with periodic boundary conditions, where

[TABLE]

Then $D_{\phi}B(\phi)$ is the operator

[TABLE]

such that

[TABLE]

Recall that we can still compute $\theta^{*}$ , i.e. the minimizer of the inner constrained optimization problem (27) via

[TABLE]

where the last equality holds for the Burgers example with spatio-temporal white noise.

In practice, equation (40) has to be rewritten in order to be practical for the SPDE case, because the involved high spatial derivatives come with stability conditions (CFL conditions) that limit the time-step of the descent, and therefore the convergence rate of the scheme. A detailed discussion of tricks and optimizations for the SPDE case is given in Grafke, Schäfer, and Vanden-Eijnden (2017).

II.4 Example: The stochastic Burgers-Huxley equation

As example for a nonlinear SPDE, we consider the stochastic Burgers-Huxley model,

[TABLE]

where $\alpha>0$ determines the strength of the nonlinear advection term, $\kappa>0$ is the diffusion constant, and the boundary conditions are periodic. The field $\eta(x,t)$ is spatio-temporal white noise. For $f(u)=0$ , this equation is the stochastic Burgers equation, arising in compressible gas dynamics, traffic flows, and as test-bed for turbulence. With the inclusion of a double-well reaction term $f(u)=u-u^{3}$ , the equation becomes metastable, with two spatially homogeneous stable fixed points at $u=-1$ and $u=1$ . The spatially homogeneous solution $u=0$ is a fixed point as well, but depending on the size of $\kappa$ might not be a saddle point with a single unstable direction. Instead, for small enough $\kappa$ , we expect Allen-Cahn like nucleation dynamics, but the nucleation must happen as a Burgers-like steepening shock wave. Indeed, as figure 3 shows, this intuition is confirmed by the numerics: The creation of the nucleus happens in a spatially asymmetric way, and the nucleating seed travels in space. The spatial resolution for the numerical computation is $N_{x}=256$ , while the temporal resolution is $N_{s}=100$ .

III Rare event algorithms for expectations and extreme events

In section II we discussed how to compute noise-induced transition trajectories in bi-stable systems and thereby estimate the rate of transitions between the two metastable states and their relative likelihood. We chose to implement a global minimization procedure based on the Maupertuis principle form of the action functional to use the independence of the choice of parametrization to our numerical advantage. Nevertheless, because both the initial and the final conditions of the transition trajectory are fixed, we were unable to harness the Hamilton’s equations of motion directly (these would have to be solved by shooting methods, which are inefficient or even ill-posed).

In this section, instead, we will concentrate on situations where it is indeed feasible to solve the minimization problem by integrating the coupled pair of equations of motion, or instanton equations, to obtain the large deviation minimizer. As we will see, if applicable, this approach comes with a couple of advantages of both theoretical and numerical nature. In this section we will therefore first review the class of algorithms based on solving Hamilton’s equations in section III.1, that can be used to compute instantons for expectations dominated by extreme events. Examples are shown in section III.2 applying this algorithm to a system with multiplicative Gaussian noise, and furthermore in section III.3 demonstrating the use in an infinite dimensional system, with the additional complication of degenerate forcing (i.e. non-invertible noise covariance matrix). We discuss connections to other fields in section III.4 and numerical details in section III.5. A geometric variant of the numerical scheme is introduced in section III.6, and implemented for an example case in section III.7.

III.1 Instantons for expectations and extreme events

For the stochastic process $X_{t}^{\varepsilon}$ of equation (1),

[TABLE]

consider the random variable $F(X_{T}^{\varepsilon})$ , where $F:\mathds{R}^{d}\to\mathds{R}$ . This random variable, also termed the “observable”, acts only on the final configuration of the process. We are interested in estimating the tail scaling of its probability density, i.e. in quantifying the likelihood of extreme values of the observable. For example, assume that $X_{t}^{\varepsilon}$ is a stochastic model describing the interaction of predator and prey in a habitat (cf. section III.2). We set out to find the probability of observing an abundance of prey. We might additionally be interested in the most likely amount of predators at this unusual configuration, and the historic development into this event. Or we have a stochastic description of waves (cf. section III.3), and are interested in the probability of observing high amplitude waves. Additionally we might ask for the most likely shape of the wave at the moment of extreme elevation, or possibly identify the evolution into the extreme wave event to analyze it for possible mechanisms leading to the amplification.

From the discussion in section I we understand that in the limit $\varepsilon\to 0$ , the probability of observing the event $F(X_{T}^{\varepsilon})=z$ , subject to $X_{0}^{\varepsilon}=x$ , fulfills

[TABLE]

where $\mathscr{C}_{z}=\{\phi\in C([0,T],\mathds{R}^{d})\ |\ \phi(0)=x,F(\phi(T))=z\}$ , i.e. the set of continuous trajectories starting and $x$ that observe the event. Let

[TABLE]

and define

[TABLE]

with $\mathscr{C}=\{\phi\in C([0,T],\mathds{R}^{d})\ |\ \phi(0)=x\}$ . Here, minimization is not constrained at the final point, i.e. $\mathscr{C}$ describes the set of continuous trajectories starting at $x$ regardless of their final point. The Gärtner-Ellis theorem states, that $I^{*}(\lambda)$ and $I(z)$ are Fenchel-Legendre duals. This can be motivated by rewriting

[TABLE]

Effectively, the connection between $I(z)$ and $I^{*}(\lambda)$ allows us to solve the minimization problem (48) instead of the original problem (47). In terms of Hamilton’s principle, the variations of the argument of the infimum in equation (48) with respect to $\phi$ gets one additional term that only applies at the final point, so that the Hamilton’s equations become

[TABLE]

The difference with Hamilton’s equations of the problem discussed in section II

[TABLE]

appears minuscule, but is profound: The $\phi$ -equation in (50) has to be solved with initial and final condition, and therefore necessitates shooting methods which are inefficient in high dimension (hence the alternative approach we took in section II). For (49), on the other hand, the equations for both $\phi$ and $\theta$ have exactly one boundary condition each. It is natural to integrate the $\phi$ -equation forward in time, starting at $x$ , while integrating the $\theta$ -equation backward in time, starting at $-\lambda\nabla F(\phi(T))$ . This direction of integration is the only sensible one in the first place: Due to the conjugate momentum equation containing the term $-(\nabla b(\phi))^{\top}$ , a numerical integration forward in time would be numerically unstable or even ill-posed. An algorithm to find the instanton in this case then consists of the following steps: Starting from the $k$ -th guess $\phi^{k}(t)$ for the instanton trajectory,

(i)

solve the equation

[TABLE]

backward in time, 2. (ii)

solve the equation

[TABLE]

forward in time to obtain the next guess $\phi^{k+1}$ , 3. (iii)

iterate until convergence.

The convergence properties, stability and possible improvements of this algorithm are discussed in section III.4. Considering the dual problem (48) instead of the original one (47) comes at a price: Instead of choosing directly the value $z$ of the observable, instead we prescribe its dual $\lambda$ , and obtain the corresponding value of $z$ a posteriori. In other words, we loose the capability of computing the instanton for a specific observable $z$ . In practice, this is usually not a problem, even though the map $z(\lambda)$ is not available in general: Typically one is interested in the complete distribution of $P(z)$ , and therefore producing instantons for a whole range of $\lambda$ similarly covers a whole range of $z$ . Alternatively, a self-correcting version of the algorithm is easily implemented, where $\lambda$ is adjusted on-the-fly to achieve the desired outcome $z$ .

Note that $I^{*}(\lambda)$ is nothing but the limit of the scaled cumulant generating function of the random variable $F(X_{T}^{\varepsilon})$ , i.e.

[TABLE]

In this interpretation, we could call the instanton solving (49) also the instanton corresponding to the expectation

[TABLE]

in the limit $\varepsilon\to 0$ . It is similarly possible to define observables not only on the final point of the trajectory, but for example of the form

[TABLE]

and perform similar arguments, leading to additional drift terms in the conjugate momentum equation.

Finally, while the above arguments rigorously hold under suitable conditions in the limit $\varepsilon\to 0$ , it is common to loosen conditions on the stochastic process and consider the case $\varepsilon$ fixed, but $\lambda\to\infty$ . The intuition is that for large $\lambda$ only extreme events of the process are considered, and a large deviation principle might hold for the observable even for finite noise. One can then write down an a priori large deviation principle for the random variable $F(X_{T}^{\varepsilon})$ and compute the instanton for large values of $\lambda$ to probe the tail of the probability to observe the event. It is in this sense that this approach can be considered as instantons for extreme events. They are commonly used in practice, for example in fluid dynamics, where an equivalent algorithm has been introduced by Chernykh and Stepanov Chernykh and Stepanov (2001), which was applied to compute instantons for Burgers, Navier-Stokes, and KPZ equations Grafke, Grauer, and Schäfer (2013); Grafke et al. (2015); Grafke, Grauer, and Schäfer (2015); Zarfaty and Meerson (2016).

III.2 Example: Extreme concentration of prey in the Lotka-Volterra model

The Lotka-Volterra system, or predator-prey system, is frequently used in biology as the simplest description of the interaction of two species, one of which preys on the other. In its typical form, it is considered without any fluctuations, but as it can be understood as continuous limit of a network of reactions, it is clear how a noise term can be derived as chemical Langevin equation.

To this end, consider a habitat with two species, the prey $A$ and the predator $B$ , where $A,B\in\mathds{Z}^{+}$ denotes the number of individuals of the respective species. We want to consider interactions between individuals, modeled by the stoichiometric reaction network

[TABLE]

Each of these is to be understood as a Poisson process with rates $\alpha$ to $\delta$ . The first three are standard in Lotka-Volterra, the last two are added to model migration of both species from neighboring habitats towards the considered location. This prevents degeneracy of the forcing at extinct population levels and the difficulty of absorbing boundary conditions.

Under the assumption that the typical populations $N$ are sufficiently large, $\varepsilon=N^{-1}\to 0$ , the chemical Langevin equation corresponding to the reaction network (56) is

[TABLE]

where $a,b$ are functions from $[0,T]$ into $\mathds{R}^{+}$ , denoting the concentration of predator and prey in the habitat. The stochastic fluctuations are white-in-time Gaussian and zero mean, but notably multiplicative. Note that while this noise term indeed is consistent with a central limit theorem for $N\to\infty$ , it is actually not true that this approximation is valid for large deviations as well. In general, the LDP is sensitive to the non-Gaussian nature of the stochastic process defined in (56), which has Poisson statistics. We explain how to treat such non-Gaussian systems correctly in section IV, while here, for simplicity, we are considering the multiplicative Gaussian SDE (57) as given.

We choose the interaction rates $\alpha,\beta,\gamma,$ and $\delta$ in a way that there exists a unique fixed point $(\bar{a},\bar{b})$ at which concentrations of predators and prey are in equilibrium. Concretely, we take $\alpha=1$ , $\beta=5$ , $\gamma=1$ , $\delta=0.1$ . Changing these parameters can produce more complicated attractors, such as limit cycles, which we will not investigate here. Instead, we are interested in the question of how unlikely high concentrations of prey develop on different time frames $T$ when the system starts at the fixed point $(\bar{a},\bar{b})$ . For that reason, we choose $F(a,b)=a(T)$ , i.e. condition on high values of $a(T)$ , regardless of $b(T)$ .

Since this is the first time we encounter multiplicative noise, a few comments are in order. For a system of the form

[TABLE]

with $a(x)=(\sigma\sigma^{\top})(x)$ , the Hamiltonian is

[TABLE]

so that an additional term enters the equation for the conjugate momentum,

[TABLE]

where the last term is to be understood as $(\langle\theta,\nabla_{\phi}a(\phi)\theta\rangle)_{i}=\sum_{j,k}\theta_{j}\nabla_{\phi_{i}}a_{jk}\theta_{k}$ . Consequently, the instanton equations for the (stochastic) Lotka-Volterra model are

[TABLE]

which have to be solved with the boundary conditions $(a(0),b(0))=(\bar{a},\bar{b})$ and $(\theta_{a}(T),\theta_{b}(T))=-\lambda\nabla F(a,b)=(-\lambda,0)$ .

Figure 4 shows the result of applying the algorithm of section III.1 to this system, and comparing to Monte-Carlo sampling. Here, two different transition times are chosen, $T=1$ and $T=10$ . For $T=10$ , the system has enough time to explore around the fixed point, but it is obvious that the last portion of the excursion, before it hits $a(T)=0.35$ , clusters around the instanton trajectory. In particular, the $b$ -coordinate at which $a=0.35$ is attained seems to be predicted reasonably well. For $T=1$ , instead, the transition trajectory needs to follow a different route, and the endpoint $a=0.35$ will most likely be attained at higher concentration of predators. Some points, such as $(a,b)=(0.25,0.3)$ , are almost never visited for $T=10$ , but are very likely under $T=1$ , which is correctly predicted by the instanton computation. Note that the heat-map depicting the empiric probability density in figure 4 has a logarithmic color-map for the tails to remain visible. Deviations from the optimal path are therefore very unlikely indeed.

Parameters for $T=1$ are $\Delta t=10^{-2}$ , $\varepsilon=0.005$ and $\lambda=0.4209$ , and for $T=10$ are $\Delta t=10^{-2}$ , $\varepsilon=0.004$ , and $\lambda=0.2106$ . Roughly $5\cdot 10^{6}$ trajectories were sampled for the Monte-Carlo estimate.

III.3 Example: Extreme amplitudes in the Korteweg-de Vries equation

Consider for the field $u(x,t):[0,2\pi]\times[0,T]\to\mathds{R}$ the stochastic partial differential equation

[TABLE]

with periodic boundary conditions in space, and $x\in[0,2\pi]$ . This is a modification of the standard Korteweg-de Vries equation that describes the evolution of shallow water surface waves. To this, we added energy input through the forcing $\eta$ and energy dissipation through a diffusion term with viscosity $\nu$ . For the forcing, we demand that, in Fourier space,

[TABLE]

where $\hat{\chi}:\mathds{Z}\to\mathds{R}$ is the forcing spectrum and $\hat{\eta}_{k}$ is the $k$ -th mode of the Fourier transform of $\eta$ .

Intuitively, for a $\hat{\chi}_{k}$ with compact support only for small $k$ , the forcing $\eta(x,t)$ inserts energy on large scales, and the nonlinearity transfers those to smaller scales, on which dispersion and dissipation act on them. We are interested how this nonlinear cascading effect produces waves of extreme amplitude. To this effect, we choose an observable

[TABLE]

with $\phi_{\Delta}(x)=A\exp\left(-x^{2}/\Delta^{2}\right)$ , and $\star$ denoting spatial convolution. For small $\Delta$ , this observable selects high amplitudes in close proximity to $x=0$ , i.e. at the center of the domain, and therefore generates high wave elevations at this position. As forcing spectrum, we want

[TABLE]

which inserts energy only into the largest mode of the system. This is the first time we consider degenerate forcing, in that only a subset of the available degrees of freedom are forced, or, equivalently, the noise covariance matrix $a$ of (1) is not invertible. This poses practical problems for algorithms based on global minimization discussed in section II, where heavy use is made of either the $a$ -norm, or $\theta$ is expressed as $\theta=a^{-1}(\dot{\phi}-b(\phi))$ . For these algorithms, the degenerate forcing introduces additional stiff constraints for the unforced modes, as those effectively behave deterministically (and thus are attained with infinite action if they deviate from the deterministic behavior). For the Hamilton’s equations, and the algorithm discussed in this section, the noise correlation is never inverted, and degenerate forcing can be treated without extra effort.

The instanton equations corresponding to the posed problem are

[TABLE]

where $\chi(x)$ is the inverse Fourier transform of the forcing spectrum. The instanton computed by solving equations (66) is depicted in figure 5. It is clearly visible that a high final amplitude around $x=0$ is achieved by a combination of non-linear advection and dispersion. Additionally, the final configuration clearly contains Fourier modes different from $|k|=1$ , implying that indeed the non-linearity cascaded energy into higher modes in a way to optimize the final amplitude. Note also that because of $\Delta\ll 1$ , we are merely demanding a large amplitude at $x=0$ , but leave the rest of the wave form unconstrained. The elevation profile in the rest of the domain is chosen in a most likely manner, and the curves shown in figure 5 can be interpreted as the prototypical way of forming the considered amplitude. The model parameters are $T=1$ , $\Delta=10^{-1}$ , $\alpha=\kappa=4\cdot 10^{-2}$ , $\lambda=1$ , and $A=0.25$ . The numerical parameters are $N_{x}=256$ , $N_{t}=1000$ , and $\Delta t=10^{-3}$ .

III.4 Connections to optimal control

It is instructive to formulate the optimization problem (46) in the language of optimal control: We are interested in finding the optimal control $p:[0,T]\to\mathds{R}^{d}$ such that for $X\in\mathds{R}^{d}$ , the system

[TABLE]

has the desired outcome, $F(X(T))=z$ . We penalize large values of $p$ by choosing

[TABLE]

as cost function. In other words, we are searching for the optimal noise realization $p$ to drive the system into a final state where $F(X(T))=z$ . To obtain a minimization procedure that honors the constraints given by the observable and equation (67), we introduce Lagrange multipliers $\mu\in[0,T]\times\mathds{R}^{d}$ and $\lambda\in\mathds{R}$ , such that we attempt to minimize

[TABLE]

Its total variation is given by

[TABLE]

We can read of the desired conditions to fulfill the constraints as

[TABLE]

and the gradient of the cost functional $E(p)$ with respect to the control $p$ is then given as

[TABLE]

We immediately identify that the conjugate momentum $\theta$ is the variable $\mu$ in optimal control, often termed the adjoint variable. Second we realize that the forward and adjoint equations are identical to the instanton equations. Therefore, the iterative algorithm given in section III.1 is nothing but a gradient descent for the cost functional $E(p)$ , with step length $1$ . This not only answer questions about (local) convergence of the algorithm of section II.1, but furthermore allows to improve stability and order of convergence of the algorithm. First, it is almost always necessary to adjust the step size for each iteration according to a line search strategy to achieve convergence. Second, one might consider pre-conditioning, to allow for faster convergence. Lastly, the computation of the descent direction $-\delta E/\delta p$ from equation (70) allows to construct higher order optimization algorithms, such as nonlinear conjugate gradient or quasi-Newton methods.

Note that, similar to the argument above, for practical reasons we choose to not consider variations with respect to $\lambda$ , and instead consider $\lambda\in\mathds{R}$ given a priori to establish a mapping $\lambda(z)$ from $z(\lambda)=F(X^{*}(T))$ , where $X^{*}(T)$ depends on $\lambda$ through the boundary condition of the adjoint equation (69).

III.5 Improvements and implementation considerations

A few remarks are in order to point out possible improvements and implementation concerns when solving Hamilton’s equations.

•

While solving a global minimization problem as introduced in section II necessitates a complicated procedure to compute descent directions, the solution of Hamilton’s equations put forward in this section usually comes at a much lower implementation cost: Given a stochastic problem at hand, one likely has already available an efficient solver of the forward equation, just replacing stochastic noise with a function of the conjugate momentum. The backward equation (auxiliary equation, adjoint equation), on the other hand, is often available as well for professional software packages, usually from automatic differentiation, in order to quantify the uncertainty from the adjoint field $\mu$ . In this case, a computation of the instanton might be achieved in a truly “black-box” form, where the iterative solution of the Hamilton’s equations can be considered as a pure outer problem.

•

The mutual dependency of the forward and backward equations necessitates in principle that the whole trajectory is stored in its entirety. While this is usually feasible for finite-dimensional problems, it quickly becomes prohibitive in terms of memory requirement when talking about SPDEs in many dimensions, where usually the storage requirements are chosen to be of the order of magnitude of a single state of the field variables, and not a continuous trajectory. In optimal control, this restriction is usually overcome via checkpointing mechanisms, where one does not save to memory a complete trajectory, but instead only retains checkpoints, from which subsequent states can be deduced. In its most efficient form, this checkpointing can be implemented in a recursive manner, leading generally to memory requirements of $O(\log N_{t})$ instead of the naive $O(N_{t})$ , where $N_{t}$ is the number of discretization points in time. Some details of the application of this method to instanton computations is laid out in Grafke, Grauer, and Schindel (2015).

•

The nature of the conjugate momentum as adjoint variable, as discussed in section III.4, highlights its adjointness to the main field variable. Would one discretize the action, and consider forward and backward equation as their discrete variation, this state of affairs would necessitate the usage of a special temporal integrator for the numerical solution of the adjoint equation, namely a temporal integrator that is adjoint to the forward integrator. Some time integrators have the property of being self-adjoint, and thus can be used for both equations. The failure to use a correct pair of integrators generally results in the failure to converge to the minimum of the cost function.

III.6 Geometric version of the Hamilton’s equations

As discussed in section I, many questions, including the computation of the quasi-potential, necessitate a minimization not only over all possible paths $\phi(t)$ , but also over all possible time intervals $T>0$ —for example, this is needed to calculate expectations with respect to the invariant measure of the process, assuming it exists. For MAM-type algorithms, the additional complication the minimization over $T>0$ introduces is resolved by invoking Maupertuis principle and focusing on the computation of the geometric minimizer, i.e. realizing that the minimizing trajectory can be computed without explicit reference to its parametrization.

For algorithms based on the Hamilton’s equation, similar ideas and extensions exists Grafke et al. (2014): Instead of solving the original Hamilton’s equations (16), we can choose a reparametrization $s(t)$ , and consider Hamilton’s equations in this parameter,

[TABLE]

where the prime denotes derivatives with respect to $s$ and $\mu=dt/ds$ . Now, as pointed out in section II.1, $\mu$ can be interpreted as Lagrange multiplier enforcing the Hamiltonian constraint, and is available as

[TABLE]

where the much simpler second form only holds for Gaussian additive noise of the form (1) (compare section II.1).

III.7 Example: Extreme gradients of the stochastic Burgers equation

As an example, following Grafke et al. (2014), consider for the field $u(x,t):[-L/2,L/2]\times[-T,0]\to\mathds{R}$ the stochastic Burger’s equation,

[TABLE]

with periodic boundary conditions in space. Here, we consider a noise term $\eta$ that is white in time, but has a finite correlation length in space,

[TABLE]

and we prescribe the specific correlation in Fourier space of

[TABLE]

where $\mathcal{H}$ denotes the Heaviside step function. In effect, the forcing correlation has the shape of a “Mexican hat” function, with cut-off wave number $k_{c}$ . Effectively, equation (73) can be considered as a test-bed for turbulence, where energy is inserted on large scales due to the forcing, and then cascades to smaller scales via the nonlinearity, where it dissipates. We focus on events that lead to a strong negative gradient at the final time, $t=0$ , and therefore choose an observable

[TABLE]

where, identically to the KdV-case in (64), $\phi_{\Delta}$ mollifies on scales $\Delta$ , so that here we concentrate on high gradients at the origin.

In order to probe for events on the invariant measure of (73), we want to consider the limit $T\to\infty$ , and therefore need to either consider extremely large time intervals $T$ , or alternatively employ the geometric variant of the Hamiltonian formalism as proposed in (71), where the norm is induced by the covariance $\chi(x)$ , i.e. $\|v\|=\langle v,\chi^{-1}(x)v\rangle^{1/2}$ on its support. For technical details, see Grafke et al. (2014). Given this setup, figure 6 compares the final condition of the instanton between finite times $T$ and the limit $T\to\infty$ obtained in the geometric formalism. It demonstrates how, for choices $T=10$ or $T=100$ , unphysical secondary maxima are present (compare inset of figure 6), that disappear in the infinite time case, and with $T=1000$ . Similarly, one can look at the value of the Hamiltonian $H(u,p)$ , which necessarily disappears for large $T$ , $\lim_{T\to\infty}H=0$ , and consider the quantity $\max(|H|)$ along the instanton trajectory as measure of the numerical error of the discretization. In this metric, the geometric variant for the same number of discretization points in time, is roughly $10^{4}$ times more accurate than the naive parametrization with physical time Grafke et al. (2014).

IV Generalizations to the non-Gaussian case

In all above considerations and examples, we always considered the presence of an LDP for an SDE, either with additive or with multiplicative Gaussian noise. The class of stochastic systems treatable with algorithms of the above form is far larger, though, as is evidenced by the fact that both MAM-style global minimization of section II and the algorithms based on the solution of Hamilton’s equations of section III are written in terms of a generic large deviation Hamiltonian. In this section, we therefore intend to broaden the scope by demonstrating how more generic large deviation Hamiltonians are obtained, and corresponding instantons can be computed with the above algorithms. In particular, the Gaussianity of the underlying stochastic process is reflected in the fact that the Hamiltonian is quadratic in its conjugate momentum. Other processes, most notably those that result from limits of continuous time Markov jump processes (MJPs), generally lead to a non-quadratic Hamiltonian.

IV.1 Large deviation principles as WKB approximation

Consider a homogeneous continuous time Markov jump process $X_{t}$ , $t\in[0,T]$ , with state space $\mathcal{E}$ . The process is completely characterized by its generator $\mathcal{L}$ , which allows us to write down its backward Kolmogorov equation (BKE) as

[TABLE]

for

[TABLE]

where $n\in\mathcal{E}$ , $f:[0,T]\times\mathcal{E}\to\mathds{R}$ , and $\phi:\mathcal{E}\to\mathds{R}$ . For concreteness, consider as a state space a counting space $\mathcal{E}=\mathds{Z}_{+}^{N}$ for $N\in\mathds{N}$ different species, where individuals of each species can interact via $R\in\mathds{N}$ independent Poisson processes, manipulating the number of individuals of each species. For each of the $R$ different interactions, the number of individuals changes from $n$ to $n+\nu_{r}$ with a rate $\tilde{a}_{r}(n)$ , $r\in\{1,\dots,R\}$ . Then, the corresponding generator reads

[TABLE]

Rescaling this into new variables $x=n/M$ , where $M$ is a typical number of individuals, we can expand in $\varepsilon=M^{-1}$ to obtain a large deviation principle in the limit of many individuals. To this end, consider the generator in the rescaled variables,

[TABLE]

where $a_{r}$ is defined on the rescaled variables. We now evaluate this rescaled generator onto a function of the form $\exp(\varepsilon^{-1}g(x))$ and rescale time with $\varepsilon$ appropriately. This corresponds to a WKB approximation of the BKE, or equivalently to the method of Feng and Kurtz Feng and Kurtz (2006), and yields

[TABLE]

which can be expanded, to leading order in $\varepsilon$ , into

[TABLE]

Equation (81) can be interpreted as Hamilton-Jacobi equation

[TABLE]

with

[TABLE]

The Hamiltonian (83) is precisely the large deviation Hamiltonian in that the large deviation rate function is the time integral of its Fenchel-Legendre transform. The Hamiltonian (83) is furthermore a prime example of a non-quadratic Hamiltonian.

Note that applying the same method to the generator of the SDE (1),

[TABLE]

recovers exactly the expected Hamiltonian (17) for the leading order in $\varepsilon$ .

IV.2 Example: Genetic switch

As an example for instantons of a non-Gaussian LDP for a continuous time MJP of the form of section IV.1, we want to consider a simplified model of a genetic switch: Inside a bacterium, plasmids contain genes that encode two different proteins, $A$ and $B$ . Each protein is able to form polymers that inhibit the production of the other protein, respectively. In the emerging situation, the cell may exist in a state close to one of two fixed points: Either protein $A$ dominates, and inhibits the production of protein $B$ , while the production of $A$ remains high. Alternatively, protein $B$ dominates, and inhibits the production of $A$ . Rarely, fluctuations may arise that push the system from one fixed point to the other.

We choose to model this system by describing the concentrations of proteins $A$ and $B$ by $a\in\mathds{R}_{+}$ and $b\in\mathds{R}_{+}$ , respectively. There are four reactions in total, namely production and degradation of $A$ and $B$ , leading to the reaction network

[TABLE]

with rates

[TABLE]

The corresponding large deviation Hamiltonian, using equation (81), is thus

[TABLE]

The minimizer for this setup, as well as the relaxation paths from the saddle, are depicted in figure 7. They are computed by implementing the algorithm presented in section II.1 for the Hamiltonian (87) for the transition between the two fixed points. The model parameters here are chosen to be $C=156$ and $D=30$ .

V Systems with random parameters and extreme events

Up to now, all discussions in the previous sections concerned sample path large deviations, where a stochastic process realizes a rare event almost surely by following a trajectory that minimizes the corresponding action functional. In this section, we are focusing on a related, but different setup of a dynamical system with random parameters. Given a distribution of the random parameters, we want to reason about probabilities to observe certain events, and again characterize the rare ones by dominating configurations of parameters. To this effect, consider for $u:[0,T]\to\mathds{R}^{d}$ the dynamical system

[TABLE]

where $\theta\in\Omega\subseteq\mathds{R}^{M}$ is the set of $M$ random real parameters, distributed according to a measure $\mu$ . Since the initial conditions and the drift of equation (88) depend on the random parameters, the solution is a random variable, denoted by $u(\cdot,\theta)$ . We can then try to quantify the probability of an observable exceeding a threshold $z\in\mathds{R}$ , for example at the final time, or integrated over time, or as temporal maximum, i.e.

[TABLE]

V.1 Large deviations for systems with random parameters

Indeed, if in the limit of large $z$ this probability becomes small, $\lim_{z\to\infty}P_{T}(z)=0$ , then under some additional assumptions on $F:\Omega\to\mathds{R}$ we have a large deviation principle of the form

[TABLE]

where the set $\Omega(z)\subseteq\Omega$ is the set of permissible random parameters,

[TABLE]

and the rate function $I(\theta)$ is obtained as the Legendre transform of the cumulant generating function of $\theta$ ,

[TABLE]

for

[TABLE]

The minimizer of $I(\theta)$ in $\Omega(z)$ , i.e.

[TABLE]

dominates the occurrence of the event, and is an instanton in this sense. Since we are considering only finite dimensional $\Omega$ , the corresponding optimization problem (94) has to be solved in a generally smaller search space. Equivalently, here, the instanton is not a preferred trajectory of the system, but instead the maximum likelihood set of parameters that lead to the event. Of course, given any $\theta\in\Omega$ , there is a unique trajectory $u(\theta)$ solving (88) associated to it, so that $u(\theta^{*})$ represents the most likely trajectory of the system to realize the rare event. The proof of the large deviation principle (90) is carried out in Dematteis, Grafke, and Vanden-Eijnden (2018).

From a numerical perspective, we can again solve the constrained optimization problem (94) by instead considering a Lagrange multiplier $\lambda\in\mathds{R}$ . For example, for the first of the three cases in (89), we attempt to minimize the objective function

[TABLE]

Via the Jacobian $J_{ij}(t)=\partial u_{i}(t)/\partial\theta_{j}$ , i.e. the variation of the current configuration with respect to the random parameters, one can express the gradient as

[TABLE]

where the Jacobian is available through

[TABLE]

While integrating the forward equation (88) and the Jacobian (97) allows us to evaluate the gradient for a given $\theta$ , note that $J:[0,T]\to\mathds{R}^{d\times d}$ is quite costly to compute. Instead, we can again fall back to an adjoint formulation to overcome this limitation. To this effect, consider the adjoint field $\mu:[0,T]\to\mathds{R}^{d}$ , subject to the adjoint equation

[TABLE]

Since, using equations (97) and (98), we have

[TABLE]

it follows that

[TABLE]

and thus, the gradient (96) is computable without referring to the Jacobian as

[TABLE]

In total, the gradient of the objective function (95) can be computed at a given $\theta$ in three steps:

(i)

Integrate the forward equation,

[TABLE] 2. (ii)

Compute the adjoint field $\mu$ by integrating

[TABLE] 3. (iii)

and finally, compute the gradient

[TABLE]

A few comments are of note:

•

In contrast to the sections before, in the current setup we are not considering the case of small noise. Instead, the fluctuations are held at fixed amplitude, but we consider events in the limit of infinite threshold, $z\to\infty$ . It is in this limit that the LDP in (90) is obtained, which might lead to a non-standard large deviation speed as a consequence. Therefore, the LDT computation discussed in this section can truly be considered for an extreme event instanton.

•

Again, the formulation in the form of an adjoint equation (98) is often beneficial from an implementation perspective as well: Many software packages for complex systems contain the computation of the adjoint field. Therefore, the computation of the gradient can possibly be achieved in a black-box manner.

V.2 Example: Optimal excitation of the Fitzhugh-Nagumo model

As an example, consider the following version of the deterministic Fitzhugh-Nagumo model,

[TABLE]

If we consider the case $a>1$ , this model is an excitable system, in that there is a unique fixed point $(\bar{x},\bar{y})=(-a,\frac{1}{3}a^{3}-a)$ , but small perturbations out of this fixed point potentially lead to large excursions until the system returns to its steady state. Here, we are interested in the optimal perturbation of the initial condition away from the fixed point to achieve a large excursion. For $\theta\in\Omega=\mathds{R}^{2}$ , we define the distribution of initial conditions as Gaussian centered around the fixed point,

[TABLE]

and take as observable

[TABLE]

i.e. the maximal excursion in the $x$ -component of the trajectory. We want to know $P(z)=P(F(\theta)\geq z)$ , and the corresponding most likely initial conditions (and trajectory) that realize this extreme event. Note that (101) is an observable of the third form of (89), and the algorithm lined out above has to be modified slightly. In particular, for $u(t)=(x(t),y(t))$ we obtain the gradient $\nabla_{\theta}E$ from forward and backward equations, which are, respectively,

[TABLE]

where $t\in[0,t^{*}]$ and $t^{*}$ is the time at which the maximum is reached. We are considering only the first local maximum in time, but the dynamics of (100) are such that the first local maximum necessarily is the global maximum. In this situation, there is no additional term coming from the dependence of $t^{*}$ on the random parameters, since

[TABLE]

and the second term disappears because at $t^{*}$ we have $\partial_{t}u(t^{*})=0$ . The gradient can then be computed as

[TABLE]

for $u_{0}(\theta)=\theta\in\mathds{R}^{2}$ and $\bar{u}=(\bar{x},\bar{y})$ .

Figure 8 shows the result of the minimization procedure: For $\Delta=2,\lambda=1,a=1.1,\nu=10^{-1}$ , the shading indicates the objective function (95) for every initial condition. One can clearly make out the jump in the objective function across the separatrix, where trajectories start exhibiting large excursions. The trajectory starting at the minimum is the one that maximizes the excursion in $x$ -direction, before it decays to the fixed point. The streamlines show the dynamics of the Fitzhugh-Nagumo model (100).

VI Instantons as part of other rare event algorithms

While instantons as prototypical realizations of rare events can be used for their own sake to estimate probabilities, relative stability, and transition mechanisms, they can also be helpful as ingredient to increase efficiency of other types of rare event algorithms. Most notably, whenever rare events are sampled numerically by tilting a given stochastic process to facilitate a rare event in an importance sampling setup, the instanton can be considered as the optimal tilt. In this section, we want to discuss how this can be achieved in practice, and common pitfalls of this strategy: First, in section VI.1, we will show how to use instantons to perform importance sampling for Monte-Carlo methods. In section VI.2, we use instantons to construct weighting functions for genealogical particle algorithms. This will be accompanied by an example computing the probability of infection rates in a stochastic model for epidemiology.

VI.1 Instantons for importance sampling

Consider an expectation of the form

[TABLE]

for a random process $X^{\varepsilon}_{t}\in\mathds{R}^{d}$ , for example the one obeying an SDE like (1). We saw in section III how to compute the corresponding instanton and get the dominating contribution in the limit $\varepsilon\to 0$ . In order to get hold of a proper quantitative estimate of (103), though, one would naively consider a Monte Carlo method with estimator

[TABLE]

where $\{X^{\varepsilon,i}_{t}\}_{i=1}^{N}$ are $N$ independent realizations of the process. This estimator is unbiased, meaning that

[TABLE]

The relative error of this estimator,

[TABLE]

describes the relative variance of the estimator. For example, for $e(\delta_{\varepsilon})=1$ , the typical fluctuations of the estimate are of the size of the estimated value itself. The goal is to achieve a small relative error. In practice, for rare events, one often struggles to even keep $e(\delta_{\varepsilon})$ bounded for $\varepsilon\to 0$ : Even though increasing the number of samples improves the quality of the estimate, $e(\delta_{\varepsilon})\to 0$ for $M\to\infty$ , the relative error increases exponentially for fixed $M$ as $\varepsilon\to 0$ . As a consequence, estimating rare events with the naive estimator (104) is impractical as the variance blows up. The standard answer to this problem is to employ importance sampling, i.e. introducing a new process $Y^{\varepsilon}_{t}$ under which the rare event becomes typical, but accounting for this change of probability measure by correcting with the proper Girsanov-factor. Indeed, considering

[TABLE]

for some function $v:[0,T]\times\mathds{R}^{d}\to\mathds{R}^{d}$ , we can express the expectation (103) as

[TABLE]

where

[TABLE]

This identity can be used to construct an unbiased estimator of $A^{\epsilon}$ by replacing the expectation in (107) by an empirical expectation over $M$ independent copies of $Y^{\varepsilon}_{t}$ , similar to what was done to obtain the vanilla estimator (104). The question is how to best choose the importance sampling bias $v(t,x)$ to lower the variance of this new estimator. An intuitive idea would be to use the instanton to do so Bucklew (2004); Pelissetto and Ricci-Tersenghi (2014). For example, it has been suggested to take

[TABLE]

where $(\phi(t),\theta(t))$ is the instanton position and momentum corresponding to the expectation (103), i.e. taking $Y_{t}^{\varepsilon}$ to be, respectively, the stochastic process

[TABLE]

The intuition is that using either one of the processes in (110) biases the dynamics towards the dominating path $\phi(t)$ . Similar ideas, inspired from lattice quantum chromodynamics, have entered through stochastic field theory to bias Monte Carlo methods with the knowledge of the instanton. For example in Ebener et al. (2018) the instanton for the stochastic Burgers equation is used precisely in the way of (110) to sample a modified process describing the fluctuations around it, getting improved statistics in the rare event regime.

Although it has been pointed out that this strategy does not succeed to decrease variance in general, or might even perform worse than the naive one in the limit as $\epsilon\to 0$ Glasserman and Wang (1997), it can be modified Vanden-Eijnden and Weare (2012) to achieve optimal variance decay by recomputing the instanton trajectory on-the-fly.

VI.2 Instantons for cloning algorithms

There is another way to incorporate knowledge of the instanton within importance sampling, namely through algorithms of genealogical type Del Moral (2004); Cerou and Guyader (2007); Giardina et al. (2011); Wouters and Bouchet (2016). In these methods, an ensemble of trajectories (aka particles, copies, or clones) is integrated, and particles are removed or duplicated according to some rating that selects behaviors favorable to the event at hand. To explain how this can be done in the context of rare event algorithms, let us focus on the second choice for $v(t,x)$ in (109) since out of the two it is the one that requires the least modification of the drift111In complex situations, such as applications in meteorology or climate where only a block-box type solver is available, explicit modification of the drift should be kept at a minimum.. The second equation in (110) reads

[TABLE]

along with the estimator for (103)

[TABLE]

We begin by rewriting this last formula in a form that is more convenient for resampling. To this end, let us integrate the following identity

[TABLE]

to get

[TABLE]

where we defined

[TABLE]

These manipulations allow us to write the expectation (112) as

[TABLE]

where

[TABLE]

The expression (116) can be used to design a genealogical algorithm in which $W^{\varepsilon}_{t}$ is viewed as a weight that each of the particles carries and according to which they are periodically resampled. More concretely, consider $M$ copies of the system, all evolving according to the SDE (111). Denote by $Y_{t}^{i}$ the position of the $i$ -th copy at time $t$ and by $W_{t}^{i}$ its weight. The particle positions and weights are evolved independently on intervals $(t_{k-1},t_{k})$ , where $t_{k}$ , $k\in\mathds{N}_{0}$ with $t_{k-1}<t_{k}$ are selection steps when the resampling occurs. It proceeds as follows: Denoting by $\Delta W^{i}_{k}$ the weight accumulated by particle $i$ on the interval $(t_{k-1},t_{k})$ , i.e.

[TABLE]

we compute

[TABLE]

and choose independently (with replacement) $M$ copies in the set $\{Y^{i}_{t_{k}}\}_{i=1}^{M}$ , using probability $p^{i}_{k}$ to pick copy $Y^{i}_{t_{k}}$ . We then use the resulting copies as new set $\{Y^{i}_{t_{k}}\}_{i=1}^{M}$ , assign to each the same weight $W^{i}_{t_{k}}=\prod_{l=1}^{k}\Delta\bar{W}_{l}$ , and repeat on the next interval $(t_{k},t_{k+1})$ .

As a result of this procedure, we have at any time a set of $M$ copies with nearly uniform weights (since they only diverge from one another during the intervals $(t_{k},t_{k+1})$ ), that provides us with the following expression for $A^{\varepsilon}$ (compare with (116))

[TABLE]

where $\mathds{E}^{x}$ denotes expectation over both the noise term in (111) and the resampling steps. If $M$ is large enough we can simply build an unbiased estimator for $A^{\varepsilon}$ by using $\zeta_{M}$ directly (i.e, removing the expectation); or we can repeat the estimation $R$ times with $M$ copies and replace the expectation in (120) with an empirical average over the values of $\zeta_{M}$ calculated in these $R$ runs.

The variance of the estimator based on (120) has been analyzed e.g. in Del Moral (2004); Cerou and Guyader (2007), where other variants of the algorithm (e.g. in terms of the resampling step) are also discussed. Let us simply mention here that this estimator may not be better behaved than the one directly based on (116) (i.e. the one using no resampling based on the values of the weights), but it offers multiple possibilities of modifications that can systematically improve its variance—we refer the reader to Ferré, Grafke, and Vanden-Eijnden for more details.

Similar approaches are possible with adaptive multilevel splittingCerou and Guyader (2007); Bréhier et al. (2016), where again a rating function (“reaction coordinate”) has to be found to evaluate the performance of multiple copies, and for which the instanton dynamics can be taken as input.

VI.3 Example: Epidemiology and vaccination at birth

To illustrate the scheme above, we consider the following compartmental model inspired from epidemiology, where the spread of a disease is modeled in the presence of vaccination. The total population of individuals, denoted by $N$ , is comprised of individuals susceptible to the disease ( $S$ ), individuals that are infected ( $I$ ), individuals that are recovered and thus immune ( $R$ ) and individuals that are vaccinated and thus immune ( $V$ ). Individuals are born and die with the same rate $\mu$ , so that the total population remains constant. In this simple rendition of disease spread with vaccination, the vaccine is administered at birth, and with a vaccination rate of $q$ . In other words, a child is born vaccinated with probability $q$ or susceptible otherwise. The disease is transmitted by contact between infected and susceptible individuals, with contact rate $\beta$ , while recovery is associated with the recovery rate $\gamma$ . In total, the model therefore reads

[TABLE]

Interestingly, depending on the vaccination rate, the model (121) results in either total eradication of the disease after transient dynamics (disease free equilibrium), or a fixed point where the disease is still present (endemic equilibrium). More precisely, the reproduction number

[TABLE]

describes the average number of contacts per infected individual (i.e. the ratio between contact frequency and the frequencies associated with recovery or death). If the vaccination rate $q$ exceeds a threshold $q^{*}$ ,

[TABLE]

then the disease will be eradicated eventually. Note that, since the dynamics of $S$ and $I$ are independent of $V$ and $R$ , it is enough to consider the first two equations of (121) to establish whether the disease is eradicated in the long-time limit. Furthermore, we will normalize the quantities to ratios in $[0,1]$ .

In order to produce estimates of probabilities in this system, we furthermore need to make assumptions about stochasticity present in the quantities $S$ and $I$ . It is natural to interpret the rate equations (121) as the law of mass action of a reaction network, transforming the species into each other, which would lead to Poisson noise terms as encountered in section IV. For large population sizes, one could also consider a multiplicative Gaussian noise, consistent with the central limit theorem, similar to the discussion of the stochastic Lotka-Volterra model in section III.2. Using this approximation, the stochastic system reads

[TABLE]

where $W_{S}$ and $W_{I}$ are independent Wiener processes.

As observable, we want to estimate the probability that after time $T$ we have reached an unusually high ratio $z\in[0,1]$ of infected individuals, $P(I(T)\geq z)$ , which we can write as expectation via

[TABLE]

for the Heaviside step function $\Theta$ . Using this observable, we can compare the naive estimator with the cloning estimator.

We are choosing parameters $\mu=0.1$ , $\beta=0.8$ , $\gamma=0.2$ and $N=1500$ , which result in a critical vaccination rate of $q^{*}=0.625$ . Here, we set $q=0.1$ instead, resulting in an endemic equilibrium $(\bar{S},\bar{I})=(0.375,0.175)$ . The threshold is set to $z=0.2$ , i.e. we want to estimate the probability to have a ration of infected individuals above 20% after $T=100$ . The results for three different values of $\varepsilon$ are shown in figure 9, depicting the logarithm of a histogram of final trajectories $(S(t),I(t))$ . The streamlines describe the deterministic drifts given in equation (121), while the white line is the instanton to reach the threshold. Indeed, we observe that for $\varepsilon=1$ , reaching the threshold infected rate is not a rare event, the instanton has no predictive power and the cloning algorithm performs equally or worse to naive sampling. For $\varepsilon=10^{-1}$ , the event is rarely observed for naive sampling, resulting in larger variance estimates of the probability. The configurations resulting from the cloning algorithm instead show a higher prevalence of increased infection rates. This effect is especially pronounced for $\varepsilon=10^{-2}$ , where we do not observe any sample reaching the threshold in the naive case, and where the cloning samples clearly track the instanton trajectory towards the threshold. The relative errors of the two estimators are summarized in table 1. With decreasing $\varepsilon$ (and therefore exponentially decreasing probability of the rare event), the variance of the naive estimator blows up, while the variance of the cloning estimator remains largely unchanged.

VII Conclusion

Summarizing, in this review we presented a collection of algorithms to estimate rare event probabilities and properties by computing the large deviation minimizer (instanton). They are largely divided into two categories:

In the first category, one minimizes the rate function globally, by discretizing it and then employing numerical minimization techniques. Traditional members of this category are the minimum action method (MAM) E, Ren, and Vanden-Eijnden (2004) and the geometric MAM (gMAM) Heymann and Vanden-Eijnden (2008). Here, we provide a simplified and optimized version of the second, the simplified gMAM, that allows for carrying out the optimization in the space of arc-length parametrized curves with a minimal number of necessary derivatives of the large deviation Hamiltonian. Effectively, this translates into gains in either run-time or implementation complexity over traditional variants. Methods in this category are particularly suited for computing transition trajectories between two sets or points.

In the second category, one instead solves the Hamilton’s equations (or instanton equations) associated with the large deviation Hamiltonian. In this category are the Chernykh-Stepanov algorithm Chernykh and Stepanov (2001) and its geometric variant Grafke et al. (2014). Here, we provide an interpretation of these algorithms in form of the adjoint formulation of the optimization problem. Methods in this category are effectively employed when the intention is to compute expectations along sample paths, or loosely speaking most likely realizations of extreme events.

Even though these formalism constitute dual approaches to the same problem, we saw that they are drastically different in terms of applicability: For example, degenerate forcing is easily incorporated into Hamilton’s equations, but constitutes a numerical difficulty in the form of stiff constraints for MAM-type algorithms. Conversely, traversing a saddle point or crossing a separatrix is readily achieved in MAM-type schemes, but leads to loss of convergence in the equations of motion formulation.

Nevertheless, both approaches can be generalized to treat SDEs driven by multiplicative noise as well as stochastic processes driven by non-Gaussian noise. They can also handle, at least formally, infinite dimensional processes, like the solutions of SPDEs. These approaches can also be extended in multiple ways. Here we discussed how related considerations apply to the case of dynamical systems with generic random parameters and we also showed how instantons can be used as input in importance sampling algorithms.

Acknowledgements.

We thank Freddy Bouchet, Grégoire Ferré, Robert Jack, and Jonathan Weare for useful discussions about genealogical algorithms. EVE is supported in part by the Materials Research Science and Engineering Center (MRSEC) program of the National Science Foundation (NSF) under award number DMR-1420073 and by NSF under award number DMS-1522767.

Bibliography35

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Janssen (1976) H.-K. Janssen, “On a Lagrangean for classical field dynamics and renormalization group calculations of dynamical critical properties,” Zeitschrift für Physik B Condensed Matter 23 , 377–380 (1976) . · doi ↗
2de Dominicis (1976) C. de Dominicis, “Techniques de renormalisation de la théorie des champs et dynamique des phénomènes critiques,” J. Phys. C 1 , 247 (1976).
3Martin, Siggia, and Rose (1973) P. C. Martin, E. D. Siggia, and H. A. Rose, “Statistical Dynamics of Classical Systems,” Physical Review A 8 , 423–437 (1973) . · doi ↗
4Doi (1976) M. Doi, “Second quantization representation for classical many-particle system,” Journal of Physics A: Mathematical and General 9 , 1465 (1976) . · doi ↗
5Peliti (1986) L. Peliti, “Renormalisation of fluctuation effects in the A+A to A reaction,” Journal of Physics A: Mathematical and General 19 , L 365 (1986) . · doi ↗
6E, Ren, and Vanden-Eijnden (2002) W. E, W. Ren, and E. Vanden-Eijnden, “String method for the study of rare events,” Physical Review B 66 , 052301 (2002) . · doi ↗
7E, Ren, and Vanden-Eijnden (2007) W. E, W. Ren, and E. Vanden-Eijnden, “Simplified and improved string method for computing the minimum energy paths in barrier-crossing events,” J. Chem. Phys. 126 , 164103 (2007).
8Grafke (2018) T. Grafke, “String Method for Generalized Gradient Flows: Computation of Rare Events in Reversible Stochastic Processes,” (2018) .

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Numerical computation of rare events via large deviation theory

Abstract

Contents

I Introduction

I.1 Freidlin-Wentzell theory of large deviations

I.2 Hamiltonian principle and connections to classical

I.3 Detailed balance and gradient flows

II Rare event algorithms for noise-induced transitions

II.1 A simplified geometric minimum action method

II.2 Example: Metastability in a simple climate model

II.3 Instantons for stochastic partial differential equations

II.4 Example: The stochastic Burgers-Huxley equation

III Rare event algorithms for expectations and extreme events

III.1 Instantons for expectations and extreme events

III.2 Example: Extreme concentration of prey in the Lotka-Volterra model

III.3 Example: Extreme amplitudes in the Korteweg-de Vries equation

III.4 Connections to optimal control

III.5 Improvements and implementation considerations

III.6 Geometric version of the Hamilton’s equations

III.7 Example: Extreme gradients of the stochastic Burgers equation

IV Generalizations to the non-Gaussian case

IV.1 Large deviation principles as WKB approximation

IV.2 Example: Genetic switch

V Systems with random parameters and extreme events

V.1 Large deviations for systems with random parameters

V.2 Example: Optimal excitation of the Fitzhugh-Nagumo model

VI Instantons as part of other rare event algorithms

VI.1 Instantons for importance sampling

VI.2 Instantons for cloning algorithms

VI.3 Example: Epidemiology and vaccination at birth

VII Conclusion

Acknowledgements.