Rare Event Simulation for Steady-State Probabilities via Recurrency Cycles

Krzysztof Bisewski, Daan Crommelin, and Michel Mandjes

arXiv:1904.02966·math.PR·April 9, 2019

TL;DR

This paper introduces Recurrent Multilevel Splitting (RMS), a novel algorithm leveraging the recurrent structure of Markov chains to efficiently estimate rare event probabilities in steady-state systems, significantly outperforming traditional Monte Carlo methods.

Contribution

The paper presents RMS, a new algorithm that combines recurrence properties with multilevel splitting to improve rare event probability estimation in continuous-state Markov processes.

Findings

01

In the reported numerical experiments, RMS is up to several orders of magnitude more efficient than crude Monte Carlo.

02

The experiments cover Ornstein-Uhlenbeck (OU) processes in 1, 2, and 10 dimensions as well as the model of Franzke (2012).

03

The method is demonstrated on a nonlinear stochastic model that has some characteristics of complex climate models.

Abstract

We develop a new algorithm for the estimation of rare event probabilities associated with the steady-state of a Markov stochastic process with continuous state space $\mathbb{R}^{d}$ and discrete time steps (i.e. a discrete-time $\mathbb{R}^{d}$-valued Markov chain). The algorithm, which we coin Recurrent Multilevel Splitting (RMS), relies on the Markov chain's underlying recurrent structure, in combination with the Multilevel Splitting method. Extensive simulation experiments are performed, including experiments with a nonlinear stochastic model that has some characteristics of complex climate models. The numerical experiments show that RMS can boost the computational efficiency by several orders of magnitude compared to the Monte Carlo method.

Tables

Table 1: RMS algorithm for a 1-dim OU process. Parameters: $Q=1$, $A=\{x_{1}\leq 0\}$, $B=\{x_{1}\geq u\}$; $u$ has been chosen using (5.11) to match the values of $\gamma$ in the first row. We have $\widehat{\alpha}_{A}=0.0225$ and ${\rm RE}(\widehat{\alpha}_{A})=1.66\cdot 10^{-3}$.

$γ (u)$	$10^{- 3}$	$10^{- 4}$	$10^{- 5}$	$10^{- 6}$	$10^{- 7}$
$\hat{γ}$	$9.94 \cdot 10^{- 4}$	$9.93 \cdot 10^{- 5}$	$9.96 \cdot 10^{- 6}$	$9.96 \cdot 10^{- 7}$	$9.96 \cdot 10^{- 8}$
$RE (\hat{γ})$	3.95e-03	5.45e-03	6.53e-03	6.31e-03	5.49e-03
$Eff (\hat{γ})$	4.1	8.9	45.2	378.9	1836.2
$RE ({\hat{T}}_{B})$	3.90e-03	4.99e-03	6.42e-03	6.30e-03	5.32e-03

Table 2: RMS algorithm for a 10-dim OU process. Parameters: $Q$ is a matrix with only real eigenvalues, $A=\{x_{1}\leq 0\}$, $B=\{x_{1}\geq u\}$; $u$ has been chosen using (5.11) to match the values of $\gamma$ in the first row. We have $\widehat{\alpha}_{A}=0.0124$, ${\rm RE}(\widehat{\alpha}_{A})=2.46\cdot 10^{-3}$.

$γ (u)$	$10^{- 3}$	$10^{- 4}$	$10^{- 5}$	$10^{- 6}$	$10^{- 7}$
$\hat{γ}$	$1.00 \cdot 10^{- 3}$	$9.95 \cdot 10^{- 5}$	$1.02 \cdot 10^{- 5}$	$9.92 \cdot 10^{- 7}$	$1.00 \cdot 10^{- 7}$
$RE (\hat{γ})$	7.84e-03	1.03e-02	1.35e-02	1.12e-02	1.49e-02
$Eff (\hat{γ})$	0.8	2.4	9.3	34.9	180.5
$RE ({\hat{T}}_{B})$	7.87e-03	1.02e-02	1.35e-02	1.12e-02	1.49e-02

Table 3: RMS algorithm applied to a 2-dim OU process. Parameters: $Q(\theta)$ as in (5.12), $A=\{x_{1}\leq 0\}$, $B=\{x_{1}\geq u\}$; $u$ has been chosen depending on $\theta$ such that in every case $\gamma(u)=10^{-6}$.

$θ$	0.5	1	1.5	2	3
$\hat{γ}$	$9.91 \cdot 10^{- 7}$	$1.00 \cdot 10^{- 6}$	$1.00 \cdot 10^{- 6}$	$9.73 \cdot 10^{- 7}$	$9.60 \cdot 10^{- 7}$
$RE (\hat{γ})$	8.20e-03	1.05e-02	2.34e-02	2.66e-02	4.01e-02
$Eff (\hat{γ})$	31.9	27.9	7.1	5.8	1.0
$RE ({\hat{T}}_{B})$	7.63e-03	1.05e-02	2.37e-02	2.67e-02	4.01e-02

Table 4: RMS algorithm applied to the model of Franzke (2012). Parameters: $A=\{x_{1}\leq 7.9\}$, $B=\{x_{1}>u\}$. We have $\widehat{\alpha}_{A}=0.0124$, ${\rm RE}(\widehat{\alpha}_{A})=2.83\cdot 10^{-3}$.

$u$	14	15	16	17.5	18.5
$\hat{γ}$	$1.08 \cdot 10^{- 3}$	$1.99 \cdot 10^{- 4}$	$3.00 \cdot 10^{- 5}$	$1.14 \cdot 10^{- 6}$	$9.78 \cdot 10^{- 8}$
$RE (\hat{γ})$	6.1e-03	7.2e-03	7.4e-03	7.4e-03	5.8e-03
${\hat{γ}}^{MC}$	$1.08 \cdot 10^{- 3}$	$2.00 \cdot 10^{- 4}$	$2.98 \cdot 10^{- 5}$	$1.12 \cdot 10^{- 6}$	$8.85 \cdot 10^{- 8}$
$RE ({\hat{γ}}^{MC})$	1.4e-03	2.9e-03	6.5e-03	2.7e-02	8.5e-02
$Eff (\hat{γ})$	1.9	8.6	32.1	269.9	1521.8
$RE ({\hat{T}}_{B})$	5.1e-03	6.4e-03	7.2e-03	6.6e-03	5.4e-03

Equations

$$\gamma := \mu(B) = \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\mathds{1}\{X_{n}\in B\}$$

$$\widehat{\gamma}_{\text{MC}} := \frac{1}{N}\sum_{n=1}^{N}\mathds{1}\{X_{n}\in B\},$$

$$\mathbb{P}(X_{n+1}\in A \mid X_{n}=x) = \int_{A}P(x,{\rm d}y)$$

$$\mu(A) = \int_{\mathbb{R}^{d}}\mu({\rm d}x)\,P(x,A).$$

$$S_{k} := \inf\{n>S_{k-1} : X_{n-1}\not\in A,\ X_{n}\in A\}.$$

$$\mathcal{C}_{k} := \big(X_{n} : S_{k-1}\leq n\leq S_{k}-1\big).$$

$$L_{k} := S_{k}-S_{k-1}, \qquad X_{k}^{A} := X_{S_{k-1}}.$$

$$R_{k} := \sum_{n=S_{k-1}}^{S_{k}-1}\mathds{1}\{X_{n}\in B\}.$$

$$\mu(B) = \alpha_{A}\cdot T_{B}$$

$$\widehat{\alpha}_{k} := \frac{1}{M}\sum_{n=(k-1)M+1}^{kM}\mathds{1}\{X_{n-1}\not\in A,\ X_{n}\in A\},$$

$$\widehat{\alpha}_{A}^{\text{BM}} := \frac{1}{m}\sum_{k=1}^{m}\widehat{\alpha}_{k}.$$

$$\sqrt{m}\,\big(\widehat{\alpha}_{A}^{\text{BM}}-\alpha_{A}\big)/s_{\text{BM}} \xrightarrow{\textnormal{d}} t_{m-1},$$

$$(X_{0}^{(1)},X_{1}^{(1)}),\ \ldots,\ (X_{0}^{(M)},X_{1}^{(M)})$$

$$\widehat{\alpha}_{A}^{\text{MC}} := \frac{1}{M}\sum_{i=1}^{M}\mathds{1}\{X_{0}^{(i)}\not\in A,\ X_{1}^{(i)}\in A\}$$

$$\sqrt{M}\,\big(\widehat{\alpha}_{A}^{\text{MC}}-\alpha_{A}\big)/s_{\text{MC}} \xrightarrow{\textnormal{d}} \mathcal{N}(0,1),$$

$$\tau_{B} := \inf\{n>0 : X_{n}\in B\}, \qquad \tau_{A}^{\text{in}} := S_{1} = \inf\{n>0 : X_{n-1}\not\in A,\ X_{n}\in A\},$$

$$R_{+} :\stackrel{\rm d}{=} \big(R_{1}\,|\,R_{1}>0\big)$$

$$\mathbb{E}(R_{1}) = \mathbb{P}(R_{1}>0)\cdot\mathbb{E}(R_{1}\mid R_{1}>0).$$

$$T_{B} = \mathbb{P}(\tau_{B}<\tau_{A}^{\text{in}})\cdot\mathbb{E}(R_{1}\mid\tau_{B}<\tau_{A}^{\text{in}}) = p_{B}\cdot\mathbb{E}R_{+}$$

$$0=\ell_{0}<\ell_{1}<\ldots<\ell_{m}=1,$$

$$\tau_{k} := \inf\{n\geq 0 : H(X_{n})\geq\ell_{k}\}, \qquad D_{k} := \{\tau_{k}<\tau_{A}^{\text{in}}\};$$

$$p_{k} := \mathbb{P}(D_{k}\mid D_{k-1}), \qquad k\in\{1,\ldots,m\},$$

$$p_{B} = \prod_{k=0}^{m}p_{k}.$$

$$X_{k}^{i} := X^{i}_{\tau_{k}^{i}}.$$

$$\widehat{p}_{B} := \frac{r_{m}}{\prod_{k=0}^{m-1}n_{k}}.$$

$$R_{+}^{(j)} := \sum_{k=\tau_{m}}^{\tau_{A}^{\text{in}}-1}\mathds{1}\{X_{k}\in B\}.$$

$$r_{m+1} := \sum_{j=1}^{r_{m}n_{m}}R_{+}^{(j)}$$

$$\widehat{T}_{B} := \frac{r_{m+1}}{\prod_{k=0}^{m}n_{k}}$$

$$\mathbb{E}\,\widehat{T}_{B}$$

$$\mathbb{E}\bigg(\frac{r_{k}}{\prod_{i=0}^{k-1}n_{i}}\bigg) = \mathbb{P}(D_{k}) = p_{1}\cdots p_{k}.$$

Full text

Rare Event Simulation for Steady-State Probabilities via Recurrency Cycles

This article may be downloaded for personal use only. Any other use requires prior permission of the author and AIP Publishing. This article appeared in K. Bisewski et al., Chaos: An Interdisciplinary Journal of Nonlinear Science 29, no. 3 (2019): 033131, and may be found at https://doi.org/10.1063/1.5080296.

Krzysztof Bisewski

Centrum Wiskunde & Informatica, Amsterdam

Daan Crommelin

Centrum Wiskunde & Informatica, Amsterdam

Korteweg de Vries Institute for Mathematics, University of Amsterdam

Michel Mandjes

Korteweg de Vries Institute for Mathematics, University of Amsterdam

1 Introduction

Many stochastic processes have a ‘stable regime’, in the sense that with time their distribution converges to a so-called steady-state. The steady-state (or stationary, equilibrium, ergodic) probability distribution captures the long-term behavior of the process; the steady-state probability of an arbitrary event (or set) $B$ is equal to the fraction of time the process spends in $B$ in the long run (irrespective of the process’ initial value). In many application domains steady-state probabilities are of crucial interest; think of physics (e.g. particle systems), chemistry (e.g. reaction networks), and operations research (e.g. queueing systems). Within this context of steady-state distributions, an important subdomain concerns the analysis of rare events. Particularly when it concerns rare events with a potentially catastrophic impact, there is a clear need to accurately estimate their likelihood (earthquakes, extreme weather conditions, simultaneous failure of multiple components of a machine, etc.). As examples, we refer to Ragone et al. (2018) for rare-event simulation methods in the climate context, and to Rubino and Tuffin (2009) for a textbook treatment covering applications in e.g. engineering, chemistry, and biology.

Despite the evident importance of being able to estimate steady-state rare-event probabilities, relatively little attention has been paid to the development of efficient algorithms; rare-event simulation in a finite-time horizon context has received considerably more attention (focusing e.g. on the estimation of the probability to hit a set $B_{1}$ before hitting another set $B_{2}$). The main contribution of this paper concerns the development of a broadly applicable rare-event simulation method that is tailored to the estimation of small steady-state probabilities.

In our setup we focus on discrete-time $\mathbb{R}^{d}$ -valued Markov chains. This framework covers a wide class of intensively used stochastic models. It for instance includes the numerical solutions to stochastic differential equations (SDEs), see e.g. Kloeden and Platen (1992). In addition, various (inherently discrete-time) standard models from e.g. finance, biology, and econometrics fall under this umbrella. The main advantage of our proposed algorithm is its broad applicability, the fact that it does not require detailed knowledge of the system under study, and that it is fairly straightforward to implement. In the sequel, we let $(X_{n})_{n\in\mathbb{N}}$ be our $d$ -dimensional Markov chain, which we assume to admit the stationary distribution $\mu$ . We are interested in the probability that in steady-state the process attains a value in the set $B$ , i.e.,

$$\gamma := \mu(B) = \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\mathds{1}\{X_{n}\in B\}.$$

Throughout, the event $B$ is assumed to be rare, entailing that $\gamma$ is very small, typically of order $10^{-4}$ or less (depending on the application at hand).

Our interest lies in estimating rare-event probabilities in the context of models, so in principle we can do more than applying statistical methods of extreme value analysis to model data; cf. Coles et al. (2001) for a textbook on Extreme Value Analysis. In our setup, the steady-state distribution is not explicitly known; one therefore has to resort to simulation. The naïve, Monte Carlo estimator for $\gamma$ is

$$\widehat{\gamma}_{\text{MC}} := \frac{1}{N}\sum_{n=1}^{N}\mathds{1}\{X_{n}\in B\},$$

i.e., the fraction of the first $N$ time steps that the chain spends in the set $B$, which is known to be extremely inefficient when $B$ is rare; see e.g. Asmussen and Glynn (2007). Informally, one needs prohibitively many samples to obtain a reasonably accurate estimate of $\gamma$: the number of samples required to reach a given relative precision is inversely proportional to $\gamma$. In many cases, especially for complex or high-dimensional systems where integrating the model is time consuming, such a computation is not feasible.
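To make the cost of the naïve estimator concrete, the following minimal sketch applies it to a toy AR(1) chain $X_{n+1}=aX_{n}+\sigma Z_{n}$ with standard normal $Z_{n}$ and $B=\{x\geq u\}$. The model, the function names `ar1_step` and `crude_mc`, and all parameter values are illustrative assumptions and do not come from the paper.

```python
import random


def ar1_step(x, rng, a=0.5, sigma=1.0):
    """One step of the toy chain X_{n+1} = a*X_n + sigma*Z (assumed example model)."""
    return a * x + sigma * rng.gauss(0.0, 1.0)


def crude_mc(n_steps, u, seed=0, burn_in=1000):
    """Time-average estimate of gamma = mu(B) for B = {x >= u}."""
    rng = random.Random(seed)
    x = 0.0
    # Discard a burn-in so that the chain is approximately stationary.
    for _ in range(burn_in):
        x = ar1_step(x, rng)
    hits = 0
    for _ in range(n_steps):
        x = ar1_step(x, rng)
        hits += x >= u  # indicator 1{X_n in B}
    return hits / n_steps
```

For this chain the stationary law is $\mathcal{N}(0,\sigma^{2}/(1-a^{2}))$, so the estimate can be compared with the exact tail probability. For a threshold such as $u=6$, where $\gamma$ is of order $10^{-7}$, a run of a few hundred thousand steps would typically record no visit to $B$ at all, which illustrates why the estimator breaks down for rare sets.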

An additional complication is that sampling directly from the steady-state distribution can be challenging. In our new method, we settle this issue by dissecting the paths of the underlying Markov chain into recurrency cycles. For an arbitrary set $A$ , we say that a recurrency occurs each time $(X_{n})_{n\in\mathbb{N}}$ crosses $A$ inwards, i.e., each time the event $\{X_{n-1}\not\in A,X_{n}\in A\}$ occurs. Assuming the process is in stationarity, $\gamma$ is equal to the average amount of time spent in $B$ between two visits to the set $A$ , divided by the average length of a recurrency cycle.

An example of a recurrency cycle is shown in Figure 1. It starts at $P_{1}$ and ends at $P_{5}$ ; the time spent in set $B$ is the time spent between states $P_{3}$ and $P_{4}$ . Note that recurrency is defined with respect to $A$ ; it is not necessary that the system enters $B$ during a recurrency cycle.

In our algorithm we separately estimate the numerator (expected time spent in $B$ during a single recurrency cycle) and the denominator (expected length of a single recurrency cycle). Here, two challenges arise. The first concerns the choice of the set $A$ . Any $A$ could in principle be used, but in order to maximize the efficiency of the algorithm, it should be chosen so as to minimize the expected time spent between visits to the set $A$ . The second challenge is posed by the rarity of visiting $B$ within a cycle. To tackle this issue, we propose the use of Multilevel Splitting (MLS), see Garvels (2000), Rubino and Tuffin (2009), but we remark that instead of MLS other methods could be chosen. These alternatives include Genealogical Particle Analysis (see e.g. Del Moral and Garnier (2005)), RESTART (see e.g. Villén-Altamirano and Villén-Altamirano (2011)), Adaptive Multilevel Splitting (see e.g. Cérou and Guyader (2007)), fixed-effort and fixed number of successes versions of Multilevel Splitting (see e.g. Amrein and Künsch (2011)) and Importance Sampling (see e.g. Heidelberger (1995)). We emphasize that we do not seek to compete with any of the aforementioned methods but rather introduce a new overarching framework, in which all these methods can be used to assess stationary performance metrics. We have chosen to work with MLS mostly for its conceptual simplicity and intuitive use.

The algorithm we propose is inspired by expressions for steady-state probabilities resulting from the theory of regenerative processes. Regeneration instances dissect the path of the process into probabilistically identical, independent segments. For regenerative processes we have that $\gamma$ equals the average amount of time spent in $B$ in a regeneration cycle divided by the average length of a regeneration cycle. For more background we refer to Crane and Iglehart (1975) and Asmussen (2008), or (in a more informal language) Henderson and Glynn (1999). In our setup, with its uncountable state space and a steady-state distribution potentially lacking atoms, we cannot straightforwardly construct regeneration points. We therefore develop an approach that relies on the recurrency cycles introduced above, so as to set up a scheme that yields probabilistically identical (but not necessarily independent) cycles. We refer to Goyal et al. (1992) for an algorithm corresponding to the setting in which the set $A$ consists of finitely many elements (which inspired us to develop our algorithm). We also mention that a large subclass of general (continuous) state-space Markov chains, called positive Harris, is regenerative. However, constructing regeneration cycles in this context is typically technically difficult, and in addition the implementation may be computationally inefficient due to excessively long cycle lengths; see Henderson and Glynn (2001).

The manuscript is organized as follows. In Section 2 we discuss preliminaries, such as basic theory of general state-space Markov chains. We also give an alternative representation of the parameter $\gamma$ based on the recurrent structure of a Markov chain in Theorem 1. Relying on this alternative representation, in Section 3, we introduce a new algorithm for the estimation of $\gamma$, which we coin Recurrent Multilevel Splitting (RMS). In Section 4, we establish (in a simplified setting) the optimal parameters for the RMS algorithm, and provide implementation-related guidelines. Theorem 3 in Appendix C establishes the asymptotic efficiency of the RMS algorithm. A technical derivation of the optimal parameters is given in Appendix B. In Section 5 we test the method on a set of numerical examples, discuss which factors affect the method's performance, and provide heuristics. Finally, in Section 6 we discuss possible extensions of the algorithm and give a summary. Appendix A consists of a collection of required technical results.

2 Preliminaries

Here we introduce concepts used later in Section 3 such as (Harris) recurrence, the stationary measure and recurrency cycles.

2.1 Continuous State-Space Markov Chains

In this subsection we provide some background on the (well-established) theory of stability of discrete-time Markov chains with a general (continuous) state-space. The underlying theory can be found in textbooks on Markov chains; our notation is in line with the one used in Meyn and Tweedie (2012).

The theory of stability for general state-space discrete-time Markov chains differs from the one for their finite (or countable) state-space counterparts. Due to the continuous state space, multiple visits to the same state may happen with probability 0. This explains why the classic notions of irreducibility and recurrence of states have been generalized to sets (rather than states). In this setting one typically works with the concept of so-called positive Harris recurrent chains: sets of states are guaranteed to be visited infinitely often, with in addition a finite expected return time. Effectively all Markov chains with an invariant probability distribution are positive Harris (with the exception of pathological, custom-made examples); see (Meyn and Tweedie, 2012, Section 9) for a rigorous treatment of the topic.

Let $(X_{n})_{n\in\mathbb{N}}$ be a Markov chain taking values in $\mathbb{R}^{d}$ with a transition kernel $P(x,{\rm d}y)$ , meaning that the distribution of $X_{n+1}$ conditional on $X_{n}=x$ is given by

$$\mathbb{P}(X_{n+1}\in A \mid X_{n}=x) = \int_{A}P(x,{\rm d}y) \tag{2.1}$$

for measurable sets $A\subseteq\mathbb{R}^{d}$ . We denote $P(x,A):=\int_{A}P(x,{\rm d}y)$ . Then, the stationary distribution $\mu$ satisfies the relation

$$\mu(A) = \int_{\mathbb{R}^{d}}\mu({\rm d}x)\,P(x,A).$$

For an arbitrary probability measure $\nu$ , we define the conditional probability and expectation by $\mathbb{P}_{\nu}(\cdot)=\mathbb{P}(\cdot\,|\,X_{0}\sim\nu)$ and $\mathbb{E}_{\nu}(\cdot)=\mathbb{E}(\cdot\,|\,X_{0}\sim\nu)$ , respectively. In particular, when $\nu$ corresponds to a point mass at $x$ , we use the compact notations $\mathbb{P}_{x}(\cdot)=\mathbb{P}(\cdot\,|\,X_{0}=x)$ and $\mathbb{E}_{x}(\cdot)=\mathbb{E}(\cdot\,|\,X_{0}=x)$ , respectively.

2.2 Recurrent Structure of a Markov Chain

As mentioned in the introduction, a large class of general state-space Markov chains (more specifically, the class of positive Harris recurrent Markov chains) allows a regenerative structure; see e.g. Henderson and Glynn (2001). However, for application purposes, it is often difficult to sample the regeneration times. Moreover, even when it is possible to sample these, the implementation is often inefficient due to the long cycle lengths — in fact, the regeneration may be a rare event itself.

There are many other ways to decompose a Markov chain into cycles. In this paper we propose to work with cycles that start with an inward crossing of a set $A$ (i.e., entering $A$ from the outside). We denote the time of the $(k+1)$ -th inward crossing by $S_{k}$ , i.e.,

$$S_{k} := \inf\{n>S_{k-1} : X_{n-1}\not\in A,\ X_{n}\in A\},$$

with $S_{-1}:=0$ . Then, we define the paths within the cycles through

$$\mathcal{C}_{k} := \big(X_{n} : S_{k-1}\leq n\leq S_{k}-1\big). \tag{2.2}$$

With a $k$ -th cycle we associate the cycle length and the cycle origin (or starting point),

$$L_{k} := S_{k}-S_{k-1}, \qquad X_{k}^{A} := X_{S_{k-1}}. \tag{2.3}$$

We call $A$ the recurrency set and $\mathcal{C}_{1},\mathcal{C}_{2},\ldots$ recurrency cycles. Under the assumption that the process $(X_{n})_{n\in\mathbb{N}}$ starts in a cycle-stationary regime (that is, $X_{0}\sim\mu$ and $S_{0}=1$), the pairs $(\mathcal{C}_{1},L_{1}),$ $(\mathcal{C}_{2},L_{2}),\ldots$ are identically distributed. However, the cycles (2.2) are generally not independent, as two distinct cycle origins $X^{A}_{k}$, $X^{A}_{m}$ separated by a short time period $S_{m-1}-S_{k-1}$ tend to be located within the same subregion of the recurrency set. Because of this dependence, the decomposition into recurrency cycles is neither classic nor wide sense regenerative; see Definitions 3.1 and 3.3 in Kalashnikov (1994). The way we define cycles is a special case of the almost regenerative cycles introduced by Gunther and Wolff (1980). The interested reader is referred to the introduction of Calvin et al. (2006), where a more exhaustive account of different regeneration-type methods is given.

A single recurrency cycle reflects the behavior of the process in steady-state. To make this claim more precise, define the total time spent in the set $B$ within the $k$ -th cycle:

$$R_{k} := \sum_{n=S_{k-1}}^{S_{k}-1}\mathds{1}\{X_{n}\in B\}. \tag{2.4}$$

Since (in a cycle-stationary regime) the cycles in (2.2) are identically distributed, so are $R_{1},R_{2},\ldots$ . The following theorem states that the total fraction of time that the process $(X_{n})$ spends in the set $B$ is proportional to the expected time spent in $B$ between two consecutive inward crossings into $A$ . Define the frequency of recurrence $\alpha_{A}:=\mathbb{P}_{\mu}(X_{0}\not\in A,X_{1}\in A)$ .

Theorem 1.

Let $(X_{n})_{n\in\mathbb{N}}$ be a positive Harris recurrent Markov chain and let $\mu$ denote its unique stationary probability measure. Let $A$ , $B$ be measurable sets such that $\mu(A)\in(0,1)$ . Let $L_{1}$ be as defined in (2.3), $R_{1}$ as defined in (2.4), and $T_{B}:=\mathbb{E}_{\mu}R_{1}$ . Then $\mathbb{E}_{\mu}L_{1}<\infty$ ,

$$\mu(B) = \alpha_{A}\cdot T_{B}, \tag{2.5}$$

and $\alpha_{A}=(\mathbb{E}_{\mu}L_{1})^{-1}$ .

Proof.

See Appendix A. ∎

The factorization (2.5) of $\gamma$ from Theorem 1 is the starting point from which we develop our steady-state rare-event simulation algorithm in Section 3.
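The factorization can be illustrated on a single simulated path: the number of complete cycles per unit time plays the role of $\alpha_{A}$, and the average time in $B$ per complete cycle plays the role of $T_{B}$. The sketch below assumes a toy AR(1) chain with $A=\{x\leq 0\}$ and $B=\{x\geq u\}$; the model and the helper names `simulate_path` and `cycle_decomposition` are hypothetical and only serve the illustration.

```python
import random


def simulate_path(n_steps, a=0.5, sigma=1.0, seed=1):
    """Path of the toy chain X_{n+1} = a*X_n + sigma*Z (assumed example model)."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n_steps):
        x = a * x + sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def cycle_decomposition(path, in_A, in_B):
    """Return sample analogues (alpha_hat, T_B_hat, gamma_hat) from one path."""
    # Inward crossings: indices n with X_{n-1} not in A and X_n in A.
    crossings = [n for n in range(1, len(path))
                 if not in_A(path[n - 1]) and in_A(path[n])]
    n_cycles = len(crossings) - 1
    if n_cycles < 1:
        raise ValueError("path contains no complete recurrency cycle")
    start, end = crossings[0], crossings[-1]
    # Total time in B over all complete cycles (the sum of the R_k).
    time_in_B = sum(in_B(x) for x in path[start:end])
    alpha_hat = n_cycles / (end - start)   # reciprocal of the mean cycle length
    T_B_hat = time_in_B / n_cycles         # mean time in B per cycle
    gamma_hat = time_in_B / (end - start)  # fraction of time spent in B
    return alpha_hat, T_B_hat, gamma_hat
```

By construction the product `alpha_hat * T_B_hat` coincides with `gamma_hat` up to rounding, which is the sample version of (2.5). For this particular Gaussian chain with $a=0.5$, the crossing frequency $\mathbb{P}_{\mu}(X_{0}>0,X_{1}\leq 0)$ equals $1/4-\arcsin(a)/(2\pi)=1/6$, which gives a simple sanity check for `alpha_hat`.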

We note that an analogue of Theorem 1 holds for regenerative processes. Dissection of a Markov chain into regeneration cycles has one clear advantage over dissection into recurrency cycles, namely, the regeneration cycles are independent. Using this independence, one can easily infer the variance of an estimator based on regeneration cycles. Nonetheless, it is more attractive to use recurrency cycles than regeneration, as the latter is harder to implement and has a (much) longer expected cycle length. Moreover, in situations where it is possible to sample from the stationary distribution $\mu$ , one can simulate independent paths until the first recurrency cycle has ended, such that the resulting cycles will be independent as well.

3 Recurrent Splitting Algorithm

Our algorithm essentially relies on the result from Theorem 1, namely the representation of $\gamma$ as a product of two quantities. Thus, we divide our algorithm into two stages: first there is the estimation of $\alpha_{A}$ (the frequency of recurrence, equal to the reciprocal of the expected cycle length), and secondly the estimation of $T_{B}$ (the expected time spent in set $B$ within a recurrency cycle).

3.1 Estimation of $\alpha_{A}$

While it is relatively straightforward to estimate $\alpha_{A}$ (for example with a crude Monte Carlo method), the choice of the recurrency set $A$ is non-trivial. In this section we assume that $A$ has already been chosen; the choice of $A$ is discussed in Section 4.2.

In typical situations one can generate sample paths of $X_{n}$ by simulation but it is not possible to exactly sample from the stationary distribution. Even though the law of $X_{n}$ converges to $\mu$ weakly, as $n\to\infty$ , at any fixed time $n$ , the law of $X_{n}$ is not exactly $\mu$ . Perhaps the most straightforward method to estimate $\alpha_{A}$ in this setting is the method of batch-means. It relies on dissecting a path of the Markov chain of length $N$ into $m\in\mathbb{N}$ batches of equal length, and calculating the sample frequency of entering the set $A$ for each batch. More specifically, with $M:=[N/m]$ ,

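$$\widehat{\alpha}_{k} := \frac{1}{M}\sum_{n=(k-1)M+1}^{kM}\mathds{1}\{X_{n-1}\not\in A,\ X_{n}\in A\}, \qquad k=1,\ldots,m,$$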

and then the batch-means estimator is

$$\widehat{\alpha}_{A}^{\text{BM}} := \frac{1}{m}\sum_{k=1}^{m}\widehat{\alpha}_{k}. \tag{3.1}$$

Let $s_{\text{BM}}^{2}$ be the sample variance of $\widehat{\alpha}_{1},\ldots,\widehat{\alpha}_{m}$ and $t_{m-1}$ a Student’s t distribution with $m-1$ degrees of freedom. Then, due to the ‘near independence’ between the batches, under appropriate regularity assumptions,

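$$\sqrt{m}\,\big(\widehat{\alpha}_{A}^{\text{BM}}-\alpha_{A}\big)/s_{\text{BM}} \xrightarrow{\textnormal{d}} t_{m-1}, \tag{3.2}$$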

as $N\to\infty$ , with ‘ $\xrightarrow{\textnormal{d}}$ ’ denoting convergence in distribution. For more details and background, we refer to e.g. Asmussen and Glynn (2007).
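The batch-means procedure can be sketched in a few lines. The example assumes a toy AR(1) chain and $A=\{x\leq 0\}$; the function names `simulate_ar1` and `batch_means_alpha` are hypothetical, and the standard error returned is $s_{\text{BM}}/\sqrt{m}$, to be combined with a quantile of $t_{m-1}$ (from a table or a statistics library) for a confidence interval.

```python
import math
import random
import statistics


def simulate_ar1(n_steps, a=0.5, sigma=1.0, seed=2):
    """Toy chain X_{n+1} = a*X_n + sigma*Z started at X_0 = 0 (assumed model)."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(n_steps):
        x = a * x + sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path  # path[n] = X_n for n = 0, ..., n_steps


def batch_means_alpha(path, in_A, n_batches):
    """Batch-means estimate of alpha_A and its standard error s_BM / sqrt(m)."""
    M = (len(path) - 1) // n_batches  # batch length
    batch_freqs = []
    for k in range(n_batches):
        first = k * M + 1
        # Count inward crossings {X_{n-1} not in A, X_n in A} inside batch k.
        crossings = sum(1 for n in range(first, first + M)
                        if not in_A(path[n - 1]) and in_A(path[n]))
        batch_freqs.append(crossings / M)
    alpha_hat = statistics.mean(batch_freqs)
    s_bm = statistics.stdev(batch_freqs)  # sample standard deviation of batches
    return alpha_hat, s_bm / math.sqrt(n_batches)
```

The number of batches trades off two effects: more batches give more degrees of freedom, while longer batches make the 'near independence' between batches more plausible.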

We remark that when an exact sampling procedure from $\mu$ is available, then it might be more efficient to use the following Monte Carlo estimator. Generate $M$ independent pairs

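$$(X_{0}^{(1)},X_{1}^{(1)}),\ \ldots,\ (X_{0}^{(M)},X_{1}^{(M)})$$

with (for all $i=1,\ldots,M$) $X^{(i)}_{0}\sim\mu$ and $X^{(i)}_{1}$ distributed according to the dynamics of the Markov chain (2.1) conditional on the value of $X^{(i)}_{0}$. The Monte Carlo estimator

$$\widehat{\alpha}_{A}^{\text{MC}} := \frac{1}{M}\sum_{i=1}^{M}\mathds{1}\{X_{0}^{(i)}\not\in A,\ X_{1}^{(i)}\in A\} \tag{3.3}$$

is unbiased, $\mathrm{Var}\,\widehat{\alpha}^{\text{MC}}_{A}=\alpha_{A}(1-\alpha_{A})/M$, and, as $M\to\infty$,

$$\sqrt{M}\,\big(\widehat{\alpha}_{A}^{\text{MC}}-\alpha_{A}\big)/s_{\text{MC}} \xrightarrow{\textnormal{d}} \mathcal{N}(0,1), \tag{3.4}$$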

with $s_{\text{MC}}^{2}$ the sample variance.

Whether exact simulation from $\mu$ is available or not, both methods allow for the construction of confidence intervals based on the weak convergence results (3.2) and (3.4). It should be clear that the set $A$ should be chosen such that $\alpha_{A}$ is not prohibitively small, so that the methods (3.1) and (3.3) are computationally efficient. Otherwise, the estimation of $\alpha_{A}$ would be a rare event simulation problem itself (which we obviously want to avoid).
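When exact sampling from $\mu$ is possible, the independent-pairs estimator is straightforward to code. The sketch below assumes a toy AR(1) chain, for which $\mu=\mathcal{N}(0,\sigma^{2}/(1-a^{2}))$ is known in closed form, and $A=\{x\leq 0\}$; the function name `mc_alpha_exact` and the parameter values are illustrative assumptions.

```python
import math
import random


def mc_alpha_exact(n_pairs, a=0.5, sigma=1.0, seed=3):
    """Independent-pairs estimate of alpha_A for A = {x <= 0} (assumed toy model)."""
    rng = random.Random(seed)
    stat_sd = sigma / math.sqrt(1.0 - a * a)  # stationary standard deviation
    hits = 0
    for _ in range(n_pairs):
        x0 = rng.gauss(0.0, stat_sd)               # X_0 ~ mu (exact sample)
        x1 = a * x0 + sigma * rng.gauss(0.0, 1.0)  # one step of the chain
        hits += (x0 > 0.0) and (x1 <= 0.0)         # 1{X_0 not in A, X_1 in A}
    alpha_hat = hits / n_pairs
    # Plug-in standard error based on Var = alpha * (1 - alpha) / M.
    std_err = math.sqrt(alpha_hat * (1.0 - alpha_hat) / n_pairs)
    return alpha_hat, std_err
```

An approximate 95% confidence interval is then `alpha_hat ± 1.96 * std_err`, using the normal limit rather than a Student's t distribution.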

3.2 Estimation of $T_{B}$

The second stage of the algorithm concerns the estimation of $T_{B}$, as defined in Theorem 1. This step is the more challenging one, as the quantity $T_{B}$ is expected to be very small. We therefore resort to rare-event simulation methods. For clarity of exposition, throughout this section we assume that the chain $(X_{n})_{n\in\mathbb{N}}$ is stationary, $S_{0}=0$ and we drop the subscript in $\mathbb{P}_{\mu}$ and $\mathbb{E}_{\mu}$ (i.e., we write simply $\mathbb{P}$ and $\mathbb{E}$, respectively). We also assume that we can sample from the distribution of the cycle starting point $X_{1}^{A}$ (note that $X_{1}^{A},X_{2}^{A},\ldots$ are all identically distributed). If we cannot, then we sample from $X_{1}^{A}$ approximately; this is discussed in Section 3.3. We first introduce some notation; we define $p_{B}:=\mathbb{P}(\tau_{B}<\tau^{\text{in}}_{A})$, with

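$$\tau_{B} := \inf\{n>0 : X_{n}\in B\}, \qquad \tau_{A}^{\text{in}} := S_{1} = \inf\{n>0 : X_{n-1}\not\in A,\ X_{n}\in A\},$$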

and

$$R_{+} :\stackrel{\rm d}{=} \big(R_{1}\,|\,R_{1}>0\big),$$

with ‘ $\stackrel{{\scriptstyle\mbox{\rm\tiny d}}}{{=}}$ ’ denoting equality in distribution. Note that $\tau^{\text{\rm in}}_{A}-1$ marks the end of the first recurrency cycle. Since $\{R_{1}>0\}=\{\tau_{B}<\tau^{\text{in}}_{A}\}$ , $p_{B}$ is the probability of reaching $B$ within a cycle, and $R_{+}$ is a random variable distributed as the total time spent in the set $B$ within a cycle conditioned on the cycle reaching set $B$ . As was noted in Garvels (2000),

$$\mathbb{E}(R_{1}) = \mathbb{P}(R_{1}>0)\cdot\mathbb{E}(R_{1}\mid R_{1}>0).$$

This entails that

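$$T_{B} = \mathbb{P}(\tau_{B}<\tau_{A}^{\text{in}})\cdot\mathbb{E}(R_{1}\mid\tau_{B}<\tau_{A}^{\text{in}}) = p_{B}\cdot\mathbb{E}R_{+}.$$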

The estimation of $p_{B}$ is a classic rare-event simulation problem, for which various methods have been developed. Following Garvels (2000), we propose to use a Multilevel Splitting (MLS) algorithm to estimate $T_{B}$ (but, as we mentioned before, other approaches could be followed as well). There are a number of variations of the MLS algorithm; we chose to rely on its simplest version (called ‘Fixed Splitting’). The following exposition aligns with Amrein and Künsch (2011).

As mentioned, the naïve Monte Carlo method is inefficient for the estimation of small probabilities, because of the computational effort wasted on simulating irrelevant paths. The core idea behind the MLS method is to split the path of the process when it approaches $B$ . This way, we have more control over the simulation, by forcing the process into interesting regions. In order to implement the MLS algorithm, one must first choose an importance function $H:\mathbb{R}^{d}\to[0,1]$ which assigns an importance value to every possible state. $H$ should be chosen such that $H(x)=1$ if and only if $x\in B$ and $H(x)=0$ for $x\in A$ . We postpone the discussion about the choice of the importance function to Section 4.2.
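For sets of the form $A=\{x_{1}\leq 0\}$ and $B=\{x_{1}\geq u\}$ with $u>0$, one simple function satisfying both requirements is the first coordinate rescaled by $u$ and clipped to $[0,1]$. The sketch below is only an illustration of the two constraints on $H$, not a recommended choice; the name `make_importance` is hypothetical.

```python
def make_importance(u):
    """Return H(x) = clip(x_1 / u, 0, 1) for A = {x_1 <= 0}, B = {x_1 >= u}."""
    if u <= 0:
        raise ValueError("u must be positive so that A and B are disjoint")

    def H(x):
        # x is a sequence of coordinates; only the first one is used here.
        return min(max(x[0] / u, 0.0), 1.0)

    return H
```

With this choice $H(x)=1$ exactly when $x_{1}\geq u$, and $H(x)=0$ whenever $x_{1}\leq 0$, so every state with positive importance lies outside $A$.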

We now formally introduce the MLS algorithm. First divide the interval $[0,1]$ into $m$ subintervals with endpoints:

$$0=\ell_{0}<\ell_{1}<\ldots<\ell_{m}=1,$$

and define the corresponding stopping times and events

$$\tau_{k} := \inf\{n\geq 0 : H(X_{n})\geq\ell_{k}\}, \qquad D_{k} := \{\tau_{k}<\tau_{A}^{\text{in}}\} \tag{3.7}$$

for $k\in\{0,\ldots,m\}$. Note that $\tau_{k}$ is the first time an importance value greater than or equal to $\ell_{k}$ has been reached; in particular $\tau_{m}=\tau_{B}$ and $\tau_{0}=0$, so that $X_{\tau_{0}}\stackrel{\rm d}{=}X^{A}_{1}$. Finally let

$$p_{k} := \mathbb{P}(D_{k}\mid D_{k-1}), \qquad k\in\{1,\ldots,m\},$$

and $p_{0}=1$, to which we refer as conditional probabilities. From the definition (3.7) we have $\mathbb{P}(D_{m})=p_{B}$ and since $D_{0}\supseteq D_{1}\supseteq\ldots\supseteq D_{m}$, we conclude

$$p_{B} = \prod_{k=0}^{m}p_{k}.$$

Finally, define splitting factors $n_{0},n_{1},\ldots,n_{m}\in\mathbb{N}$, representing the number of independent continuations of the process that are sampled when reaching the respective importance levels. Here $n_{0}$ plays a special role, as it is the number of independent MLS replications; the final estimator will be the mean of $n_{0}$ independent MLS estimators. By virtue of this independence, we are able to estimate the variance of the final estimator. For simplicity, in the following it is assumed that $n_{0}=1$.

Algorithm 1 (Multilevel Splitting).

1. Set $k:=0$, $r_{0}:=1$, sample $X^{1}_{0}\sim X^{A}_{1}$.

2. In the $k$-th stage we have a sample of $r_{k}$ entrance states $(X^{1}_{k},\ldots,X^{r_{k}}_{k})$, where we denote

[TABLE]

For each state $X^{i}_{k}$ generate $n_{k}$ independent path continuations until $\min\{\tau_{k+1},\tau^{\text{in}}_{A}\}$. The number of paths for which the event $D_{k+1}$ occurred is denoted by $r_{k+1}$. Store all $r_{k+1}$ states $X^{i}_{k+1}$, for which the event $D_{k+1}$ occurred, in memory.

3. If $r_{k+1}=0$, then stop the algorithm and put $\widehat{p}_{B}:=0$, $\widehat{T}_{B}:=0$.

4. If $k<m-1$, then increase $k$ by one and go back to step 2; otherwise put

[TABLE]

5. If $r_{m}=0$, then return $\widehat{T}_{B}=0$; otherwise, for each state $X^{i}_{m}$ generate $n_{m}$ independent path continuations until $\tau^{\text{in}}_{A}$. For each of these $r_{m}n_{m}$ continuations record the time spent in set $B$:

[TABLE]

Calculate the total time spent in $B$ by

[TABLE]

6. The final estimator is

[TABLE]
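The steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the chain is a hypothetical one-dimensional AR(1) recursion, $A=\{x\leq 0\}$, $B=\{x\geq u\}$, the importance function is the normalised distance $x/u$, and $n_{0}=1$. The normalisations $\widehat{p}_{B}=r_{m}/\prod_{k=0}^{m-1}n_{k}$ and $\widehat{T}_{B}=(\text{total time in }B)/\prod_{k=0}^{m}n_{k}$ are our reading of the algorithm, since the displays (3.8) and (3.9) are not reproduced here.

```python
import random

A_COEF, NOISE = 0.9, 0.5   # hypothetical AR(1) chain: X' = 0.9 X + 0.5 Z


def step(x, rng):
    return A_COEF * x + NOISE * rng.gauss(0.0, 1.0)


def importance(x, u):
    """Distance-based H: 0 on A = {x <= 0}, 1 exactly on B = {x >= u}."""
    return min(max(x / u, 0.0), 1.0)


def run_to_level(x, level, u, rng):
    """Continue from x until H >= level (return that state), or return
    None if an inward crossing into A happens first."""
    while importance(x, u) < level:
        x_new = step(x, rng)
        if x > 0.0 and x_new <= 0.0:   # inward crossing: the cycle ends
            return None
        x = x_new
    return x


def time_in_b(x, u, rng):
    """Number of steps spent in B, starting at x, until the cycle ends."""
    count = 0
    while True:
        if x >= u:
            count += 1
        x_new = step(x, rng)
        if x > 0.0 and x_new <= 0.0:
            return count
        x = x_new


def fixed_splitting(x0, levels, n_split, u, rng):
    """One Fixed Splitting replica with n_0 = 1.
    levels = [l_1, ..., l_m] with l_m = 1; n_split = [n_1, ..., n_m].
    Returns (p_hat, T_hat)."""
    factors = [1] + list(n_split)            # n_0, n_1, ..., n_m
    states, denom = [x0], 1
    for k, level in enumerate(levels):       # stage k aims at level l_{k+1}
        denom *= factors[k]
        survivors = []
        for x in states:
            for _ in range(factors[k]):
                y = run_to_level(x, level, u, rng)
                if y is not None:
                    survivors.append(y)
        if not survivors:                    # extinction: both estimates are 0
            return 0.0, 0.0
        states = survivors
    p_hat = len(states) / denom
    n_m = factors[-1]
    total = sum(time_in_b(x, u, rng) for x in states for _ in range(n_m))
    return p_hat, total / (denom * n_m)
```

Averaging the output of `fixed_splitting` over $n_{0}$ independent calls, each started from a fresh sample of the cycle origin, gives the final estimator together with a sample variance.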

Theorem 2.

The estimators $\widehat{p}_{B}$ and $\widehat{T}_{B}$ , as defined in (3.8) and (3.9), are unbiased estimators for $p_{B}$ and $T_{B}$ respectively.

The following proof is based on notes of the Summer School in Monte Carlo Methods for Rare Events that took place at Brown University, Providence RI, USA in June 2016 (authored by J. Blanchet, P. Dupuis, and H. Hult). It is noted that various alternative derivations can be constructed; see e.g. Asmussen and Glynn (2007).

Proof of Theorem 2.

Let $\overline{X}_{i,j}$ label all descendants of the original particle, with $i$ indexing time and $j$ indexing the descendant. All descendants $\overline{X}_{\cdot,j}$ are identically distributed (but not independent). Now suppose that each particle has an evolving weight $w_{i,j}$. Concretely, this means that when a particle crosses a threshold $\ell_{k}$, it is split into $n_{k}$ particles and its weight is divided equally among its descendants (i.e., each of them obtains a share $1/n_{k}$ of $w_{i,j}$). Each particle that reaches the set $B$ has been split $m$ times, and its weight is thus $1/\prod_{k=1}^{m}n_{k}$. Particles that did not reach the set $B$ are split artificially (keeping them in $A$) at the remaining thresholds, so that the total number of particles is $\prod_{k=1}^{m}n_{k}$, each of equal weight. Then, using the fact that the descendants are identically distributed, we obtain

[TABLE]

Analogously, $\mathbb{E}\widehat{p}_{B}=p_{B}$ , which ends the proof. ∎

We remark that, with $r_{1},\ldots,r_{m}$ as defined in Algorithm 1, the same arguments as the ones featuring in the proof of Thm. 2 imply the unbiasedness of the estimators for $\mathbb{P}(D_{k})$ :

[TABLE]

3.3 Estimation of $\gamma$

As already mentioned at the beginning of Section 3, the final estimator for $\gamma$ is the product $\widehat{\gamma}:=\widehat{\alpha}_{A}\cdot\widehat{T}_{B}$. In the description of the MLS algorithm, in Step 1, we tacitly assumed that we can sample the recurrency cycle origin $X^{A}_{1}$. As this is typically not the case, we sample $X_{1}^{A}$ approximately, in the following way. During the estimation of $\alpha_{A}$ with the batch-means method (3.1) we store the state at each inwards crossing to the set $A$, and we bootstrap from these states in Step 1 of Algorithm 1. We thus end up with the following algorithm for estimating the rare-event probability $\gamma$, as defined in (1.1).

Algorithm 2 (Recurrent Multilevel Splitting).

1. Choose a recurrency set $A$ satisfying the assumptions of Theorem 1 and an importance function $H:\mathbb{R}^{d}\to[0,1]$.

2. Estimate $\alpha_{A}$ using the batch-means method (3.1), and return $\widehat{\alpha}_{A}$. Store the locations of the cycle origins in the set $S_{\text{rec}}:=\{X^{A}_{1},X^{A}_{2},\ldots\}$.

3. Estimate $T_{B}$ using the Multilevel Splitting algorithm (Algorithm 1); in Step 1 sample the origin $X_{0}^{1}$ uniformly from $S_{\text{rec}}$. The output is $\widehat{T}_{B}$.

4. The final estimator is

$$\widehat{\gamma}:=\widehat{\alpha}_{A}\cdot\widehat{T}_{B}.\qquad(3.11)$$
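The product form of the estimator can be checked on a single trajectory. The sketch below is an illustration under assumptions, not the batch-means procedure (3.1) itself: it takes $\alpha_{A}$ to be the long-run number of recurrency cycles per time step, uses a hypothetical AR(1) chain with $A=\{x\leq 0\}$, and restricts the trajectory to complete cycles. On that window, the product of the two estimates coincides with the plain fraction of time spent in $B$. The returned list of origins plays the role of $S_{\text{rec}}$, from which Step 3 would resample.

```python
import math
import random


def simulate(n_steps, rng, a=0.9, s=0.5, x0=0.0):
    """Trajectory of a hypothetical AR(1) chain X' = a X + s Z."""
    path = [x0]
    for _ in range(n_steps):
        path.append(a * path[-1] + s * rng.gauss(0.0, 1.0))
    return path


def cycle_statistics(path, u):
    """Split `path` into recurrency cycles of A = {x <= 0} and return
    (alpha_hat, T_hat, origins) for the set B = {x >= u}."""
    crossings = [n for n in range(1, len(path))
                 if path[n - 1] > 0.0 and path[n] <= 0.0]
    if len(crossings) < 2:
        raise ValueError("need at least one complete cycle")
    first, last = crossings[0], crossings[-1]
    n_cycles = len(crossings) - 1
    steps_in_b = sum(1 for x in path[first:last] if x >= u)
    alpha_hat = n_cycles / (last - first)    # cycles per time step
    t_hat = steps_in_b / n_cycles            # mean time in B per cycle
    origins = [path[n] for n in crossings]   # states at inward crossings
    return alpha_hat, t_hat, origins
```

A call such as `random.Random(0).choice(origins)` then mimics the uniform resampling of the cycle origin in Step 3.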

It is assumed that the set $S_{\text{rec}}$ is ‘representative enough’ to make sure that resampling from $S_{\text{rec}}$ can be interpreted as taking i.i.d. samples of $X^{A}_{1}$ in the stationary regime. Under this assumption, the estimators $\widehat{\alpha}_{A}$ , $\widehat{T}_{B}$ are independent and the variance of $\widehat{\gamma}$ can be inferred using the sample variance of $\widehat{\alpha}_{A}$ and $\widehat{T}_{B}$ . However, in our numerical experiments in Section 5 we do not assume this independence to get an estimate of the variance. Instead we run Algorithm 2 multiple times, resulting in multiple estimates $\widehat{\gamma}$ from which we obtain a reliable estimate for the variance of $\widehat{\gamma}$ . For implementation details, see Section 5.1.

4 Choice of Parameters

In a rare-event setting, both the expectation and the variance of an estimator are very small, so that the variance itself is not a meaningful measure of accuracy. Instead, it makes sense to look at its value relative to the expectation, i.e., the Relative Error (RE):

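$${\rm RE}(\widehat{\gamma}):=\frac{\sqrt{\mathbb{V}{\rm ar}\,\widehat{\gamma}}}{\mathbb{E}\,\widehat{\gamma}}.$$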

An estimator with a lower relative error is not necessarily preferred; a more meaningful criterion involves the corresponding total computational time (or: workload), which we denote $W(\widehat{\gamma})$ ; see the beginning of Section 5.1 for more details. In the following section we consider a setting, in which we can derive optimal parameters of the MLS estimator by minimizing the workload under a constraint on the relative error (i.e., ${\rm RE}^{2}(\widehat{\gamma})\leq\rho$ for a given accuracy $\rho>0$ ).

4.1 Simplified Setting

Due to possible dependencies between the numbers of successes $r_{1},\ldots,r_{m}$, there is no tractable general expression for the variance of the MLS estimator. A typical approach in the literature is to assume some form of independence between them, and to study the variance under that assumption. With $\tau_{k},D_{k}$ defined as in (3.7) and $R_{+}$ as defined in (3.5), we assume

${\rm (I)}$ for all $k\in\{1,\ldots,m\}$,

[TABLE]

${\rm (II)}$ for all $X_{\tau_{m}}$,

[TABLE]

Assumption (${\rm I}$) has been proposed in Amrein and Künsch (2011). It states that the probability of reaching the $k$-th importance level, given that the $(k-1)$-st level has been reached, is constant over all possible entrance states. Assumption (${\rm II}$) states that the time spent in the rare set $B$ within a cycle, conditioned on the set $B$ having been reached, does not depend on the position of the entrance state to $B$. In principle, we have the possibility to choose the set $A$ and the importance function $H(\cdot)$ such that Assumption (${\rm I}$) is satisfied; see the discussion in Section 4.2. Whether Assumption (${\rm II}$) holds or not is effectively problem specific, in the sense that we do not have control over it because the set $B$ is given. We argue that for a large class of problems there exists a most likely point of entry $X_{\tau_{B}}$ to $B$, which implies (${\rm II}$) approximately. We emphasize that Assumptions (${\rm I}$-${\rm II}$) are not required for the RMS algorithm to work, but if they are fulfilled, optimality results can be derived. Under (${\rm I}$-${\rm II}$) we find the squared relative error of $\widehat{T}_{B}$:

[TABLE]

We derive (4.1) in Appendix A. Following the approach of Amrein and Künsch (2011), in Appendix B we derive the optimal parameters $m,p_{1},\ldots,p_{m},n_{0},\ldots,n_{m}$ for the MLS algorithm; here, optimality refers to the property that the expected computational time is minimized under the constraint on the relative error ${\rm RE}^{2}(\widehat{T}_{B})\leq\rho$ for a given accuracy $\rho>0$. It is worth noting that the optimal number of thresholds $m$ is roughly equal to $|\log p_{B}|$, with conditional probabilities $p_{k}$ all equal to approximately $0.2$. What is more, the optimal solution satisfies $n_{k}p_{k+1}=1$ for $k\in\{1,\ldots,m-1\}$, so we can choose $n_{k}=5$. This so-called balanced growth (see Garvels (2000)) ensures that, on average, $n_{0}$ paths are sampled in each stage of the algorithm (with the exception of the last stage, which corresponds to the estimation of $R_{+}$). The optimal workload reads

[TABLE]

with the constant $c$ defined below display (B.2). As already mentioned, a rigorous derivation of this result can be found in Appendix B, and the exact values of the optimal parameters $m,p_{1},\ldots,p_{m},n_{0},\ldots,n_{m}$ are given in Eq. (B.2). In all our numerical experiments in Section 5, we spend an initial portion of computational time on a rough estimation of $p_{B}$ and ${\rm RE}(R_{+})$ in order to find a sufficiently accurate approximation of the optimal parameters. See Section 5.1 for a more detailed account of the implementation.

The optimal workload in (4.2) is proportional to $(\log p_{B})^{2}$ , which offers a huge gain in efficiency, compared with the Monte Carlo method (C.4) (whose workload is inversely proportional to $p_{B}$ ). We derive efficiency results in Appendix C; in particular, Theorem 3 proves that RMS is logarithmically efficient under specific assumptions.
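The rule of thumb above (conditional probabilities near $0.2$, splitting factors near $5$) can be turned into a quick parameter guess. The sketch below is not the optimal solution (B.2), which is not reproduced here; it only uses the identity $p_{B}=\prod_{k}p_{k}$ with equal $p_{k}$, so that $m\approx\log p_{B}/\log 0.2$, together with the balanced-growth condition $n_{k}p_{k+1}=1$. The function names are illustrative.

```python
import math


def rough_mls_parameters(p_b: float, p_target: float = 0.2):
    """Heuristic (m, p_k, n_k) with equal conditional probabilities.
    m is chosen so that p_k = p_b ** (1 / m) is close to p_target."""
    if not (0.0 < p_b < 1.0 and 0.0 < p_target < 1.0):
        raise ValueError("probabilities must lie in (0, 1)")
    m = max(1, round(math.log(p_b) / math.log(p_target)))
    p_k = p_b ** (1.0 / m)
    n_k = max(1, round(1.0 / p_k))   # balanced growth: n_k * p_k close to 1
    return m, p_k, n_k


def scaling_ratio(p_b: float) -> float:
    """Ratio (1 / p_B) / (log p_B)^2 of the two workload scalings,
    with all constants ignored."""
    return (1.0 / p_b) / math.log(p_b) ** 2
```

For $p_{B}=10^{-7}$ this heuristic gives $m=10$ thresholds with $p_{k}\approx 0.2$ and $n_{k}=5$, and the scaling ratio grows rapidly as $p_{B}$ decreases.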

4.2 Choice of Recurrency Set and Importance Function

In Section 4.1 we have seen that under Assumptions (${\rm I}$-${\rm II}$), the MLS method is particularly efficient. As already mentioned, the extent to which Assumption (${\rm I}$) is fulfilled depends on both the choice of the recurrency set and the importance function; we thus aim to choose $A$ and $H(\cdot)$ in such a way that (${\rm I}$) is approximately satisfied. At the same time, we would like to choose $A$ so as to maximize $\alpha_{A}$, so that the batch-means estimator $\widehat{\alpha}_{A}$ (as defined in (3.1)) is computationally efficient as well. These two requirements are often conflicting and one must in the end strike a proper balance between them.

For each $k$ , Assumption ( ${\rm I}$ ) concerns the choices of both $A$ and $H(\cdot)$ . However, it implies a property that relates to the choice of $A$ only, namely, the probability of reaching set $B$ within a recurrency cycle is independent of the initial point:

[TABLE]

Thus, Assumption ( ${\rm I}$ ) implies that

[TABLE]

informally, there is independence between the origin of the cycle on one hand, and the random variable $\mathbf{1}\{R_{1}>0\}$ (indicating whether set $B$ has been reached within a cycle) on the other hand. Intuitively, the smaller the set $A$ is, the more closely (4.3) is satisfied but also, the smaller $\alpha_{A}$ is. In particular, (4.3) trivially holds when $A$ consists of one point only, but then $\alpha_{A}=0$. In Section 5.2.3 we give an example of a setting in which (4.3) is violated, but one can imagine that in many situations (4.3) ‘roughly holds’. Thus, for practical purposes, it is desirable that the set $A$ maximizes $\alpha_{A}$ while it also approximately satisfies (4.3). In full generality, it is not an easy task to fulfill both aims.

A poorly chosen importance function will lead the split particles into uninteresting regions, or it will force the paths to hit the rare set in an unlikely fashion. This potentially leads to low efficiency of the MLS algorithm. Given that we have already chosen a set $A$ satisfying (4.3), there exists an importance function guaranteeing ( ${\rm I}$ ) to be satisfied:

[TABLE]

Of course this insight is of theoretical value only: if we knew the quantity on the right hand side, then we would not even have to use the MLS algorithm. However, also

[TABLE]

with $g:[0,1]\to\mathbb{R}$ any increasing function, satisfies ( ${\rm I}$ ). This already gives a helpful guideline for the choice of $H$ . Namely, the states from which it is more likely to visit $B$ before returning to $A$ should have larger importance. When an approximation or asymptotic behavior of $\mathbb{P}_{x}(\tau_{B}\leq\tau_{A}^{\text{in}})$ is available it might be useful to use it as an importance function. In Dean and Dupuis (2009) a large-deviations based approach to the choice of importance function is discussed.

Sometimes, a so-called distance-based importance function can be a good choice. This function is basically

[TABLE]

normalized in such a way that $H(x)=1$ iff $x\in B$ and $H(x)=0$ for $x\in A$. This importance function can be a good choice for systems whose paths conditioned on $\{\tau_{B}<\tau^{\text{in}}_{A}\}$ are effectively gradually driven towards $B$. In contrast, a distance-based importance function will be a poor choice for systems for which it is most likely to reach the rare set $B$ by first moving away from it. In Section 5 we include examples of problems for which a distance-based importance function is a good choice, but also one in which it does not work well.

In some cases we may have already chosen a particular shape of the set $A$ (e.g. an ellipsoid, half-space, or multidimensional cube) which can be parametrized by a single parameter $\ell\in\mathbb{R}$ . Even better, if we have already chosen an importance function, then a level set

[TABLE]

could be a good choice. In any case, we should choose $\ell$ to maximize $\alpha_{A(\ell)}$ . We propose to use a crude estimator to find $\ell^{*}$ : we find a maximizer of $\alpha_{A(\ell)}$ by putting

[TABLE]
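A crude search of this kind can be sketched as follows. The exact estimator in (4.4) is not reproduced here, so the code below is an assumption-laden illustration: it takes the first coordinate of a pilot trajectory, counts the inward crossings of the half-space $\{x_{1}\leq\ell\}$ for each candidate $\ell$ on a user-supplied grid, and returns the candidate with the most crossings (the trajectory length is the same for every candidate, so this also maximizes the crossing frequency). The names `count_inward_crossings` and `choose_level` are hypothetical.

```python
def count_inward_crossings(xs, level):
    """Number of times the sequence xs moves from {x > level} to {x <= level}."""
    return sum(1 for prev, cur in zip(xs, xs[1:])
               if prev > level and cur <= level)


def choose_level(xs, candidates):
    """Candidate level with the largest number of inward crossings
    (ties are resolved in favour of the first candidate)."""
    return max(candidates, key=lambda level: count_inward_crossings(xs, level))
```

In practice `xs` would be a pilot run of the chain and `candidates` a grid covering the bulk of the stationary distribution of the first coordinate.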

Quantile validation. While it is not clear in general how to choose $A$ such that it satisfies (4.3), one can statistically test whether (4.3) holds after the choice of $A$ has been made. We now propose one particular method to do so that can be used in combination with the RMS algorithm. In Step 2 of Algorithm 2 calculate and store the maximum importance attained within cycles, i.e.,

[TABLE]

with $\mathcal{C}_{k}$ as defined in (2.2). Assuming a good importance function has been chosen, the cycle origins corresponding to the highest importance should also be approximately distributed as $X^{A}_{+}$. This gives us a means of comparing the distributions of $X^{A}_{1}$ and $X^{A}_{+}$. Let $N_{\text{rec}}$ be the total number of pairs $(X^{A}_{k},H^{\text{max}}_{k})$ obtained in Step 2 of Algorithm 2. Let

[TABLE]

be a permutation ordering $(H^{\text{max}}_{k})_{1\leq k\leq N_{\text{rec}}}$ into a non-decreasing sequence, i.e.,

[TABLE]

Now choose a $q\in(0,1)$ and let

[TABLE]

That is, $S^{q}_{\text{rec}}$ is a subset of $S_{\text{rec}}$ which contains the cycle origins corresponding to the fraction $q$ of values with highest importance. In particular $S^{1}_{\text{rec}}=S_{\text{rec}}$ . Then $S_{\text{rec}}$ and $S_{\text{rec}}^{q}$ (for small $q$ ) can be thought of as sets of samples from the random variables $X^{A}_{1}$ and $X^{A}_{+}$ , respectively. Various tests can now be performed, to compare e.g. the means or variances; alternatively QQ-plots can be made, or histograms can be compared.
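The construction of $S^{q}_{\text{rec}}$ amounts to sorting the cycle origins by the maximum importance attained in their cycle and keeping the top fraction $q$. A minimal sketch follows; the rounding convention (keeping $\lceil qN_{\text{rec}}\rceil$ origins) is an assumption, since the defining display is not reproduced here, and the function name is illustrative.

```python
import math


def top_quantile_origins(origins, h_max, q):
    """Origins whose cycles attained the highest maximum importance.
    Keeps ceil(q * N) origins, ordered by non-decreasing h_max."""
    if len(origins) != len(h_max) or not origins:
        raise ValueError("origins and h_max must be non-empty and of equal length")
    if not (0.0 < q <= 1.0):
        raise ValueError("q must lie in (0, 1]")
    order = sorted(range(len(h_max)), key=lambda k: h_max[k])  # non-decreasing
    n_keep = math.ceil(q * len(order))
    return [origins[k] for k in order[len(order) - n_keep:]]
```

The two samples `origins` and `top_quantile_origins(origins, h_max, q)` can then be compared with any two-sample diagnostic, such as histograms or QQ-plots.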

5 Numerical Experiments

The aim of this section is to test the RMS method on a series of specific examples. The examples range from simple cases, where the ground truth is known, to more complicated dynamical systems, where the ground truth is unknown and we can only compare to estimates obtained with Monte Carlo (MC) methods. In Section 5.2.3 we also carefully look into an example where the RMS method (with a naïve choice of the importance function) does not perform that well; we discuss why this was to be expected. It will be seen throughout that RMS is superior to MC in terms of the computational time needed to achieve a desired level of accuracy; in extreme cases, like in Section 5.3, the RMS method can be three orders of magnitude faster than MC (and the efficiency gain is expected to be even greater as $\gamma$ decreases).

5.1 Implementation Details

As already mentioned in Section 4, the relative error of an estimator is not always a meaningful measure of its performance, as it does not take the workload into account. We therefore compare RMS with MC using the ratio of work-normalized squared relative errors; see e.g. Kroese et al. (2013). In particular, we define

[TABLE]

This value can be interpreted as the ratio of the computational cost of MC to the cost of RMS when both methods reach the same accuracy (same relative error). Clearly, the larger ${\rm Eff}(\widehat{\gamma})$ is, the more efficient the RMS method is in comparison with Monte Carlo.
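As a small illustration of this efficiency measure, the sketch below computes the ratio of work-normalized squared relative errors from four numbers. It assumes that (5.1) has the form $W(\widehat{\gamma}^{\rm MC})\,{\rm RE}^{2}(\widehat{\gamma}^{\rm MC})$ divided by $W(\widehat{\gamma})\,{\rm RE}^{2}(\widehat{\gamma})$, which matches the verbal description but is not copied from the display; the function name is illustrative.

```python
def efficiency_ratio(re_mc, work_mc, re_rms, work_rms):
    """Work-normalized squared relative error of MC divided by that of RMS.
    Values above 1 favour RMS."""
    if min(re_mc, work_mc, re_rms, work_rms) <= 0.0:
        raise ValueError("all inputs must be positive")
    return (work_mc * re_mc ** 2) / (work_rms * re_rms ** 2)
```

For example, an MC run with relative error $0.1$ at workload $100$ compared with an RMS run with relative error $0.01$ at workload $10$ gives a ratio of $1000$.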

In each of our experiments, the underlying Markov chain $(X_{n})_{n\in\mathbb{N}}$ represents the numerical solution to a $d$-dimensional Stochastic Differential Equation (SDE) using an explicit Euler scheme, with time step $h>0$; see e.g. Kloeden and Platen (1992). We remark that the time discretization potentially has a significant effect on the underlying value of $\gamma$, especially in the rare-event setting; see the recent systematic study Bisewski et al. (2018). However, in the context of this article we only focus on discrete recursions that arise from numerical time integration schemes. For these recursions we compare RMS with the corresponding Monte Carlo results; we do not aim at studying the behavior as $h\downarrow 0$.

Notice that our method relies on properties of discrete-time processes, in particular in the definition of the recurrency cycles. More specifically, in the corresponding continuous-time model recurrency cycles are ill-defined, as a set may be entered and left infinitely often in a time interval of finite length. This feature could potentially lead to computational issues when working with a small time step $h$ . However, one can easily circumvent the problem and still integrate the process with arbitrarily small $h_{0}$ but store values every $h>h_{0}$ . Note that the discretization error depends only on $h_{0}$ (and not $h$ ), since $h_{0}$ determines the stationary distribution. In fact, this is what we do in Section 5.3, where the process is integrated with $h_{0}=10^{-4}$ but it is stored only every $h=10^{-2}$ .
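The two-time-step idea can be sketched for a scalar SDE as follows. This is an illustration only: the drift function, the constant noise level and all names are placeholders, and the scheme is a plain Euler-Maruyama recursion with fine step $h_{0}$ whose state is stored every `stride` fine steps, so that the stored chain has step $h=\texttt{stride}\cdot h_{0}$.

```python
import math
import random


def euler_subsampled(x0, drift, sigma, h0, stride, n_stored, rng):
    """Integrate dX = drift(X) dt + sigma dW with step h0 and store the
    state every `stride` fine steps (stored step h = stride * h0)."""
    x, stored = x0, [x0]
    for _ in range(n_stored):
        for _ in range(stride):
            x = x + drift(x) * h0 + sigma * math.sqrt(h0) * rng.gauss(0.0, 1.0)
        stored.append(x)
    return stored
```

With `h0 = 1e-4` and `stride = 100` the stored sequence corresponds to $h=10^{-2}$, and the recurrency cycles are defined on the stored sequence only.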

In each experiment the rare event $B$ is a half-space parametrized by $u\in\mathbb{R}$ :

[TABLE]

In other words, the probability under consideration corresponds to the first dimension attaining high values in stationarity:

[TABLE]

for large $u$. Furthermore, in each experiment we choose the recurrency set $A$ to be a half-space parametrized by $\ell$ (where the value of $\ell$ is chosen depending on the particular experiment):

[TABLE]

We use a distance-based importance function, i.e.,

[TABLE]
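For half-spaces in the first coordinate, a distance-based importance function reduces to a clipped linear map. The sketch below assumes $A=\{x_{1}\leq\ell\}$ and $B=\{x_{1}\geq u\}$ with $\ell<u$ and uses the normalisation $(x_{1}-\ell)/(u-\ell)$; this is our reading of the construction, as display (5.4) is not reproduced here, and the function name is illustrative.

```python
def distance_importance(x, ell, u):
    """Clipped linear importance in the first coordinate:
    0 on A = {x_1 <= ell}, 1 exactly on B = {x_1 >= u}."""
    if u <= ell:
        raise ValueError("need ell < u")
    return min(max((x[0] - ell) / (u - ell), 0.0), 1.0)
```

Only the first coordinate enters, which is why this choice ignores any coupling between dimensions.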

We now provide more details on our implementation of Algorithm 2. In Step 2, we estimate $\alpha_{A}$ using the method of batch means as in (3.1); the number of iterations of the Markov chain $N$ is chosen such that $S_{\rm rec}$ consists of roughly $10^{4}$ inwards crossings of $A$ . In Step 3, we want to choose parameters $m,n_{0},\ldots,n_{m},\ell_{1},\ldots,\ell_{m}$ for the Multilevel Splitting in such a way that the workload is minimized and the resulting estimator satisfies

[TABLE]

We run a pilot MLS with many intermediate thresholds ( $m=20$ ). The pilot gives us rough estimates of $p_{B}$ , $T_{B}$ and ${\rm RE}(R_{+})$ . We put the number of thresholds $m$ and splitting factors $n_{0},\ldots,n_{m}$ as in (B.2); we emphasize that the optimal $n_{0}$ is also determined by the desired squared relative error $\rho$ . We find the intermediate thresholds $\ell_{1},\ldots,\ell_{m}$ following the log-linear interpolation approach from Wadman et al. (2014). Assuming ( ${\rm I}$ - ${\rm II}$ ) are satisfied, the MLS method with these parameters should give the desired relative error, as in (5.5). We note that in the pilot we use the variant of MLS called ‘Fixed Number of Successes’ developed by Amrein and Künsch (2011).

The final estimator $\widehat{\gamma}$ is the mean of $N=100$ independent replicas $\widehat{\gamma}^{(1)},\ldots,\widehat{\gamma}^{(N)}$ of the RMS estimator (3.11) with parameters as discussed above; i.e.

$$\widehat{\gamma}:=\frac{1}{N}\sum_{i=1}^{N}\widehat{\gamma}^{(i)}.$$

This additional ‘Monte Carlo wrapper’ around the RMS method enables us to approximate the relative error ${\rm RE}(\widehat{\gamma})$ with

[TABLE]

and we can approximate ${\rm RE}(\widehat{\alpha}_{A})$ and ${\rm RE}(\widehat{T}_{B})$ in a similar way. For each experiment we present a table with results corresponding to multiple values of the threshold $u$ . Each table displays the final estimator $\widehat{\gamma}$ as well as its estimate for ${\rm RE}(\widehat{\gamma})$ , as in (5.6), and ${\rm Eff}(\widehat{\gamma})$ , as in (5.1) based on the run of an MC estimator $\widehat{\gamma}^{\rm MC}$ .
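The Monte Carlo wrapper can be sketched as below. The code assumes that the relative error of the mean is approximated by the sample standard deviation of the replicas divided by $\sqrt{N}$ times their mean; this is the standard estimator and is our reading of (5.6), which is not reproduced here. The function name is illustrative.

```python
import math
import statistics


def replica_summary(replicas):
    """Mean of independent replicas and the estimated relative error of
    that mean: stdev / (sqrt(N) * mean)."""
    n = len(replicas)
    if n < 2:
        raise ValueError("need at least two replicas")
    mean = statistics.fmean(replicas)
    if mean == 0.0:
        raise ValueError("relative error is undefined for a zero mean")
    rel_err = statistics.stdev(replicas) / (math.sqrt(n) * mean)
    return mean, rel_err
```

The same function can be applied to the replicas of $\widehat{\alpha}_{A}$ and $\widehat{T}_{B}$ to approximate their relative errors.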

Various checks can be done in order to assess the reliability of the estimator $\widehat{\gamma}$ . In each table we additionally give the estimate for ${\rm RE}(\widehat{T}_{B})$ ; if it matches the desired relative error, i.e. ${\rm RE}(\widehat{T}_{B})\approx 5\cdot 10^{-3}$ , then this is an indication that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are satisfied. When ${\rm RE}(\widehat{T}_{B})$ is larger than desired, it might be a result of poorly chosen intermediate thresholds $\ell_{1},\ldots,\ell_{m}$ ; we propose to verify, after the algorithm has been executed, whether the estimates for all the intermediate probabilities $p_{1},\ldots,p_{m}$ roughly equal the optimal $p_{\text{opt}}\approx 0.20$ . If this is the case and we still get a particularly large ${\rm RE}(\widehat{T}_{B})$ , this is an indication that either the recurrency set or the importance function have not been properly chosen. In case of violation of the former, in Section 4.2 we proposed a test for the appropriateness of the choice of the set $A$ . Additional verification can be performed to assess whether resampling from the set $S_{\rm rec}$ obtained in Step 2 of the RMS algorithm is a good approximation of taking i.i.d. samples of $X^{A}_{1}$ . This implies that $\widehat{\alpha}_{A}$ and $\widehat{T}_{B}$ are independent, but if they are independent then necessarily

[TABLE]

Thus, if (5.7) is not approximately satisfied, it is an indication that $S_{\rm rec}$ does not represent the distribution of $X_{1}^{A}$ well. We emphasize that the relative error of $\widehat{\gamma}$ presented in the tables is calculated as in (5.6).
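For reference, the relation behind this check can be written out. The following is the exact identity for two independent random variables with non-zero means; it is a standard fact and is given here as a worked derivation, not as a transcription of display (5.7).

```latex
% X = \widehat{\alpha}_A and Y = \widehat{T}_B independent, with means \mu_X, \mu_Y \neq 0:
\begin{align*}
\mathbb{V}{\rm ar}(XY)
  &= \mathbb{E}[X^2]\,\mathbb{E}[Y^2] - \mu_X^2\mu_Y^2 \\
  &= (\mathbb{V}{\rm ar}X + \mu_X^2)(\mathbb{V}{\rm ar}Y + \mu_Y^2) - \mu_X^2\mu_Y^2 \\
  &= \mathbb{V}{\rm ar}X\,\mathbb{V}{\rm ar}Y + \mu_Y^2\,\mathbb{V}{\rm ar}X + \mu_X^2\,\mathbb{V}{\rm ar}Y, \\
{\rm RE}^2(XY)
  &= {\rm RE}^2(X) + {\rm RE}^2(Y) + {\rm RE}^2(X)\,{\rm RE}^2(Y).
\end{align*}
```

When both relative errors are small the product term is negligible. For instance, with ${\rm RE}(\widehat{\alpha}_{A})=1.66\cdot 10^{-3}$ (the value reported in the caption of Table 1) and ${\rm RE}(\widehat{T}_{B})=5\cdot 10^{-3}$, independence would give ${\rm RE}(\widehat{\gamma})\approx\sqrt{1.66^{2}+5^{2}}\cdot 10^{-3}\approx 5.27\cdot 10^{-3}$.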

5.2 Ornstein-Uhlenbeck Process

Let $(X_{t})_{t\geq 0}$ be a $d$ -dimensional Ornstein-Uhlenbeck process ( $d$ -dim OU), i.e., a process taking values in $\mathbb{R}^{d}$ solving the SDE

[TABLE]

with $Q\in\mathbb{R}^{d\times d}$ and $(W_{t})_{t\geq 0}$ denoting a standard $d$ -dimensional Wiener process. Applying the explicit Euler numerical scheme to (5.8), with time step $h>0$ yields

[TABLE]

with $I$ the $d$-dimensional identity matrix and $Z_{1},Z_{2},\ldots$ i.i.d. $d$-dimensional standard normal random variables. It is known (Schurz, 1999) that the stationary distribution $\mu$ of (5.9) exists if there exists a positive-definite matrix $M=(M_{ij})_{i,j\in\{1,\ldots,d\}}$ solving

[TABLE]

then the stationary distribution $\mu$ is $d$ -dimensional centered normal with covariance matrix $M$ . The rare event of our interest is the exceedance of a high threshold in the first dimension under the stationary distribution (of the discrete-time Markov chain in (5.9)), as in (5.2). Eq. (5.10) is a well-known Sylvester equation and its solution $M$ can be found numerically, so that $\gamma(u)$ can be evaluated as

[TABLE]

with $\Phi(\cdot)$ the standard normal cdf. Knowing the ground truth $\gamma(u)$ gives us a means to determine how accurate the RMS estimator $\widehat{\gamma}$ is.
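In one dimension the ground truth has a closed form, which the following sketch evaluates. It rests on an assumption about the model that is not visible in the displays above: the SDE is taken to be $\mathrm{d}X_{t}=-QX_{t}\,\mathrm{d}t+\mathrm{d}W_{t}$ with unit noise, so that the Euler recursion is $X_{n+1}=(1-hQ)X_{n}+\sqrt{h}\,Z_{n+1}$ and the stationary variance solves $M=(1-hQ)^{2}M+h$. Under that assumption $\gamma(u)=1-\Phi(u/\sqrt{M})$. The function names are illustrative; for $d>1$ the matrix equation would have to be solved numerically instead.

```python
import math
from statistics import NormalDist


def stationary_variance(q: float, h: float) -> float:
    """Fixed point of M = (1 - h q)^2 M + h (requires 0 < h q < 2)."""
    if not (0.0 < h * q < 2.0):
        raise ValueError("the Euler recursion is stable only for 0 < h*q < 2")
    return h / (1.0 - (1.0 - h * q) ** 2)


def gamma_exact(u: float, q: float, h: float) -> float:
    """Stationary probability that the Euler chain exceeds u."""
    m = stationary_variance(q, h)
    return 0.5 * math.erfc(u / math.sqrt(2.0 * m))


def threshold_for(gamma: float, q: float, h: float) -> float:
    """Threshold u at which gamma_exact(u, q, h) equals `gamma`."""
    m = stationary_variance(q, h)
    return -math.sqrt(m) * NormalDist().inv_cdf(gamma)
```

The helper `threshold_for` mirrors the way thresholds $u$ are matched to prescribed values of $\gamma$ in the one-dimensional experiment.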

In the following three subsections we study the OU process with different sets of parameters but with the same choice of the recurrency set and importance function, as in (5.3) and (5.4). First, we study the simplest case of a one-dimensional OU process. This is an ‘ideal’ example in the sense that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are (approximately) satisfied. Second, we study a multidimensional OU process; while the simplifying assumptions do not seem to be satisfied, they are ‘close enough’ for the RMS method to give satisfactory results. The third case describes a two-dimensional OU process with the matrix $Q$ chosen such that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are not satisfied for our choice of the recurrency set and the importance function.

5.2.1 1-dim OU

In this experiment we put $d=1$ , $Q=1$ , $h=0.01$ . The recurrency set $A(\ell)$ and importance function $H(\cdot)$ are as in (5.3) with $\ell=0$ and (5.4) respectively.

If we were to study the stationary distribution of the original SDE driven by (5.8) (rather than the time-discrete numerical solution in (5.9)), then the paths of the process would be continuous and thus $X^{A}_{1}=0$ a.s. Moreover, because of their continuity, these paths must cross all intermediate states $x\in(0,u)$ before reaching $B$. Therefore $x\mapsto\mathbb{P}_{x}(\tau_{B}<\tau^{\text{in}}_{A})$ is an increasing function, implying that the distance-based importance function satisfies (${\rm I}$) in the continuous-time case. By similar arguments, $X_{\tau_{B}}=u$ a.s., and hence (${\rm II}$) is satisfied as well in that case.

The Markov chain driven by (5.9) is a discrete-time approximation of (5.8), so the assumptions will not be satisfied exactly. In particular, we note that for any time step $h>0$ , the support of $X_{\tau_{B}}$ is the entire halfline $[u,\infty)$ because in principle the process can exceed the threshold $u$ by any positive value upon the first entry. This shows that Assumption ( ${\rm II}$ ) is not satisfied. An analogous argument can be used to show that Assumption ( ${\rm I}$ ) is not satisfied either. Nonetheless, for a small time step $h>0$ , extreme overshooting upon the first entry (i.e., $X_{\tau_{B}}$ being significantly larger than $u$ , or $X_{\tau_{k}}$ significantly larger than $\ell_{k}u$ ) is very unlikely. We conclude that the assumptions are satisfied approximately.

Since the value of $\gamma(u)$ can be evaluated using (5.11), we chose the thresholds $u$ to match the desired value of $\gamma(u)$ , as in Table 1. The results show that ${\rm RE}(\widehat{T}_{B})\approx 5\cdot 10^{-3}$ , as desired in (5.5); this is a good indication that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are satisfied. Also, the relative error calculated under the independence assumption via (5.7) matches the estimated ${\rm RE}(\widehat{\gamma})$ .

Conclusions. In this setting the RMS algorithm is very efficient, as compared with MC. The numerical results agree very well with the theoretical outcomes, confirming our observation that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are approximately satisfied.

5.2.2 10-dim OU, $Q$ with real eigenvalues

In this experiment we put $d=10$ , $h=0.01$ . The matrix $Q=(Q_{ij})_{i,j\in\{1,\ldots,d\}}$ is randomly generated such that all its eigenvalues are real. The recurrency set $A(\ell)$ and importance function $H(\cdot)$ are as in (5.3) with $\ell=0$ and (5.4) respectively.

In Fig. 2 we plot four randomly chosen recurrency cycles, projected onto the first and second dimension, which have reached the rare event $B$ . These conditional paths seem to follow a linear pattern; similar behavior is seen in other projections (not shown). This indicates that attaining high values in the first dimension is coupled with attaining high values in the second dimension (and similar statements can be made about other dimensions). Therefore, the distance-based importance function is not expected to satisfy ( ${\rm I}$ ), as it does not take this behavior into account; an ideal importance function should give larger importance to states which attain simultaneously high values in the first and second dimension. While the distance-based importance function is not the most appropriate choice, it is still expected to give satisfactory results, as it drives the paths gradually towards the rare event.

The results of the RMS algorithm are presented in Table 2. It can be seen that the values of ${\rm RE}(\widehat{T}_{B})$ do not exactly match the desired value $5\cdot 10^{-3}$ in (5.5), which in view of the earlier discussion is not surprising, as we did not expect Assumptions ( ${\rm I}$ - ${\rm II}$ ) to hold. However, the estimates $\widehat{\gamma}$ are still very accurate, and the efficiency is still excellent (relative to the MC method).

Conclusions. This experiment shows that the RMS algorithm can be effectively implemented in a multidimensional setting, even when Assumptions ( ${\rm I}$ - ${\rm II}$ ) are violated. This underscores the robustness of the distance-based importance function.

5.2.3 2-dim OU, $Q$ with complex eigenvalues

In this experiment we put $d=2$ , $h=0.01$ . We choose $Q$ to have non-real eigenvalues: for a positive $\theta$ ,

[TABLE]

The drift generates a rotating (or spiraling) motion of the paths, with the speed of rotation increasing as $\theta$ increases. We compare the efficiency of the RMS method for increasing values of $\theta$ . The recurrency set $A(\ell)$ and importance function $H(\cdot)$ are as in (5.3) with $\ell=0$ and (5.4) respectively.

The results are presented in Table 3. We see that for most values of $\theta$, RMS outperforms Monte Carlo, but the larger $\theta$ is, the lower the efficiency ratio ${\rm Eff}(\widehat{\gamma})$ becomes. At the same time, as $\theta$ grows, the value of ${\rm RE}(\widehat{T}_{B})$ deviates more and more from the desired target $5\cdot 10^{-3}$, as in (5.5). This indicates a violation of Assumptions (${\rm I}$-${\rm II}$). We note that the estimates $\widehat{\gamma}$ are quite accurate nonetheless, with a minor relative error of a few percent visible for larger values of $\theta$.

In Fig. 3 we plot five random recurrency cycles conditioned on reaching the rare set $B$. We see that the paths do not gradually drift towards $B$, but rather first move far away from $B$, due to the drift-induced rotation. This hints that the distance-based importance function might be a poor choice. Fig. 4 shows that even property (4.3) seems to be violated. In this figure we compare the histograms of $S_{\text{rec}}$ and $S^{q}_{\text{rec}}$ in order to compare the distributions of $X^{A}_{1}$ and $X^{A}_{+}$ (see the discussion in Section 4.2). The figure shows that $X^{A}_{+}$ has more probability mass in the sets $\{x_{2}\leq-1\}$ or $\{x_{2}\geq 1\}$ than $X^{A}_{1}$.

Conclusions. When $Q$ has non-real eigenvalues, the naïve choice of the recurrency set and the distance-based importance function (i.e., (5.3) and (5.4)) seems inadequate and leads to a relative error higher than expected. This underscores the fact that one has to be careful with the choice of $A$ and $H(\cdot)$ and verify whether Assumptions ( ${\rm I}$ - ${\rm II}$ ) are satisfied; this can be done e.g. by the means described in Section 4.2. Despite violation of Assumptions ( ${\rm I}$ - ${\rm II}$ ), RMS still gives rather accurate estimates of $\gamma$ , and outperforms Monte Carlo for small $\theta$ .

5.3 Franzke (2012) Stochastic Climate Model

As our final example, we consider the low-order stochastic climate model presented by Franzke (2012). This is a 4-dimensional SDE with certain key features that are also present in more complex climate models, including nonlinear (quadratic) drift terms that are energy-conserving. We refer to Franzke (2012) for a more detailed discussion of the physical interpretation of this model.

The model is given by the following set of SDEs. It uses a standard, two-dimensional Wiener process $(W^{(1)}_{t},W^{(2)}_{t})$ . We write $x_{i}:=X^{(i)}_{t}$ , $y_{i}:=Y^{(i)}_{t}$ and $W_{i}:=W^{(i)}_{t}$ to simplify notation. We consider the system

[TABLE]

When the parameter $\varepsilon$ is set to a small value, a separation of timescales is created between the variables $x_{1},x_{2}$ (slow) and $y_{1},y_{2}$ (fast). The main interest is in the behavior of the slow variables $x_{1},x_{2}$ .

The parameters we use match those used in Franzke (2012). This means that we set $\mu=1$ , the $B$ -coefficients are given by $B_{123}^{1}=4$ , $B_{213}^{1}=4$ , $B_{312}^{1}=-8$ , $B_{131}^{2}=0.25$ , $B_{113}^{2}=0.25$ , $B_{311}^{2}=-0.5$ , $B_{242}^{3}=-0.3$ , $B_{224}^{3}=-0.4$ , $B_{422}^{3}=0.7$ , the $L$ -coefficients by $L_{13}=-L_{24}=-0.2$ , and the other parameters by $\omega=1$ , $a_{1}=1$ , $a_{2}=-1$ , $d_{1}=-0.2$ , $d_{2}=-0.1$ , $\gamma_{1}=\gamma_{2}=1$ , $\sigma_{1}=3$ , $\sigma_{2}=1$ . In addition we put $L_{12}=-L_{21}=1,\varepsilon=0.2$ . The forcing vector $(F_{1},F_{2},F_{3},F_{4})$ is given by $(-0.25,0,0,0)$ .

Since this process is non-standard, in order to build intuition, we first generated a contour plot of the estimated stationary density of $(x_{1},x_{2})$ ; see Fig. 5. The process turns out to randomly switch between two modes: one mode with $x_{1}\leq x_{2}$ and a second mode with $x_{1}\geq x_{2}$ . The estimated density function in Fig. 5 shows that the process is more likely to be in the second mode.

We use the explicit Euler scheme with $h_{0}=10^{-4}$ but we store the values of the process every $h=0.01$ . The small integration time step $h_{0}$ is needed for numerical stability. Similar to the previous examples, the rare event we study is the exceedance of a high threshold by $x_{1}$ under the stationary distribution, cf. (5.2). We choose the recurrency set $A(\ell)$ as in (5.3) with $\ell^{*}=7.9$ suggested by the algorithm (4.4). The importance function $H(\cdot)$ is as in (5.4).
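The combination of a fine integration step and coarser storage can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `euler_maruyama_subsampled`, its arguments, and the stand-in Ornstein-Uhlenbeck drift are our assumptions, chosen only because the stored sequence then forms the discrete-time chain $X_{n}$ on which the recurrency cycles are defined.

```python
import numpy as np

def euler_maruyama_subsampled(drift, sigma, x0, h0, h, n_stored, rng):
    """Explicit Euler scheme with internal step h0, storing the state every h."""
    substeps = int(round(h / h0))            # fine steps per stored step
    x = np.asarray(x0, dtype=float).copy()
    out = np.empty((n_stored + 1, x.size))
    out[0] = x
    for n in range(1, n_stored + 1):
        for _ in range(substeps):
            # Brownian increment over one fine step has variance h0
            dw = rng.normal(0.0, np.sqrt(h0), size=x.size)
            x = x + drift(x) * h0 + sigma * dw
        out[n] = x                           # only every (h / h0)-th state is kept
    return out

# Hypothetical stand-in dynamics (NOT the climate model): dX = -X dt + dW in 2 dims.
rng = np.random.default_rng(0)
path = euler_maruyama_subsampled(lambda x: -x, 1.0, [0.0, 0.0],
                                 h0=1e-4, h=0.01, n_stored=50, rng=rng)
print(path.shape)
```

With $h_{0}=10^{-4}$ and $h=0.01$ each stored step costs 100 fine steps, so the stored chain is 100 times shorter than the integrated path.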

The results of the RMS method are outstanding, see Tab. 4. For $u=18.5$ , when $\gamma(u)\approx 10^{-7}$ , we find Eff( $\widehat{\gamma}$ ) $\approx$ 1522. In other words, the RMS algorithm is more than 1500 times faster than MC. The values of ${\rm RE}(\widehat{T}_{B})$ match the desired $5\cdot 10^{-3}$ (see (5.5)) very closely even for very high thresholds, indicating that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are satisfied. A random realization of a cycle reaching the rare event, shown in Fig. 6, is yet another indication that the distance-based importance function is a good choice, as the path seems to gradually drift towards the rare event.

Conclusions. This example shows a successful application of the RMS algorithm to a multidimensional nonlinear stochastic-dynamical model with characteristics of complex climate models. We find that RMS is up to three orders of magnitude faster than MC in this example, and the efficiency gain is expected to be even larger for higher thresholds $u$ .

6 Summary

In this manuscript we have proposed a new algorithm for the estimation of small steady-state probabilities $\gamma=\mu(B)$ , as in (1.1), of Markov processes with continuous state space. Our approach, which we have called the Recurrent Multilevel Splitting (RMS) algorithm, is based on the alternative representation (2.5) of $\gamma$ (as given in Theorem 1). This representation is obtained by dissecting the path of the Markov process into recurrency cycles, each cycle beginning with an inwards crossing of a set $A$ . It allows us to transform the problem of estimating $\gamma$ essentially into the problem of estimating $T_{B}$ , the expected time spent in the set $B$ in a recurrency cycle.

In order to efficiently estimate $T_{B}$ we use Multilevel Splitting (MLS), but we emphasize that other rare event simulation methods could have been used instead (such as Genealogical Particle Analysis or Importance Sampling). We have derived optimal parameters for the MLS in Appendix B, and we have shown (Theorem 3) that under simplifying assumptions, a suitable choice of the recurrency set $A$ in combination with the optimal choice of the parameters leads to logarithmic efficiency of the RMS algorithm.

In Section 5, four numerical studies were presented, where we used the RMS algorithm to estimate steady-state probabilities of high threshold exceedances for various SDEs discretized in time. The experiments demonstrate that RMS gives accurate results. Furthermore, they consistently show the efficiency gain of RMS compared to Monte Carlo; in the most notable case of the Franzke (2012) model (Section 5.3), RMS outperforms MC by up to three orders of magnitude.

One of the numerical experiments (Section 5.2.3) was designed to give suboptimal results, with an SDE displaying rotating motion so that the most straightforward choices of the recurrency set and importance function (as used in the experiments) were expected to be not very suitable. Although the estimates obtained with RMS were still quite accurate, the efficiency gain of RMS compared to MC was decreasing as the rotation speed was increasing. This example showed how the choice of the recurrency set and the importance function can impact the performance of the algorithm.

In light of this example, an interesting topic for future research is the choice of the recurrency set $A$ . As already mentioned in Section 4.2, a good choice of $A$ should be a suitable compromise between visiting $A$ relatively often and (4.3) being (approximately) met. We have proposed a method of optimizing $A(\ell)$ parametrized by $\ell$ in (4.4), and pointed out a method of testing whether $A$ satisfies (4.3) through a quantile validation (4.5). Further development of these ideas to construct an optimal $A$ is a challenging open research topic.

Acknowledgments

We thank the organizers of the Summer School in Monte Carlo for Rare Events (June 2016 at Brown University) for making lecture notes available. This work is part of the research programme ‘Mathematics of Planet Earth’ which was funded by the Netherlands Organisation for Scientific Research (NWO), grant number 657.014.003. Michel Mandjes’ research is partly funded by the NWO Gravitation Programme NETWORKS, grant number 024.002.003.

Appendix A Technical Results

Proof of Theorem 1.

Define a new Markov chain $Z_{n}:=(X_{n-1},X_{n})$ ; it is also positive Harris with a stationary measure $\widetilde{\mu}$ satisfying, for measurable sets $C_{0},C_{1}$ ,

[TABLE]

We see that the stopping times $S_{n}$ coincide with the times the process $Z_{n}$ visits a set $\mathcal{A}:=(A^{c},A)$ , with $A^{c}:=\mathbb{R}^{d}\setminus A$ . Since $\mu(A)\in(0,1)$ we have

[TABLE]

According to Meyn and Tweedie (2012, Thm. 10.4.9) we have, with $\tau_{\mathcal{A}}:=\inf\{n>0:Z_{n}\in\mathcal{A}\}$ ,

[TABLE]

Due to $\widetilde{\mu}(\mathbb{R}^{d},B)=\mu(B)$ , $\{Z_{n}\in(\mathbb{R}^{d},B)\}=\{X_{n}\in B\}$ , and $\widetilde{\mu}(\mathcal{A})=\alpha_{A}$ , it follows that

[TABLE]

Finally, we recognize that the conditioning above is equivalent to $X_{0}$ being distributed as an initial point of a recurrency cycle $X^{A}_{1}$ in stationarity, so that we conclude (2.5). Similarly, one can show that $\alpha_{A}=(\mathbb{E}_{\mu}L_{1})^{-1}$ by considering the expected time spent in $(\mathbb{R}^{d},\mathbb{R}^{d})$ within a recurrency cycle. ∎
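The relation $\gamma=\alpha_{A}\cdot T_{B}$ can be checked numerically on a toy chain. The sketch below is illustrative only: the AR(1) chain, its coefficient `a`, the threshold `u` and the path length are our assumptions, not settings from the experiments, and the event $B$ is deliberately not rare so that a plain simulation suffices. Cycle starts are the times $n$ with $X_{n-1}\notin A$ and $X_{n}\in A$ , as in the definition of $\mathcal{A}=(A^{c},A)$ .

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, a, u = 200_000, 0.9, 1.5           # hypothetical toy settings
noise = rng.normal(size=n_steps)
x = np.empty(n_steps)
x[0] = 0.0
for n in range(1, n_steps):
    x[n] = a * x[n - 1] + noise[n]          # AR(1) chain on the real line

in_A = x <= 0.0                             # recurrency set A = {x <= 0}
in_B = x >= u                               # target set B = {x >= u}

# A cycle starts with an inward crossing: X_{n-1} not in A and X_n in A.
starts = np.flatnonzero(~in_A[:-1] & in_A[1:]) + 1
first, last = starts[0], starts[-1]         # window made of complete cycles only
n_cycles = len(starts) - 1

alpha_hat = n_cycles / (last - first)             # inward crossings per time step
T_B_hat = in_B[first:last].sum() / n_cycles       # mean time spent in B per cycle
gamma_cycles = alpha_hat * T_B_hat
gamma_time_avg = in_B[first:last].mean()          # fraction of time spent in B

print(gamma_cycles, gamma_time_avg, in_B.mean())
```

On the window of complete cycles the two estimates agree up to rounding, which is the sample-path counterpart of (2.5); the point of RMS is that $\alpha_{A}$ and $T_{B}$ can then be estimated separately, the latter with a rare event method.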

Derivation of (4.1).

Notice that ( ${\rm I}$ ) implies that the number of times $r_{k}$ that the $k$ -th threshold is hit is distributed as a sum of $n_{k-1}\,r_{k-1}$ independent Bernoulli trials, each with probability of success $p_{k}$ :

[TABLE]

here $\text{Bin}(n,p)$ denotes a Binomial distribution with $n$ trials with success probability $p$ , with the convention that $\text{Bin}(0,p)\equiv 0$ . Similarly, ( ${\rm II}$ ) implies that the total time spent in the rare set is distributed as a sum of $n_{m}r_{m}$ independent copies from the distribution $R_{+}$ :

[TABLE]

where $R^{(1)}_{+},R^{(2)}_{+},\ldots$ are i.i.d. copies of $R_{+}$ (with the empty sum being defined as 0). Using (A.1) and the law of total variance we obtain, for $k\in\{1,\ldots,m\}$ ,

[TABLE]

Similarly, using (A.2) we obtain

[TABLE]

Combining these results with (3.10) yields (4.1). ∎
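The idealised model behind this derivation is easy to simulate. The following sketch assumes the simplified setting literally: threshold hits are binomial as in (A.1), with $r_{0}:=1$ , and the time in $B$ is a sum of $n_{m}r_{m}$ i.i.d. copies of $R_{+}$ as in (A.2). The function name `simplified_mls`, the normalisation by $n_{0}\cdots n_{m}$ , the geometric law for $R_{+}$ and all numerical values are our assumptions for illustration; the normalisation in Algorithm 1 may be written differently.

```python
import numpy as np

def simplified_mls(n, p, sample_R_plus, rng):
    """One run of fixed-factor splitting in the idealised binomial setting.

    n: splitting factors (n_0, ..., n_m); p: success probabilities (p_1, ..., p_m).
    """
    r = 1                                      # r_0 := 1
    for k in range(len(p)):
        r = rng.binomial(n[k] * r, p[k])       # r_{k+1} | r_k ~ Bin(n_k r_k, p_{k+1})
    total = sample_R_plus(n[-1] * r, rng).sum()    # empty sum is 0
    return total / np.prod(n)                  # normalise by n_0 * ... * n_m

# Hypothetical example: m = 3 thresholds, R_+ geometric on {1, 2, ...} with mean 2.
n = [100, 5, 5, 5]
p = [0.2, 0.2, 0.2]
geometric = lambda size, rng: rng.geometric(0.5, size=size)

rng = np.random.default_rng(2)
estimates = np.array([simplified_mls(n, p, geometric, rng) for _ in range(2000)])
T_B_true = np.prod(p) * 2.0                    # p_B * E[R_+] in this toy model
print(estimates.mean(), T_B_true)
print(estimates.std() / T_B_true)              # empirical relative error of one run
```

Averaging many runs recovers $p_{1}\cdots p_{m}\cdot\mathbb{E}R_{+}$ in this toy model, and the empirical relative error of a single run can be compared against a formula of the type (4.1).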

Appendix B Derivation of Optimal Parameters

Following Amrein and Künsch (2011), we assume that the computational effort $w_{k}$ in the $k$ -th stage of Algorithm 1 (to sample a path starting from $X_{\tau_{k}}$ until $\min\{\tau_{k+1},\tau^{\text{in}}_{A}\}$ ) does not depend on the entry state $X_{\tau_{k}}$ . Simplifying this further, we assume that $w_{k}$ does not depend on $k$ , so without loss of generality,

[TABLE]

A more general cost $w_{k}$ can be considered for particular problems, see e.g. Lagnoux (2006).

Let $N_{k}:=n_{k}r_{k}$ , for $k\in\{0,\ldots,m\}$ , be the number of paths simulated in the $k$ -th stage of the algorithm, with $r_{0}:=1$ . Then the average total workload equals

[TABLE]

and since $\mathbb{E}r_{k}=p_{1}\cdots p_{k}$ , cf. (3.10), we conclude

[TABLE]

Finally, we formulate the minimization problem

[TABLE]

In our simplified setting, i.e., under Assumptions ( ${\rm I}$ - ${\rm II}$ ), we have derived a formula for the corresponding squared relative error in (4.1). We are able to solve the optimization problem above under the additional relaxation that the $n_{k}$ and $m$ are real and positive. To this end, it is helpful to denote

[TABLE]

Then we can write

[TABLE]

We want to minimize the workload $W$ under the constraint that

[TABLE]

We do this in steps. First, we fix $m$ and the conditional probabilities $p_{1},\ldots,p_{m}$ , so that $a_{1},\ldots,a_{m}$ are fixed (recall that $a_{m+1}$ is not a parameter of the algorithm). We relax the problem and allow the splitting factors $n_{k}$ to attain any real, positive value. This means that we wish to solve (over $c_{1},\ldots,c_{m+1}>0$ )

[TABLE]

The corresponding Karush–Kuhn–Tucker conditions are

[TABLE]

with the gradient ‘ $\nabla$ ’ taken with respect to vector $(c_{1},\ldots,c_{m+1})$ . These are solved by

[TABLE]

with the optimal workload

[TABLE]

In the next step, we keep $m$ fixed and minimize over $a_{1},\ldots,a_{m}$ . Notice that $1+a_{k}=1/p_{k}$ , so that our minimization problem takes the form

[TABLE]

Not surprisingly, this system is solved by

[TABLE]

so that the optimal intermediate probabilities coincide:

[TABLE]

with the optimal workload being

[TABLE]

The final step is finding the optimal number of thresholds $m$ . We see that the minimizer of $W(m)$ is also a minimizer of

[TABLE]

Again, we relax this problem, allowing $m$ to be any real, positive number. Finally, the optimal parameters are:

[TABLE]

with $c\approx 0.6275$ solving $\exp(1/c)=2c/(2c-1)$ , and the optimal workload reads as in (4.2). Since $m$ and the $n_{k}$ must be integers, we propose to simply round the optimal parameters to the closest integer. A similar result (but without the last splitting stage, in which we estimate the time spent in the set $B$ ) has been presented in Lagnoux (2006, Example 3.2).
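The constant $c$ has no closed form, but the defining equation $\exp(1/c)=2c/(2c-1)$ is straightforward to solve numerically. The sketch below uses plain bisection; the bracketing interval $[0.51,1]$ is our choice, justified by the sign change of the residual between its endpoints (the right-hand side blows up as $c\downarrow 1/2$ ).

```python
import math

def residual(c):
    """Left-hand side minus right-hand side of exp(1/c) = 2c / (2c - 1)."""
    return math.exp(1.0 / c) - 2.0 * c / (2.0 * c - 1.0)

lo, hi = 0.51, 1.0               # residual(lo) < 0 < residual(hi)
for _ in range(60):              # 60 halvings reach double precision
    mid = 0.5 * (lo + hi)
    if residual(mid) < 0.0:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)

print(c, residual(c))
```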

Appendix C Logarithmic Efficiency of the RMS Algorithm

In this section we study the efficiency of the RMS method, in the asymptotic regime that the rare event probability (1.1) tends to $0$ (i.e. $\gamma\to 0$ ). First, we notice that if we fix the recurrency set $A$ , then $\alpha_{A}$ does not change as $\gamma\to 0$ ; hence we only have that $T_{B}\to 0$ . This indicates that asymptotic efficiency properties of RMS will be closely related to those of MLS. In order to study the performance of the estimator, we first introduce the concepts of strong and logarithmic efficiency.

Let $\widehat{\Psi}_{\ell}$ be a family of unbiased estimators for $\Psi_{\ell}>0$ , parametrized by $\ell$ such that $\Psi_{\ell}\to 0$ , as $\ell\to\infty$ . Let $W(\widehat{\Psi}_{\ell})$ denote the computation time corresponding to $\widehat{\Psi}_{\ell}$ . The estimator $\widehat{\Psi}_{\ell}$ is called strongly efficient if

[TABLE]

and logarithmically efficient if

[TABLE]

Strong efficiency implies that the workload needed to estimate the quantity of interest $\Psi_{\ell}$ with a desired accuracy ${\rm RE}^{2}(\widehat{\Psi}_{\ell})\leq\rho$ is uniformly bounded as $\ell\to\infty$ . Logarithmic efficiency implies that the workload needed to achieve the accuracy ${\rm RE}^{2}(\widehat{\Psi}_{\ell})=\rho$ increases more slowly than $\Psi_{\ell}^{-\varepsilon}$ for any $\varepsilon>0$ , as $\ell\to\infty$ . Evidently, strong efficiency implies logarithmic efficiency.

Before we prove the logarithmic efficiency of RMS in Theorem 3 we show an inefficiency result for the Monte Carlo estimator for $T_{B}$ . Let $\widehat{T}^{\text{MC}}_{B}$ be a sample mean of $N$ independent copies of $R_{1}$ . We then have

[TABLE]

Now to achieve a desired level of accuracy ${\rm RE}^{2}(\widehat{T}^{\text{MC}}_{B})\leq\rho$ , assuming (B.1), the total required workload is

[TABLE]

As already noted in Section 4.1, $W(\widehat{T}^{\rm MC}_{B})$ is inversely proportional to $p_{B}$ and so it follows that the Monte Carlo estimator is not logarithmically efficient.

We have seen, cf. (4.2), that the workload of the MLS estimator with the optimal parameters $W(\widehat{T}_{B})$ is proportional to $(\log(p_{B}))^{2}$ . It turns out that under a mild additional assumption, the MLS algorithm is logarithmically efficient and thus so is RMS. We make this rigorous in the following theorem.
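To get a feeling for the gap between the two growth rates, one can tabulate $1/p_{B}$ against $(\log p_{B})^{2}$ . The sketch below omits all proportionality constants (which depend on the accuracy $\rho$ and on $R_{+}$ ), so only the trend of the ratio is meaningful; the helper name `workload_scales` and the chosen values of $p_{B}$ are ours.

```python
import math

def workload_scales(p_B):
    """Return the (MC, MLS) workload scales with all constants omitted."""
    return 1.0 / p_B, math.log(p_B) ** 2

ratios = []
for p_B in (1e-3, 1e-6, 1e-9, 1e-12):
    mc, mls = workload_scales(p_B)
    ratios.append(mc / mls)
    print(f"p_B={p_B:.0e}  1/p_B={mc:.1e}  (log p_B)^2={mls:.1f}  ratio={mc / mls:.2e}")
```

The ratio grows without bound as $p_{B}\to 0$ , which is the informal content of the statement that MC is not logarithmically efficient while MLS with optimal parameters is.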

Theorem 3 (Logarithmic Efficiency of RMS).

Fix the recurrency set $A$ and let the set $B_{\ell}$ be parametrized by $\ell$ , such that $\gamma_{\ell}:=\mu(B_{\ell})\to 0$ as $\ell\to\infty$ . Assume

$\circ$ that the estimators $\widehat{\alpha}_{A}$ and $\widehat{T}_{B_{\ell}}$ are independent;

$\circ$ that Assumptions ( ${\rm I}$ - ${\rm II}$ ) are valid for each $\ell$ ;

$\circ$ that the workload satisfies (B.1);

$\circ$ and that, for $\delta>0$ sufficiently small,

[TABLE]

Then the RMS estimator $\widehat{\gamma}_{\ell}$ for $\gamma_{\ell}$ , with the optimal choice of the parameters (B.2), is logarithmically efficient.

We point out that the first part of the assumption (C.5) is equivalent to strong efficiency of the crude Monte Carlo estimator for $R_{+}$ , under the workload assumption (B.1). This is not too restrictive, as often the main difficulty when estimating $T_{B}$ lies in the fact that $p_{B}$ is extremely small (and does not relate to the large variance of $R_{+}$ ). Since $\gamma_{\ell}\to 0$ and $A$ is fixed, necessarily $T_{B_{\ell}}\to 0$ . In the second part of (C.5) we require that there exists a $\delta>0$ such that $\mathbb{E}R_{+}p_{B_{\ell}}^{1-\delta}\to 0$ . Loosely speaking, it means that $p_{B_{\ell}}$ converges to $0$ at least polynomially faster than $\mathbb{E}R_{+}$ grows to infinity; this is trivially satisfied when $\mathbb{E}R_{+}$ is bounded.

Proof of Theorem 3.

Since the recurrency set $A$ is fixed, the quantities $\widehat{\alpha}_{A}$ , ${\rm RE}(\widehat{\alpha}_{A})$ and $W(\widehat{\alpha}_{A})$ do not depend on $\ell$ . In addition, $\alpha_{A}\cdot T_{B_{\ell}}=\mu(B_{\ell})\to 0$ is equivalent to $T_{B_{\ell}}\to 0$ . Moreover, since $T_{B_{\ell}}=p_{B_{\ell}}\cdot\mathbb{E}R_{+}$ , cf. (3.6), and $\mathbb{E}R_{+}\geq 1$ , we necessarily have $p_{B_{\ell}}\to 0$ , as $\ell$ grows. Observe that

[TABLE]

We put ${\rm RE}(\widehat{T}_{B_{\ell}})=q$ . Then the workload $W(\widehat{T}_{B_{\ell}})$ is given as in (4.2), and we see that

[TABLE]

where $\delta>0$ is as in (C.5). Now since $p_{B_{\ell}}\to 0$ , we also have

[TABLE]

and $\gamma_{\ell}^{\varepsilon}W(\widehat{T}_{B_{\ell}})\to 0$ , which applied to (C.6) finishes the proof. ∎

Bibliography

M. Amrein and H. R. Künsch. A variant of importance splitting for rare event estimation: Fixed number of successes. ACM Transactions on Modeling and Computer Simulation (TOMACS), 21(2):13, 2011.
S. Asmussen. Applied Probability and Queues, volume 51. Springer Science & Business Media, 2008.
S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57. Springer Science & Business Media, 2007.
K. Bisewski, D. Crommelin, and M. Mandjes. Simulation-based assessment of the stationary tail distribution of a stochastic differential equation. In Proceedings of the 2018 Winter Simulation Conference, pages 1742–1753, 2018.
J. M. Calvin, P. W. Glynn, and M. K. Nakayama. The semi-regenerative method of simulation output analysis. ACM Transactions on Modeling and Computer Simulation (TOMACS), 16(3):280–315, 2006.
F. Cérou and A. Guyader. Adaptive multilevel splitting for rare event analysis. Stochastic Analysis and Applications, 25(2):417–443, 2007.
S. Coles, J. Bawa, L. Trenner, and P. Dorazio. An Introduction to Statistical Modeling of Extreme Values, volume 208. Springer, 2001.
M. A. Crane and D. L. Iglehart. Simulating stable stochastic systems: III. Regenerative processes and discrete-event simulations. Operations Research, 23(1):33–45, 1975.
