Predicting Critical Transitions in Multiscale Dynamical Systems Using   Reservoir Computing

Soon Hoe Lim; Ludovico Theo Giorgini; Woosok Moon; J.S. Wettlaufer

arXiv:1908.03771·physics.comp-ph·December 14, 2020

Predicting Critical Transitions in Multiscale Dynamical Systems Using Reservoir Computing

Soon Hoe Lim, Ludovico Theo Giorgini, Woosok Moon, J.S. Wettlaufer

PDF

Open Access 2 Repos

TL;DR

This paper introduces a reservoir computing-based data-driven approach to predict rare critical transitions in multiscale nonlinear dynamical systems, demonstrating successful early predictions across various system dimensions.

Contribution

The paper presents a novel application of reservoir computing for early prediction of critical transitions in slow-fast dynamical systems, highlighting its effectiveness and limitations.

Findings

01

Predicts critical transitions several steps in advance

02

Effective across low and high dimensional systems

03

Highlights limitations in certain scenarios

Abstract

We study the problem of predicting rare critical transition events for a class of slow-fast nonlinear dynamical systems. The state of the system of interest is described by a slow process, whereas a faster process drives its evolution and induces critical transitions. By taking advantage of recent advances in reservoir computing, we present a data-driven method to predict the future evolution of the state. We show that our method is capable of predicting a critical transition event at least several numerical time steps in advance. We demonstrate the success as well as the limitations of our method using numerical experiments on three examples of systems, ranging from low dimensional to high dimensional. We discuss the mathematical and broader implications of our results.

Tables9

Table 1. Table 1 : Prediction results using the data set in Example 1 under various scenarios. N 𝑁 N is the number of data points in the training set, N v subscript 𝑁 𝑣 N_{v} is the number of data points in the validation set (see Appendix). In each figure: the target trajectory for x 𝑥 x (in red), N e f f subscript 𝑁 𝑒 𝑓 𝑓 N_{eff} predicted trajectories from N e f f subscript 𝑁 𝑒 𝑓 𝑓 N_{eff} independently trained and selected models, and the averaged predicted trajectory (in thick blue) up to several time steps into the future. The shaded light blue region represents the 90 % percent 90 90\% confidence interval of the predicted values.

Table 2. Table 2 : Error of the predicted results and its mean value (in thick blue) for Example 1.

Table 3. Table 3 : Standard deviation of the errors for Example 1. (a) N = 3700 𝑁 3700 N=3700 ; (b) N = 3690 𝑁 3690 N=3690 ; (c) N = 3680 𝑁 3680 N=3680 ; (d) N = 3670 𝑁 3670 N=3670 .

Table 4. Table 4 : Prediction results using the data set in Example 2 under various scenarios. N 𝑁 N is the number of data points in the training set, N v subscript 𝑁 𝑣 N_{v} is the number of data points in the validation set (see Appendix). In each figure: the target trajectory for x 𝑥 x (in red), N e f f subscript 𝑁 𝑒 𝑓 𝑓 N_{eff} predicted trajectories from N e f f subscript 𝑁 𝑒 𝑓 𝑓 N_{eff} independently trained and selected models, and the averaged predicted trajectory (in thick blue) up to several time steps into the future. The shaded light blue region represents the 90 % percent 90 90\% confidence interval of the predicted values.

Table 5. Table 5 : Error of the predicted results and its mean value (in thick blue) for Example 2.

Table 6. Table 6 : Standard deviation of the errors for Example 2. (a) N = 9410 𝑁 9410 N=9410 ; (b) N = 9400 𝑁 9400 N=9400 ; (c) N = 9390 𝑁 9390 N=9390 .

Table 7. Table 7 : Prediction results using the data set in Example 3 under various scenarios. N 𝑁 N is the number of data points in the training set, N v subscript 𝑁 𝑣 N_{v} is the number of data points in the validation set (see Appendix). In each figure: the target trajectory for x 𝑥 x (in red), N e f f subscript 𝑁 𝑒 𝑓 𝑓 N_{eff} predicted trajectories from N e f f subscript 𝑁 𝑒 𝑓 𝑓 N_{eff} independently trained and selected models, and the averaged predicted trajectory (in thick blue) up to several time steps into the future. The shaded light blue region represents the 90 % percent 90 90\% confidence interval of the predicted values.

Table 8. Table 8 : Error of the predicted results and its mean value (in thick blue) for Example 3.

Table 9. Table 9 : Standard deviation of the errors for Example 3. (a) N = 8233 𝑁 8233 N=8233 ; (b) N = 8231 𝑁 8231 N=8231 .

Equations54

\overset{x}{˙} (t)

\overset{x}{˙} (t)

\overset{x}{˙} (t)

\overset{x}{˙} (t)

\dot{ξ} (t)

d X (t) = F (X (t), t) d t + \tilde{σ} d W (t) .

d X (t) = F (X (t), t) d t + \tilde{σ} d W (t) .

f (t_{i}) = \frac{x ( t _{i + 1} ) - x ( t _{i} )}{Δ t} - F (x (t_{i}), t_{i}),

f (t_{i}) = \frac{x ( t _{i + 1} ) - x ( t _{i} )}{Δ t} - F (x (t_{i}), t_{i}),

x (t_{n + 1})

x (t_{n + 1})

y (t_{n + 1})

x^{(0)} (t_{n + 1})

x^{(0)} (t_{n + 1})

y (t_{n + 1})

x^{(0)} (t_{n + 1})

x^{(0)} (t_{n + 1})

x^{(l)} (t_{n + 1})

x^{(L)} (t_{n + 1})

y (t_{n + 1})

(W_{b})_{k l}

(W_{b})_{k l}

(V^{(j)})_{k l}

(W^{(i)})_{k l}

(W_{in})_{k l}

W_{b}, V^{(j)}, W^{(i)}, W_{in} \mapsto \frac{W _{b}}{∥ W _{b} ∥ + 0.001}, \frac{V ^{(j)}}{∥ V ^{(j)} ∥ + 0.001}, \frac{W ^{(i)}}{∥ W ^{(i)} ∥ + 0.001}, \frac{W _{in}}{∥ W _{in} ∥ + 0.001},

W_{b}, V^{(j)}, W^{(i)}, W_{in} \mapsto \frac{W _{b}}{∥ W _{b} ∥ + 0.001}, \frac{V ^{(j)}}{∥ V ^{(j)} ∥ + 0.001}, \frac{W ^{(i)}}{∥ W ^{(i)} ∥ + 0.001}, \frac{W _{in}}{∥ W _{in} ∥ + 0.001},

\tilde{W}^{(i)} \mapsto ρ_{d es}^{(i)} \frac{W ~ ^{(i)}}{ρ ^{(i)}},

\tilde{W}^{(i)} \mapsto ρ_{d es}^{(i)} \frac{W ~ ^{(i)}}{ρ ^{(i)}},

W_{o u t} min \frac{1}{N - i _{t r an s i e n t}} i = i_{t r an s i e n t} \sum N - 1 ∥ W_{o u t} \tilde{x} (t_{i}) - u (t_{i}) ∥^{2},

W_{o u t} min \frac{1}{N - i _{t r an s i e n t}} i = i_{t r an s i e n t} \sum N - 1 ∥ W_{o u t} \tilde{x} (t_{i}) - u (t_{i}) ∥^{2},

W_{o u t}^{o pt} = (\tilde{X}^{+} U)^{T},

W_{o u t}^{o pt} = (\tilde{X}^{+} U)^{T},

\overset{x}{˙} (t)

\overset{x}{˙} (t)

\overset{y}{˙}_{1} (t)

\overset{y}{˙}_{2} (t)

\overset{y}{˙}_{3} (t)

d X (t) = X (t) [1 - X^{2} (t))] d t + \tilde{σ} d W (t),

d X (t) = X (t) [1 - X^{2} (t))] d t + \tilde{σ} d W (t),

\overset{x}{˙} (t)

\overset{x}{˙} (t)

+ A cos (ω t) + σ_{0} z (t),

\overset{z}{˙} (t)

d X (t)

d X (t)

+ A cos (2 π t) d t + σ_{0} Z (t) d t,

d Z (t)

\overset{x}{˙} (t)

\overset{x}{˙} (t)

\dot{ξ}_{1}

\dot{ξ}_{2}

\dot{ξ}_{K}

\dot{ξ}_{n}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNeural Networks and Reservoir Computing · Advanced Thermodynamics and Statistical Mechanics · stochastic dynamics and bifurcation

Full text

Predicting Critical Transitions in Multiscale Dynamical Systems

Using Reservoir Computing

Soon Hoe Lim

Nordita, KTH Royal Institute of Technology and Stockholm University, 106 91 Stockholm, Sweden

Ludovico Theo Giorgini

Nordita, KTH Royal Institute of Technology and Stockholm University, 106 91 Stockholm, Sweden

Woosok Moon

Nordita, KTH Royal Institute of Technology and Stockholm University, 106 91 Stockholm, Sweden

Department of Mathematics, Stockholm University, 106 91 Stockholm, Sweden

J. S. Wettlaufer

Nordita, KTH Royal Institute of Technology and Stockholm University, 106 91 Stockholm, Sweden

Yale University, New Haven, Connecticut 06520, USA

Abstract

We study the problem of predicting rare critical transition events for a class of slow-fast nonlinear dynamical systems. The state of the system of interest is described by a slow process, whereas a faster process drives its evolution and induces critical transitions. By taking advantage of recent advances in reservoir computing, we present a data-driven method to predict the future evolution of the state. We show that our method is capable of predicting a critical transition event at least several numerical time steps in advance. We demonstrate the success as well as the limitations of our method using numerical experiments on three examples of systems, ranging from low dimensional to high dimensional. We discuss the mathematical and broader implications of our results.

Critical transitions are ubiquitous in nature. These transition events are often induced by a fast driving signal, and are rare and random. Since such events could lead to significant effects, it is important to develop effective methods to predict signal-induced critical transitions early. Recently many studies have been devoted to exploring early warning indicators to predict and characterize the onset of critical transitions. In this paper, we propose and test an alternative method to predict critical transitions within a class of multiscale dynamical systems. Our method is data-driven, inspired by recent advances in reservoir computing, and takes into account the multiscale nature of the systems. We demonstrate the success as well as the limitations of our method using numerical experiments on both low and high dimensional systems. We anticipate that our work will serve as catalyst for further progress in tackling the problem of predicting critical transitions using scientific machine learning.

I Introduction

The dynamics of many systems in nature are nonlinear, multiscale and noisy, making both the theoretical and numerical modeling and prediction of their states challenging. Of particular interest are those dynamics that often lead to rare transition events. Namely, the system under study spends very long periods of time in various metastable states and only very rarely, and at seemingly random times, does it hop between states. Such sudden changes in the dynamical behavior of complex systems are known as critical transitions scheffer2009early ; kuehn2011mathematical , occurring at so-called tipping points ashwin2012tipping .

One mechanism that explains the hopping behavior is that it is induced by a fast signal influencing the state of the otherwise closed system. The usually noisy driving signal comes from an external source. It is often the case that there is a large separation of time scales on which the system state and the signal evolve. Without the driving signal, the system will remain in one state forever. Understanding the dynamics of such system requires us to study the ensemble of transition paths between the different metastable states weinan2006towards ; forgoston2018primer ; hartmann2014characterization .

The above signal-induced phenomenon can be modeled by a class of non-autonomous open dynamical systems, whose state $x:[0,T]\to\mathbb{R}^{n}$ , $T>0$ , is described by:

[TABLE]

where $F(x(t),t)$ is a force field, providing a deterministic backbone, and $f(t)$ is a fast driving noisy signal, describing small perturbations imparted to the system. In principle, one does not have a priori knowledge of the mathematical model for $f(t)$ , but only has access to data for $x$ and, possibly, a mathematical model for $F(x(t),t)$ at their disposal. The unknown $f(t)$ is generally a random function, and it could be regular, chaotic or stochastic. It is challenging to infer an accurate model for $f(t)$ using only the data for $x$ . In fact, in many cases one could construct several models that are capable of describing the data equally well, and often there are ambiguities in the choice of model. An example that illustrates this issue is the problem of distinguishing deterministic chaos from stochasticity kaplan1993coarse ; agarwal2016maximal . Indeed, on one hand, when certain assumptions are satisfied, chaotic systems can be well approximated by a stochastic one beck1990brownian ; mackey2006deterministic ; chevyrev2016multiscale . On the other hand, many stochastic systems can be described by a chaotic model gaspard1998experimental . In the absence of a uniquely defined model for $f(t)$ , one has to resort to a model-free, data-driven approach, which is currently an extremely active area in applied dynamical systems. We refer the reader to the book by Brunton and Kutz brunton_kutz_2019 for an excellent review of data-driven methods.

An important and practical question concerning the system (1) is the following – can we predict the future evolution of the system state using only its data? Moreover, can we achieve this for a sufficiently long time, and to a desired level of accuracy and certainty? Note that this is different from asking one to infer a mathematical model (which may have a very low predictive power, as for example in the case of a random walk model) from the data. Typically, the data is complex and multiscale in nature, therereby complicating analysis and prediction. The main goal of this work is to propose and test a machine learning based solution to predict rare events in multiscale noisy nonlinear dynamical systems, making use of only the (slow) system state data and a partial knowledge of the physics of the generating system. Since the occurrence of rare events can have significant deleterious or positive implications, it is important to quantify and predict them in advance to inform decision making giorgini2020precursors – this is the main motivation of this paper. To the best of our knowledge this is the first attempt to solve this particular problem using a machine learning method. Here we will focus on the case where the system of interest is one-dimensional. While we focus on prediction of rare events, our method can be applied to predicting future states in general multiscale systems.

By exploiting recent advances in machine learning, we construct an algorithm to solve the prediction problem. The field of machine learning is experiencing a major recent resurgence of interest, with wide ranging applications and significant implications in many areas of science and engineering mehta2019high ; carleo2019machine . We note in particular neural networks and deep learning, whose industry wide applications have been made possible due to availability of large amount of data and advances in computer hardware development goodfellow2016deep . For instance, by training on a sufficiently large set of data, one can classify handwritten digits to unprecedented accuracy lecun1998gradient , predict and analyze time series vlachas2018data , infer the Hölder exponent of stochastic processes stone2020calibrating , characterize anomalous diffusion Bo19 , learn how to construct linear embeddings of nonlinear dynamics lusch2018deep , replicate chaotic attractors and calculate Lyapunov exponents pathak2017using , solve high-dimensional nonlinear PDEs raissi2017physics , and others that are too numerous to list here. As powerful as they appear, we emphasize that it is by no means an easy task to apply the tools of machine learning to real world data sets and one needs to proceed with caution and avoid potential pitfallsriley19 . Indeed, one of the main challenges in machine learning deals with the ability of the algorithm to be generalized to unseen data. Moreover, the empirical approach of machine learning is refined only with practices that have principally been discovered by trial and error. Our approach falls within the scope of scientific machine learning baker2019workshop , an emerging research area that is focused on the opportunities and challenges of machine learning in the context of applications across science and engineering.

This paper is organized as follows. In Section II, we motivate and describe the class of dynamical systems of interest from which the data is generated. They are special cases of (1) with a specified model for $f(t)$ . We then introduce the problem of data-based rare event prediction. In Section III, we present and explain in detail a reservoir computing based method to solve the prediction problem using a variant of the echo state network. In Section IV, we apply and test the method to predict rare events in three different systems. We make concluding remarks in Section V. The Appendix contains further details on training, validation (model selection) and testing.

II Multiscale Noisy Systems and the Prediction Problem

We consider scenarios where the available data is generated by the following family of continuous-time slow-fast systems (parametrized by $\epsilon>0$ ) Pavliotis ; rodenbeck2001dynamical ; berglund2006noise :

[TABLE]

with the initial conditions $x(0)=x_{0}$ and $\xi(0)=\xi_{0}$ , where $x_{0}$ and $\xi_{0}$ can be either fixed or random. In the case when both initial conditions are fixed, (2)-(3) describe a deterministic dynamical system; otherwise, it is a random dynamical system. In the above, $x:I\to\mathbb{R}$ is a slow process, $\xi:I\to\mathbb{R}^{m}$ is a fast process (noise), $F:\mathbb{R}\times I\to\mathbb{R}$ is a deterministic force field, $G:\mathbb{R}^{m}\to\mathbb{R}$ is a functional (observable) of the fast process, $H:\mathbb{R}^{m}\to\mathbb{R}^{m}$ is a deterministic vector field describing the fast process, $\sigma>0$ is a small constant controlling the strength of influence of the fast process on the slow one, and $I$ is a time interval on which the system evolves. Here $F,G$ and $H$ are $O(1)$ as $\epsilon\to 0$ . In many applications of interest, $F$ , $G$ and $H$ may be highly nonlinear. The assumption that the data is modeled by the above systems is not too restrictive, for as is very often the case in the analysis of experimental data, there is not a unique single model for the system generating the data.

Often one is interested in the dynamics of the slow process and the case where the driving signal, $f(t)=\frac{\sigma}{\epsilon}G(\xi(t))$ , is a stochastic process such as Gaussian white noise. A crucial property of dynamical systems of the form (2)-(3) is that stochastic behavior emerges as $\epsilon$ becomes smaller. Indeed, it can be rigorously shown that under appropriate assumptions on the initial conditions, $F$ , $G$ and $H$ , $x(t)$ converges in law (homogenizes) to $X(t)$ as $\epsilon\to 0$ for $t\in I$ , where $X(t)$ is a diffusion process solving the stochastic differential equation (SDE) melbourne2011note ; chevyrev2016multiscale :

[TABLE]

Here $\tilde{\sigma}>0$ is a constant (involving the time integral of the correlation of the fast process) and $W(t)$ is a Wiener process. Usually a mixing assumption is imposed on the fast flow. However, such an assumption is not necessary for homogenization. The system only needs to be sufficiently chaotic. See, for instance, Remark 2.1 in gottwald2013homogenization and also kelly2016smooth ; kelly2017deterministic ; chevyrev2016multiscale .

Therefore, the family of equations $\eqref{slow}$ - $\eqref{fast}$ can be viewed as approximations of a stochastic model. Such a perspective has been adopted to study a variety of noisy systems arnold2001hasselmann ; berner2017stochastic ; mitchell2012data ; hasselmann1976stochastic . We refer to Section IV for concrete examples. An important class of $\xi(t)$ are those that exhibit deterministic chaos, which has been studied and observed gaspard1998experimental in a range of physical systems. Note that $G$ can be generalized to be dependent on $x(t)$ , in which case the limiting SDE will have a multiplicative noise, but we will not pursue this case here.

We now formulate the prediction problem introduced in Section I systematically. First, we lay out our assumptions. The only data available to us is that for $x$ , which is assumed to be generated by the systems (2)-(3), and we do not have access to the data for $\xi$ . Let $I=[t_{init},t_{obs}]\cup[t_{obs},t_{f}]$ , where $t_{init}<t_{obs}<t_{f}$ . Here $t_{init}$ denotes an initial time, $t_{obs}$ denotes an observation time, beyond which we do not have access to system state, and $t_{f}$ denotes the final time at which the prediction will be made. Suppose that we are given a sufficiently long and high-frequency time series for $x$ on $[t_{init},t_{obs}]$ . Furthermore, we are blind to the actual mathematical model for the fast process. However, a partial knowledge of the physics of the system of interest is known. In particular, it is assumed that we know the exact expression describing the force field $F$ . This assumption is satisfied when one can reconstruct the force field from the data accurately, which may be possible in many practical situations garcia2018high .

We are given a time series data for $x$ , a sequence $(x_{k})_{k=0,\dots,N}=(x(t_{0}),x(t_{1}),x(t_{2}),\dots,x(t_{N}))$ , where $t_{0}=t_{init}$ , $t_{i}=t_{init}+i\Delta t$ ( $i=1,\dots,N$ ) are the sampling times, $\Delta t$ is the time step size, $t_{N}=t_{init}+N\Delta t=:t_{obs}$ , and $N+1$ is the number of available samples. Our time series does not record the occurrence of a rare event and we assume that a rare event will occur shortly after time $t_{obs}$ . Any precursors for this rare event must then be hidden in the time series. We then attempt to answer the following questions.

(1)

Can we predict if and when a rare event will occur in a given future time window? Can we infer the characteristics of the event?

(2)

How far in advance can we predict the rare event?

(3)

With what accuracy and certainty can we achieve these goals?

(4)

Is it possible to answer all of these questions with a computationally inexpensive method and/or using a relatively short time series data for $x$ ?

Clearly these are challenging questions. The degree of difficulty depends on the characteristics of the dynamical systems generating the data for $x$ . For a given amount of data the difficulty increases as $\epsilon$ becomes smaller, in which case the statistical behavior of the driving noise is closer to that of a white noise and so predictability is lost in the limit. Thus, analysis of the data should be performed on a case by case basis.

III A Machine Learning Based Prediction Method

III.1 The method

We first present a three-step procedure that will allow us to investigate the questions posed at the end of Section II. This procedure lies at the heart of our method. We then discuss a number of heuristic issues associated with our approach.

Algorithm III.1.

Predicting rare critical transition events.

Under the assumptions, setting and notation described in Section II:

(S1)

Feature extraction. Extract the fast driving signal using the data for $x$ and the known expression for $F$ :

[TABLE]

for $i=0,1,\dots,N-1$ , where $\Delta t$ is a uniform time step.

(S2)

**Machine learning. ** Using $(f(t_{i}))_{i=0,\dots,N-1}$ as the training data, predict the values of $f$ for $M+1$ time steps into the future (i.e., beyond time $t_{obs}-\Delta t$ ) using a supervised learning algorithm (for instance, the Algorithm III.2 in Section IV) that is best fit for the task to infer $(f(t_{N}),f(t_{N+1}),\dots,f(t_{N+M}))$ .

(S3)

Numerical integration. Numerically evolve the system (2) up to time $t_{N+M}$ , with $(f(t))_{t=t_{0},\dots,t_{N+M}}$ in place of $\frac{\sigma}{\epsilon}G(\xi(t))$ in (2), using the step size $\Delta t$ . The predicted values for $x$ in the time window $[t_{obs}+\Delta t,t_{f}=(N+M)\Delta t]$ are then obtained from the resulting numerical solutions.

Algorithm III.1 is the method that we propose and use for predicting rare events. It allows the prediction of the system state $M$ time steps beyond the observation time $t_{obs}$ , the result of which can be used to investigate the questions posed in Section II. To be able to answer these questions with a desired level of confidence, the predicted values should be as close as possible to the target (actual) values, i.e. the generalization (out-of-sample) error should be small.

Before we discuss the details of implementation for each of the three steps above, a few remarks are in order. A natural approach is to apply a suitable machine learning algorithm directly to the data for $x$ and attempt to predict the future states. While this seems like a sensible approach, it is unrealistic to expect an algorithm to learn the multiscale nature of the data accurately. In fact, it is not clear beforehand which algorithm is best and in many cases the rare event will not be detected successfully (see Section IV). Our method circumvents this challenge and provides an alternative route for handling multiscale data. Moreover, the simplicity of our method provides an additional advantage.

The quality and accuracy of the prediction results will rely strongly on how well each step in the algorithm is executed. Errors will accumulate as one progresses through these steps. Indeed, (S1) involves a numerical approximation of the driving signal. In (S2), errors will arise from both the use of the training data (where the numerical errors from (S1) are hidden) as well as the machine learning algorithm itself. In (S3), an additional error due to numerical integration is inevitable. Provided that the accumulation of these errors is negligibly small and well controlled, one can learn and predict the system states with reasonably good accuracy, as demonstrated with the examples in Section IV. Rigorous error analysis is not the focus of the present paper and so will not be presented here.

Steps (S1) and (S3) are straightforward to implement, so we must discuss (S2), whose implementation is the most challenging part of the method. We will formulate this step as the problem of learning the training data $(f(t_{i}))_{i=0,\dots,N-1}$ with parametrized high-dimensional nonlinear dynamical systems.

III.2 Echo state network (ESN) and its deep version

There are many machine learning algorithms that one can use to implement step (S2). Algorithms from deep learning include convolutional neural networks, recurrent neural networks (RNNs), and encoder-decoder networks, each of which can be implemented using various architectures and training schemes goodfellow2016deep . As nonlinear state space models, RNNs have dynamical memory, which means that they are capable of preserving in their internal state a nonlinear transformation of the input history. They are, therefore, particularly well suited to deal with sequential data. We will implement (S2) using a type of brain-inspired RNN known as the echo state network (ESN). For practical introductions and technical details on ESNs, we refer the reader to Refs. lukovsevivcius2012practical ; jaeger2002tutorial .

ESN belongs to the paradigm of reservoir computing tanaka2019recent and is computationally less costly to train than other variants of RNNs, which typically use a backpropagation through time algorithm for gradient descent based training jaeger2004harnessing ; lukovsevivcius2009reservoir . Similar to other RNNs, the ESN can, under fairly mild and general assumptions, be shown to be a universal approximator of arbitrary dynamical systems grigoryeva2018echo . In contrast to standard design and training schemes for RNNs, but conceptually similar to the kernel methods (c.f. lim2020understanding ; tino2020dynamical ), the neural network (called the reservoir) in ESNs is generated randomly and only the readout from the reservoir is trained. The outputs are linear combinations of the internal states and possibly the inputs and a constant (bias). This reduces the training to solving a linear regression problem, minimizing the mean squared error between the outputs and the target values.

It is crucial that typically the number of reservoir elements in an ESN is larger than that used in RNNs trained with a gradient descent based method, resulting in an over-parametrized neural network. The key observation is that with sufficient over-parametrization, gradient descent based methods will implicitly leave some weights describing the network relatively unchanged, so the optimization dynamics will behave as if those weights are essentially fixed at their initial values NIPS2019 . Fixing these weights explicitly leads to the approach of learning with random reservoir features. Therefore, we can successfully learn with gradient descent based trained neural networks whenever we can successfully learn with ESNs.

Even though not all the weights of the network are trained, it has been shown that ESNs work surprisingly well and achieve excellent performance in many benchmark tasks, including winning the NN3 financial time series prediction challenge ilies2007stepping . Recently ESNs have been shown to predict chaotic systems remarkably well pathak2017using ; Pathak18_spatiotemp ; pathak2018hybrid ; zimmermann2018observing ; lu2018attractor . They may outperform other machine learning algorithms in certain prediction tasks. For instance, it has been shown that the ESN substantially outperforms the deep feed-forward neural network and the RNN with long short-term memory (LSTM) for predicting short-term evolution of a multiscale spatio-temporal Lorenz-96 system npg-27-373-2020 (see also vlachas2020backpropagation ). Moreover, ESNs do not suffer from the vanishing and exploding gradient problem typically encountered when training other RNNs jaeger2002tutorial . These results motivate our choice of (a variant of the) ESN over other machine learning methods. We emphasize, however, that the ESN may not be the most optimal network for our prediction task and we remain mindful of its shortcomings, in particular its sensitive dependence on the hyperparameters. We leave careful comparison of different machine learning methods for our prediction problem to future work.

To achieve our goals, we also use a deep version of the ESN (see also the discussion in Section IV(b)) whenever needed. Our deep echo state network (DESN) consists of organized hierarchically stacked ESNs, whose architecture and training algorithm will be described in the following subsubsections. Such deep ESNs are more expressive than shallow ESNs, in the sense that they are able to develop in their internal states a multiple time-scale representation of the temporal information gallicchio2017deep ; gallicchio2020deep . We remark that in contrast to feed-forward neural networks, it is often not obvious how one should construct a deep RNN pascanu2013construct . In particular, different variants of deep echo state networks can be constructed, depending on the task at hand.

III.2.1 Architecture of the ESN and its deep version

Similar to other RNNs, the ESN is a parametrized, high-dimensional, discrete-time, non-autonomous, nonlinear state-space model, describing a dynamical input-output relation:

[TABLE]

for $n=0,1,\dots,N-1$ . Here $\bm{u}(t)\in\mathbb{R}^{n_{u}}$ is the input at time $t$ , $\bm{x}(t)\in\mathbb{R}^{n_{x}}$ is the internal/hidden state of the reservoir, $\bm{y}(t)\in\mathbb{R}^{n_{y}}$ is the output, $\bm{\nu}(t)\in\mathbb{R}^{n_{\nu}}$ is an external perturbation (noise/regularization), $\bm{f}$ and $\bm{g}$ are generally nonlinear functions, and $\bm{\theta}_{\bm{f}}$ and $\bm{\theta}_{\bm{g}}$ are model parameters. The network non-linearly embeds the input into a higher dimensional space where the original problem is more likely to be solved linearly.

Our ESN is a specific implementation of the above state-space model, putting constraints on a fully connected RNN and with the inputs $(\bm{u}(t_{0}),\dots,\bm{u}(t_{N-1}))$ set to be a training time series during the training phase. In the following, $t_{n}=n\Delta t$ , for $n=0,\dots,N-1$ , where $\Delta t$ is a fixed time step size. If the number of layers, $n_{L}:=L+1$ , is chosen to be one, i.e., a shallow ESN, then we have the following update equations for the states and outputs:

[TABLE]

for $n=0,1,\dots,N-1$ , with $\bm{x}(t_{0}):=\bm{x}^{(0)}(t_{0})=\bm{0}$ . In the above, we have used the following vector concatenation notation: $(\bm{a};\bm{b}):=(a_{1},a_{2},\dots,a_{n},b_{1},b_{2},\dots,b_{m})\in\mathbb{R}^{n+m}$ , for two vectors $\bm{a}=(a_{1},a_{2},\dots,a_{n})\in\mathbb{R}^{n}$ and $\bm{b}=(b_{1},b_{2},\dots,b_{m})\in\mathbb{R}^{m}$ . Also, we have used the following convention for component-wise application of the activation function: $\tanh(\bm{a}):=(\tanh(a_{1}),\dots,\tanh(a_{n}))\in\mathbb{R}^{n}$ for $\bm{a}=(a_{1},\dots,a_{n})\in\mathbb{R}^{n}$ .

Otherwise, we employ the following deep version of ESN, which we refer to as DESN:

[TABLE]

for $n=0,1,\dots,N-1$ , with the initial condition $\bm{x}(t_{0}):=(\bm{x}^{(0)}(t_{0});\dots;\bm{x}^{(L)}(t_{0}))=\bm{0}$ .

In (8)-(13), the training input $\bm{u}$ and output $\bm{y}$ come from a compact subset of $\mathbb{R}^{n_{u}}$ and of $\mathbb{R}^{n_{y}}$ respectively, the vector $\bm{x}^{(i)}\in\mathbb{R}^{n_{x_{i}}}$ (for $i=0,\dots,L$ ) is the $i$ th hidden state, the constant $b\in\mathbb{R}$ introduces bias, the matrices $\bm{W}^{(i)}\in\mathbb{R}^{n_{x_{i}}\times n_{x_{i}}}$ (for $i=0,\dots,L$ ) , $\bm{W}_{b}\in\mathbb{R}^{n_{x_{0}}}$ , $\bm{V}^{(i)}\in\mathbb{R}^{n_{x_{i}}\times n_{x_{i-1}}}$ (for $i=1,\dots,L$ ) and $\bm{W}_{in}\in\mathbb{R}^{n_{x_{L}}\times n_{u}}$ are fixed internal connection weights whose entry values are set to random values, the matrix $\bm{W}_{out}\in\mathbb{R}^{n_{y}\times(n_{x_{0}}+\dots+n_{x_{L}}+1)}$ is the readout weight matrix whose entries are to be learned. The vectors $\bm{\xi}^{(k)}_{n}$ , $k=0,\dots,L$ , $n=0,1,\dots,N-1$ , are i.i.d. random vectors describing added noise during the sampling at each layer and $\nu$ is a noise (regularization) intensity parameter (we have taken the same noise level for each layer). The activation function in each layer is taken to be a (vectorized) hyperbolic tangent function. At each training step, the input is fed into the first layer, which is then connected to the next layer via the connection weights. The ansatz for the output at each update time is taken to be a linear combination of the elements of the hidden states and a bias value.

III.2.2 Algorithm for training the ESN and its deep version

The above implementation gives a randomly constructed RNN prior to training. It may generally develop oscillatory or chaotic behavior even in the absence of external excitation by the input, and therefore the subsequent network states, starting from an arbitrary state $\bm{x}(t_{0})$ , may not converge to the zero state. To ensure that the ESN/DESN converges to the desired state, the internal connection weights are scaled such that the resulting (untrained) input-driven recurrent network (or the “dynamical reservoir”) is appropriately stabilized or “damped”, forcing it to have the so-called “echo state property". The echo state property ensures that the current network state is uniquely determined by the history of the input provided that the RNN has been run for a sufficiently long time (see jaeger2002tutorial ; manjunath2013echo and the references therein for details, subtleties, and other equivalent conditions for the echo state property). Once this initialization is made, we can proceed to the training phase and then the prediction phase, to be described in the following.

We now give a complete description of setting up, training and using the ESN and its deep version for the prediction task. It is based closely on the techniques developed in jaeger2002tutorial . For a schematic of a general ESN/DESN architecture and the training process, we refer to Figure 1.

Algorithm III.2.

Initializing, training, and using the ESN ( $L=0$ ) and DESN ( $L\geq 1$ ).

Given a training set consisting of the input sequence $(\bm{u}(t_{0}),\dots,\bm{u}(t_{N-1}))$ , find a trained ESN/DESN parametrized by $(\bm{W}^{(i)},\bm{W}_{b},\bm{V}^{(j)},\bm{W}_{in},\bm{W}_{out},\nu)_{i,j=0\dots,L}$ whose network output $(\bm{y}(t_{0}),\dots,\bm{y}(t_{N-1}))$ approximates the input sequence.

(1)

Initialize the ESN/DESN to “ensure”111Doing so will usually, but not always, ensure that the resulting network satisfies the echo state property jaeger2002tutorial . the echo state property is satisfied.

(1a)

Depending on the training data (length, difficulty of task, etc.), select appropriate dimensions/sizes (i.e., $n_{x_{0}},\dots,n_{x_{L}}$ ) for the connection weight matrices. These dimensions are hyperparameters that can be tuned. The matrix elements of these matrices are then selected randomly as follows:

[TABLE]

(1b)

These matrices are rescaled as follows:

[TABLE]

where $\|\cdot\|$ denotes the Frobenius norm.

(1c)

Set a fraction of elements (connection weights) in the matrices $\bm{W}^{(i)}$ to zero (the fraction, denoted $r_{i}$ , chosen is a hyperparameter for sparsity of the matrices used for each layer) and then rescale the resulting matrices $\tilde{\bm{W}}^{(i)}$ appropriately using their spectral radius, i.e.,

[TABLE]

where $\rho^{(i)}$ is the spectral radius of $\tilde{\bm{W}}^{(i)}$ , $\rho^{(i)}_{des}$ is the desired spectral radius, and $i=0,\dots,L$ . The desired spectral radius chosen is less than one to ensure contractivity of the dynamics and is another tunable hyperparameter. The sparsity hyperparameter is chosen in such a way that sufficiently rich dynamics of different internal units/hidden states can be obtained.

(1a)-(1c) then give an untrained network (dynamical reservoir) that satisfies (typically in practice) the required properties. We initialize the network state with $\bm{x}(t_{0})=\bm{0}$ .

(2)

**Train the readout by solving a least squares linear regression problem. **

(2a)

If needed, discard an initial transient by disregarding the first $i_{transient}=\min(\mathrm{int}(N/10),100)$ states, where $\mathrm{int}(x)$ is the integer part of $x$ and $\min(x,y)$ is the minimum of $x$ and $y$ .

(2b)

Run the network on the entire input (training) sequence and collect the output sequence, i.e. according to (8)-(9) for the shallow ESN and (10)-(13) for the DESN. During sampling we have added a small amount of noise or regularization, whose intensity is determined by the hyperparameter $\nu$ , to stabilize the network and prevent overfitting lukovsevivcius2012practical . We choose the elements of the i.i.d. noise vectors $\bm{\xi}_{n}^{(k)}$ to be distributed according to $\mathrm{Unif}(-0.5,0.5)$ .

(2c)

Solve the least squares regression problem:

[TABLE]

where $\tilde{\bm{x}}(t_{i})=(\bm{x}(t_{i});b)\in\mathbb{R}^{n_{x_{0}}+\dots+n_{x_{L}}+1}$ . We remind the reader that $\bm{W}_{out}\in\mathbb{R}^{n_{y}\times(n_{x_{0}}+\dots+n_{x_{L}}+1)}$ is the only trainable matrix in the ESN/DESN. As the above problem admits a closed form solution, the solution, denoted $\bm{W}^{opt}_{out}$ , can be obtained directly by applying the Moore-Penrose pseudo inversion as follows.

Let $\tilde{\bm{X}}=[\bm{X}\ \ \bm{b}]\in\mathbb{R}^{(N-i_{transient})\times(n_{x_{0}}+\dots+n_{x_{L}}+1)}$ , with $\bm{X}\in\mathbb{R}^{(N-i_{transient})\times(n_{x_{0}}+\dots+n_{x_{L}})}$ the design matrix, i.e. a block matrix stacked vertically with the matrices

$[\bm{x}^{(j)}(t_{i_{transient}})\ \ \ldots\ \ \bm{x}^{(j)}(t_{N-1})]^{T}\in\mathbb{R}^{(N-i_{transient})\times n_{x_{j}}}$ ( $j=0,\dots,L$ ), $\bm{b}=(b,b,\dots,b)\in\mathbb{R}^{N-i_{transient}}$ , and $\bm{U}=[\bm{u}(t_{i_{transient}})\ \ \ldots\ \ \bm{u}(t_{N-1})]^{T}\in\mathbb{R}^{(N-i_{transient})\times n_{y}}$ . Then the solution to (20) can be obtained as:

[TABLE]

where $+$ and $T$ denote the Moore-Penrose inverse and transposition respectively. This completes the training phase.

(3)

Run the trained ESN/DESN autonomously for prediction.

During the prediction phase, we use $\bm{y}(t_{m})=\bm{W}_{out}^{opt}(\bm{x}(t_{m});b)$ in place of $\bm{u}(t_{m})$ for $m\geq N-1$ in the update equation (12) for DESN (or (8) for ESN). We then propagate the trained network forward in time according to the resulting update equation. This allows us to obtain the predicted values $(\bm{y}(t_{N}),\dots,\bm{y}(t_{N+M}))$ , where $\bm{y}(t_{i})=\bm{W}_{out}^{opt}\tilde{\bm{x}}(t_{i})$ , for $i=N,\dots,N+M$ .

The smaller the generalization error on the prediction time horizon, the more accurate is the prediction. The quality of prediction achieved by our ESN/DESN depends on the selection of hyperparameters, which need to be chosen following an appropriate model selection procedure, which adapts to the data set on hand. The tunable hyperparameters are $n_{L}:=L+1$ (number of layers in the DESN), the $n_{x_{i}}$ (dimension of the connection weight matrices at layer $i+1$ ), the $r_{i}$ (sparsity parameter), the $\rho^{(i)}_{des}$ (the desired spectral radius of these matrices at layer $i+1$ ), and $\nu$ (noise intensity). We remark that the above algorithm gives a specific way to initialize and train the ESN/DESN. Other variants may also be considered lukovsevivcius2009reservoir . To implement (S2) in Algorithm III.1, we apply Algorithm III.2 to the training data $(f(t_{i}))_{i=0,\dots,N-1}$ . Together with (S1) and (S3), this completes the description of our rare event prediction method.

Lastly, we discuss a physically motivated intuition about ESNs (as can be said for DESNs). On one hand, ESNs can be constructed by sampling from a class of continuous-time dynamical systems sherstinsky2020fundamentals , which satisfy a universal property in the sense that they can approximate any continuous-time system on compact time intervals to arbitrary degree of precision. On the other hand, the data itself are generated from a nonlinear chaotic system. The ESN approach amounts to learning the data by feeding the data as input (signal) into an ESN, thereby inducing an interaction between the two dynamical systems. In our case, we would like the output of the network to reproduce the input time series, and good performance could be obtained under suitable conditions (when the hyperparameters are tuned optimally). From a physical point of view, these conditions cause the two dynamical systems to synchronize, and so learning here corresponds to finding optimal conditions to achieve generalized synchronization of the system generating the training data and a signal-fed ESN PhysRevE.99.042203 ; pecora2015synchronization . Indeed, the idea of synchronization between two dynamical systems has been exploited for time series prediction PhysRevLett.101.154102 . It remains interesting to understand how out-of-sample performance of ESNs depend on the characteristics of input data and the noise (either those present in the data or those added to regularize the training) using the notion of generalized synchronization.

IV Numerical Experiments

We apply the method presented in Section III to study the questions posed in Section II for three data sets generated by dynamical systems of different complexity.

From the data set in each example, a training set (with $N+1-N_{v}$ data points), a validation set (with $N_{v}$ data points) and a test set (with $M$ data points) are assigned before we apply the method. We are going to perform an ensemble based prediction by utilizing multiple independently trained networks. We take weighted averaged values of the predicted values produced by these networks as the final predicted values. We refer to Appendix, in particular Figure 5 there, for details of training and prediction procedure. We emphasize that while a comprehensive investigation of dependence of the results obtained here on the choice of hyperparameters is interesting on its own, this is not our focus here.

Throughout this section, all variables considered are real and one-dimensional.

IV.1 Example 1: A bi-stable system, driven by a fast Lorenz-63 system

Data generation. The data for $x$ is generated by the following slow-fast system givon2004extracting :

[TABLE]

In (22)-(25), $x$ is the state of the system of interest and its evolution is driven by a fast chaotic signal $y_{2}/\epsilon$ , which is modeled as follows. The vector state $(y_{1},y_{2},y_{3})$ is described by the Lorenz-63 model with the classical parameter values that lead to chaotic behavior lorenz1963deterministic . At these parameter values, $y_{2}$ is ergodic with invariant measure supported on a set of zero volume. The equation for $x$ is therefore an ODE driven by a fast chaotic signal with characteristic time $\epsilon^{2}$ .

To generate the data, we use a uniform time step of $\Delta t=0.01$ to integrate (22)-(25) with $\sigma=0.08$ , $\epsilon=0.5$ , $x(0)=-1.5$ , and $y_{i}(0)\sim\mathrm{Unif}(-10,10)$ for $i=1,2,3$ , up to time $t=100$ . The autocorrelation time (defined as the time $\tau$ at which the autocorrelation function $R(\tau)=1/e$ ) of samples of the driving signal is estimated to be $0.05$ . Note that the time step is chosen to be small enough so that we can sample the scale on which the fast driving signal takes place. The time series generated for $x$ is plotted in Figure 2.

We remark that repeating the above data generation using a different numerical time step, initial conditions (or random seeds) and parameters will produce time series with different characteristics. For instance, the resulting time series will display a different number of critical transitions occurring at different times. We assume that we only have access to a segment of the single time series plotted in Figure 2 up to time $t<\tau_{0}$ .

Generating system: dynamics and applications. Applying the discussion in Section II, the family of systems (22)-(25) (parametrized by $\epsilon$ ) can be viewed as an approximation to the Markovian system:

[TABLE]

where $\tilde{\sigma}$ is an effective diffusion constant and $W(t)$ is a Wiener process, in the sense that as $\epsilon$ becomes smaller $x(t)$ converges in law to the process $X(t)$ solving the SDE above. Recall that we have chosen $\epsilon=0.5$ for data generation and therefore the data can be thought of coming from an approximately stochastic system.

In the absence of the driving signal, the equation for $x$ has two stable fixed points, at $x=-1$ and $x=1$ , and an unstable one at $x=0$ . Starting from an initial state, the system will eventually evolve towards a nearby stable state. The presence of noise alters this dynamics, causing an occasional transition of the system between stable states. In the case where the noise amplitude is small, such a transition is a rare event, occurring at a seemingly unpredictable time (see Figure 2). In our case, $x$ starts near the fixed point $x=-1$ and will eventually jump to that at $x=1$ , the prediction of which is of great interest. From the data, we find that the first crossing time of the sample path to zero is $\tau_{0}:=\inf\{t\in[0,100]:x(t)>0\}=38.52$ . Our goal is then to predict an approaching rare event using only (a segment of) the data consisting of time series up to time $t<\tau_{0}$ .

From a statistical mechanical point of view, the system (22)-(25) describes an overdamped Brownian particle moving in a symmetric double-well potential. In this case, $x$ is the position of the particle and time-integrated $y_{2}$ models the fluctuations due to its interaction with the environment. The above model for $x$ is also often used in climate physics, an example of which is to view $x$ as the sea-surface temperature anomaly and $\xi=\frac{\sigma}{\epsilon}y_{2}$ as the impact of small-scale atmospheric variability hasselmann1976stochastic ; mitchell2012data ; berner2017stochastic .

Results and discussion. The prediction results for different training scenarios using Algorithm III.1 are displayed in the figures in Table 1-3. For each case where a fixed number of training data points is used, we present two prediction models, each of which depends on the size $(N_{v})$ of the validation set used (see Appendix for details).

Figure (1a)-(1b) and Figure (2a)-(2b) in Table 1 show that our method is capable of quite confidently predicting the rare event and its transition path at least 162 numerical time steps in advance, i.e., prior to first crossing of the slow process to zero at $\tau_{0}=38.52$ . Or equivalently, at least about 32 autocorrelation time of the fast driving signal. In particular, the target future trajectory lies principally within the $90$ percent confidence interval of the predicted trajectory. Given that a relatively short time series was used for training, it would appear that the accuracy of these predictions is remarkable.

Figure (3a)-(4b) show how the inability of the ESN to make accurate long-term predictions makes accurate prediction of the entire transition path (i.e., up to time $t=40$ ) from an earlier time more difficult. Indeed, the longer into the future the prediction is, the less confident the results are. This is shown by the growth of the standard deviation of the prediction errors, which seems to saturate about a maximum value that increases with fewer training data points. However, the results in Figure (3a) and Figure (4b) are still respectable, as they indicate that the trajectory will cross the origin during the prediction time window, albeit not able to trace the target transition path precisely.

These figures also depict the difference in the prediction results obtained with different values of $N_{v}$ used (notably in the figures (2a)-(4b)), showing intricacies of the machine learning based method. However, the qualitative behavior of mean value of the error of the predicted results are essentially the same for different $N_{v}$ (see the figures in Table 2). The same can be said for the standard deviation of these errors, which grows rapidly on the prediction horizon where the rare transition occurs and then approaches a plateau after that (see the figures in Table 3). This plateau is slightly higher in all the cases where $N_{v}=20$ than those where $N_{v}=10$ , indicating the higher variance of the prediction models which set aside more data points for validation. However, for the case of $N=3670$ , the model with the higher variance give better predictive performance than that with the lower variance, and is able to predict a cross-over to the origin during the prediction interval.

Figure (0a)-(0b) confirm that our method is far better than the direct method of applying an ESN to the data for $x$ , which we take as a baseline result for comparison. We emphasize that one can also apply other machine learning algorithms such as the gradient descent based RNN and convolutional neural network to implement the direct method bengio2009learning . However, after some experiments we found that the predictive performance is similar to the baseline result, i.e., they fail to predict the approaching rare event, rendering the prediction task almost impossible using the direct method. This comparison study enhances the veracity of our method, which exploits the crucial idea of appropriately taking into account the multiscale nature of the system generating the data. The success of our method in predicting the rare transition event lies in its ability to separate the slow and fast components of the data with the help of some physical knowledge about the generating system.

We remark that if we run the trained networks much further into the future, the predicted trajectory, not suprisingly, fails to capture the second transition that occurs around $t=71$ . However, we will be able to predict the second transition confidently if more data points are used for training. All the results discussed so far are obtained by training on a trajectory that starts from the initial time ( $t_{init}=0$ ) and ends at a time before a rare event occurs. A natural question is whether one can achieve prediction results of comparable quality using a shorter trajectory that comes sufficiently close to the rare event but starts at a later time $t_{init}>0$ . Given the chaoticity of the generating system with a known predictability horizon determined by the Lyapunov exponent, one expects the answer to this question is affirmative and can indeed be supported by numerical experiments.

Lastly, we emphasize that our problem is more challenging than that of predicting the second component of a Lorenz-63 system from numerical data generated by the system itself. Even though we expect the extracted driving signal here is representative of second component of the Lorenz-63 system (up to a scaling factor) and so one would expect that good prediction results are achievable (in lights of recent results pathak2017using ), there are inherent errors in the extraction process (S1) and so here we are really dealing with a noise corrupted version of the data. In this case, the prediction could be highly non-trivial (and perhaps impossible if the errors are large enough) due to possible dominance of noise in certain segments of the reconstructed time series.

IV.2 Example 2: A tri-stable system, with periodic forcing and a fast Ornstein-Uhlenbeck-like process

Data generation. The data for $x$ is generated by the following slow-fast system:

[TABLE]

where $x$ describes the state of the system of interest whose evolution is driven by a fast Ornstein-Uhlenbeck like signal $z$ , and $y_{2}$ is the second component of the Lorenz-63 system (23)-(25). For data generation, we use a uniform time step of $\Delta t=0.01$ to integrate (27)-(28) with $x(0)=0.1$ , $z(0),y_{i}(0)\sim\mathrm{Unif}(-10,10)$ (for $i=1,2,3$ ), $A=0.5$ , $\omega=2\pi$ , $\epsilon=0.5$ , $\sigma_{0}=0.2$ , $\alpha_{1}=1000$ , and $\sigma_{1}=1000\epsilon$ , up to time $t=100$ . The autocorrelation time of samples of the driving signal is estimated to be $0.07$ . The time step is small enough to sample the scale on which the fast signal takes place. The time series generated for $x$ is plotted in Figure 3. We assume that we only have access to a segment of the single time series plotted in Figure 3 up to time $t<\tau_{1}$ .

Generating system: dynamics and applications. The system (27)-(28), is more complex than that in Example 1, of which it is an extended version, in the sense that the force field is generalized to include a time-dependent external force and the driving signal is described by a higher dimensional system. One can view (27)-(28) as a family of systems (parametrized by $\epsilon$ ) approximating the following SDE system:

[TABLE]

where $\tilde{\sigma}_{1}$ is an effective diffusion constant and $W(t)$ is a Wiener process. As $\epsilon$ (chosen to be $0.5$ for data generation here) becomes smaller, $(x(t),z(t))$ converges in law to $(X(t),Z(t))$ solving the non-autonomous SDE system (29)-(30).

In contrast to Example 1, in the absence of the driving signal $z(t)$ , equation (27) for $x$ has three stable periodic orbits centered at $x=-2,0,2$ and two unstable ones centered at $x=-1,1$ . Here our system state starts in the middle potential well and will, due to influence of the driving signal as well as the periodic forcing, transits to one of the left or right nearby stable orbits at a random time (see Figure 3). It is thus natural to ask which potential well will the system transit to and at what time will the transition occur. From the data, we find that the first crossing time of the sample path to one is $\tau_{1}:=\inf\{t\in[0,100]:x(t)>1\}=96.16$ .

The addition of the periodic forcing $A\cos(2\pi t)$ introduces another time scale into the system. When this time scale is of the same order of the mean exit time from the potential (the Kramers time), a resonance-like mechanism where the noise can lead to the amplification of the periodic signal takes place. This resonance is induced by a chaotic signal, and is closely related to stochastic resonance, a noise-induced phenomenon first introduced in the context of climate modeling benzi1981mechanism ; benzi1982stochastic , which has been found to occur in many physical and biological systems gammaitoni1998stochastic .

One example of such systems describes the dynamics of an overdamped self-propelled active particle, which converts energy absorbed from the environment into a directed motion, rendering the system out of equilibrium. The position of the particle can be described by $X$ in (29) and the active force or self-propulsion is modeled by $Z$ in (30), in which case the particle is trapped in a triple-well potential and subject to periodic forcing. This is a variant of the model of an active Ornstein-Uhlenbeck process widely used to study active matter romanczuk2012active .

Results and discussion. The prediction results for different training scenarios using Algorithm III.1 are displayed in Figures 4-6. For each case where a fixed number of training data points is used, we present two prediction models, each of which depends on the size $(N_{v})$ of the validation set used (see Appendix for details).

Figure (1b) and Figure (2b) in Table 4 show that our method is capable of quite confidently predicting the rare event and its transition path at least 216 numerical time steps in advance, or at least about 30 autocorrelation time of the fast driving signal prior to first cross-over of $x$ is at $t=\tau_{1}$ . Even though the prediction results in Figure (1a) and Figure (2a) are inferior to those with the larger $N_{v}$ , they are not too far behind since the cross-over to the origin is successfully predicted there.

However, the figures (3a)-(3b) in Table 4 show how the inability of the DESN to make accurate long-term predictions again cripples the performance of the prediction method, making prediction of the rare event from an earlier time a very challenging task. In particular, neither models can confidently predict the cross-over of $x$ to the origin during the prediction interval, although the result obtained by the model with $N_{v}=4$ comes closer to achieving that. We emphasize that these results are used primarily to demonstrate our method. It may be possible to improve these results by using a different training setting (with a higher number of layers and more carefully optimizing values of the hyperparameters), enabling successful longer-term prediction and therefore prediction of a rare event from an earlier time.

These figures again depict the difference in the prediction results obtained with different values of $N_{v}$ used (see the figures in Table 5). Just like the case in Example 1, the qualitative behaviors of both the mean value and standard deviation of the error of the predicted values are essentially the same for different $N_{v}$ . However, the standard deviation plateau is higher in all the cases where $N_{v}$ is lower (see Table 6), in contrast to the finding in the case of Example 1. This indicates that larger values of $N_{v}$ should be considered here to obtain a prediction model with a lower variance.

We find that it is very difficult to obtain a comparable quality of predictions if we work with a shallow ESN, instead of the three-layered variant that we have used here. This finding may be explained by the following. A closer look at the model for the driving signal $z(t)$ reveals that there are actually two widely separated time scales in the system generating it, instead of only one time scale as in the case of Example 1. This difference in the multiscale behavior of the system generating the driving signal explains why a shallow ESN works well for Example 1 but not for Example 2. A deep version of the ESN is needed to handle the multiscale nature of the data for $z$ in Example 2. This supports the intuition that increasing the depth of the ESN can lead to a better multiple time scale representation of the temporal information. This motivates our consideration of a deep version of the ESN for our method in Section III.

The figures (0a)-(0b) in Table 4 confirm that our method is far superior to the baseline method of applying a DESN to the data for $x$ . Indeed, the ability of our method to predict, at least 216 time steps (at least 2 periods) in advance, has significant importance for many systems exhibiting stochastic resonance. It is possible to achieve prediction results of comparable quality using a shorter but sufficiently long trajectory starting at a later time $t_{init}>0$ and coming sufficiently close to the rare event (see the relevant discussion on this in Subsection IV.1).

IV.3 Example 3: A tri-stable system, subject to periodic forcing and driven by a multiscale Lorenz-96 system

Data generation. The data for $x$ is generated by the following slow-fast system:

[TABLE]

In (31)-(35), $x$ is the state of the system of interest whose evolution is driven by a fast $K$ -dimensional Lorenz-96 system whose state is denoted by the vector $(\xi_{1},\xi_{2},\dots,\xi_{K})$ . We work with $K=36$ and $F=8$ , which are parameter values leading to chaotic dynamics lorenz1996predictability . For data generation, we use a uniform time step of $\Delta t=0.01$ to integrate the above system with $x(0)=0$ , $\xi_{k}(0)=F=8$ for all $k$ except for $k=20$ , with $\xi_{20}(0)=F+0.01$ , $B=0.1$ , $\Omega=12\pi$ , $\sigma=0.4$ and $\epsilon=0.5$ , up to time $t=100$ . The autocorrelation time of samples of the driving signal is estimated to be $0.11$ . The time step has again been chosen to be small enough to sample the scale on which the fast signal takes place. The time series generated for $x$ is plotted in Figure 4. We assume that we only have access to a segment of the single time series plotted in Figure 4 up to time $t<\tau_{1}$ .

Generating system: dynamics and applications. The system described by (31)-(35) is similar to the one in Example 2 except that it is driven by a different periodic forcing and the fast driving signal comes from a single component of a variant (due to the scaling introduced by $\epsilon$ ) of the Lorenz-96 system, a much higher dimensional chaotic system compared to the ones considered in the previous examples. We remark that, due to the scaling introduced, the limiting slow dynamics are deterministic (averaging) rather than stochastic (homogenization). The critical transitions in this multiscale system (31)-(35) are induced by finite $\epsilon$ -effects which are fluctuations around some constant mean driver.

The Lorenz-96 model (with $\epsilon=1$ ) above is the first model introduced by Lorenz in lorenz1996predictability . It is a model extensively used in data assimilation and parameter estimation (parametrization) research, as well as in testing machine learning algorithms for parameter learning (sub-grid parametrization) vlachas2020backpropagation ; gmd-2019-136 . It was originally used by Lorenz as a one-dimensional atmospheric model. The variables in the model represents values of some atmospheric quantity in $K$ sectors of a latitude circle, giving a periodic system of $K$ ODEs. The basic physics of the atmosphere is captured in the right hand side of the ODEs, which contains advection terms, damping terms, and external forcing. One can therefore view $x$ as a sea-surface temperature anomaly, influenced by the atmospheric quantity in one of the sectors of the latitude circle in the presence of a seasonal forcing. From the data, we find that the first crossing time of the sample path to one is $\tau_{1}:=\inf\{t\in[0,100]:x(t)>1\}=82.39$ .

Results and discussion. Figure (1a) in Table 7 shows that our method is capable of confidently predicting the rare event and its transition path at least 6 time steps in advance, prior to first crossing of the slow process to one at $\tau_{1}=82.39$ . This is equivalent to about 0.55 autocorrelation times of the fast driving signal. The target future trajectory lies mostly within the $90$ percent confidence interval of the predicted trajectory. This result is far better than the direct method of applying an ESN to the data for $x$ (see the figures in (0a)-(0b)), which we take as the baseline results.

Figure (1b) and Figure (2a)-(2b) in Table 7 show that the prediction models are unable to anticipate the rare transition event in the prediction interval. Therefore, in comparison to the results in earlier examples, here the predictive performance of our models are far inferior. However, we still manage to produce an accurate prediction model as shown in Figure (1a). This is despite the fact that we are working with a set of data extracted only from one of the components of the Lorenz-96 system and moreover these data are noise corrupted – a highly challenging task even in the case when the data is noise-free (see vlachas2020backpropagation ). We have also attempted to use deeper versions of network for training but, unlike the case of Example 2, we are unable to obtain better predictions than those obtained by shallow networks.

Figure (1a)-(2b) in Table 7 depict the difference in the prediction results obtained with different values of $N_{v}$ used, as discussed before. Unlike the earlier examples, the qualitative behavior of the mean value and standard deviation of the error of the predicted results are not the same for different $N_{v}$ in the case when $N=8233$ (see the figures in Table 8 and Table 9). This indicates high sensitivity of the prediction models to the choice of $N_{v}$ when dealing with highly complex multiscale data. We suspect that such sensitivity is due not only to the sensitivity of the echo state network to the choice of parameters, but also due to the noise corrupted nature of the extracted data. It is interesting to observe that the successful prediction model described here produces predictions whose standard deviation of the error shows a peak on the prediction interval, whereas those for the unsuccessful ones have strictly increasing standard deviation on the interval.

V Conclusions

We have presented a data-driven method for tackling the task of predicting noise-induced critical transition events in a large class of multiscale nonlinear dynamical systems. The method was demonstrated on three examples of different complexity, each providing a model for many important physical and biological systems. For Example 1 and Example 2, we obtain excellent predictions that would not be possible by using a direct method that does not take into account the multiscale nature of the problem. In particular, our method successfully predicts a rare transition event up to several numerical time steps in advance. Each run of the ESN/DESN incurs relatively low computational cost, thanks to the use of a reservoir computing based training technique, rather than a gradient descent based method. This low computational cost allows us to leverage the power of ensemble learning for the predictions.

These results demonstrate the promise of our reservoir computing based method in predicting rare events occurring in a wide range of dynamical systems, a problem that is of substantial interest in science and engineering. We expect the accuracy of these results to improve by carefully optimizing the hyperparameters, using a more refined training method and other more sophisticated machine learning algorithms, at the expense of a higher computational cost. Importantly, rather than demonstrating only the successful results, we also highlight the limitation of the approach when dealing with highly complex multiscale data sets, such as that shown in Example 3. We also empirically show and discuss the sensitivity of the method to parameters and training data. Apart from achieving successful predictions, understanding and mitigating these issues are equally important. Therefore, our findings open up a range of interesting problems for future work.

We now discuss a few potential future directions. Thus far we have applied the method to “toy examples”, where there are few widely separated time scales. In many systems of interest there may be many widely separated time scales and the competition between them may be crucial in triggering a critical transition event. The driving noise may also be multiplicative in nature. Therefore, it is important to extend the present work to these systems. In many realistic situations, the available time series data may be multivariate, rather than univariate, and additionally there may be missing and/or uneven data. It would then be important to extend our method to treat these situations.

It is also of practical interest to apply the method presented here to study more non-trivial yet physically relevant data sets, such as those generated by a chaotic version of the model in eisenman2009nonlinear and/or real world data from climate science 2019arXiv190605433R ; Scher18 . On the other hand, because the method is based on the use of (a deep version of) echo state network, a firm theoretical understanding of the underpinnings behind such a network, in particular its initialization (e.g., the role of randomness scardapane2017randomness ) and the generalization error, will shed light on the nature of the prediction results. Therefore, it is important to carry out a systematic theoretical study to understand how the network works. This is in fact an important problem in the field of reservoir computing and vigorous efforts have been made in recent years to tackle this task (see, for instance, gonon2019risk ; hart2020embedding ; gonon2020approximation ; cuchiero2020discrete ; verzelli2020input ; verzelli2020learn ).

Data Availability

The complete codes (in Python) that reproduce all the results obtained in this paper are openly available in GitHub at:

https://github.com/shoelim/predicting-rare-critical-transitions-in-multiscale-systems

Acknowledgements.

The authors acknowledge Swedish Research Council grant no. 638-2013-9243. S.H. Lim is grateful to Stefano Bo for critical reading of the manuscript and many insightful discussions.

Appendix: Details on Training, Model Selection and Testing

We adopt the following approach for training, validation (model selection) and testing for the three examples studied in the paper.

From a given time series for the slow variable $x$ (consisting of at least $N+1+M$ data points), we assign a training set, a validation set (with a total of $N+1$ data points) and a test set (with $M$ data points). The training set and validation set are the only accessible (in-sample) data and we are not allowed to use the (out-of-sample) data in the test set, which records the occurrence of a rare event. From these accessible data, we first extract the driving signal according to (S1) in Algorithm III.1. There will be $N$ data points for the signal. We set aside the last $N_{v}$ data points of the extracted signal on the time interval $[(N-N_{v})\Delta t,(N-1)\Delta t]$ to build a validation set, which will be used for hyperparameter tuning and model selection.

For each realization of the ESN/DESN, our model selection procedure is as follows. We employ the classical static validation scheme (see cv_esn19 for details) and perform a grid search over a pre-chosen hyperparameter space, looking for hyperparameter values that minimize the root mean squared error (RMSE) on the validation set. To choose this space, we focus on optimizing over the number of reservoir elements ( $n_{x_{i}}$ ) in each layer, while fixing the desired spectral radius ( $\rho_{des}^{(i)}$ ), noise intensity ( $\nu$ ), and sparsity parameter ( $r_{i}$ ) to reasonable default values. This choice is informed by the practice that optimizing over $n_{x}$ (the memory capacity) should be prioritized over other hyperparameters lukovsevivcius2012practical . The values of all these hyperparameters are not fully optimized, i.e., an exhaustive search over the hyperparameters has not been performed. However, good average performance is typically not found in a very narrow parameter range, thus a very detailed fine-tuning of parameters does not give a significant improvement and is not necessary lukovsevivcius2012practical . An alternative model selection procedure is the more expensive automated Bayesian optimization approach proposed in yperman2016bayesian .

The optimal ESN/DESN models selected by the above procedure will depend on the choice of $N_{v}$ . The choice of $N_{v}$ should be guided by the size and characteristic of the accessible data: $N_{v}$ needs to be sufficiently large so that the validation error is a reliable estimate of out-of-sample error but should be small enough, keeping in mind the size of accessible data and the short-term forecasting capability of ESN/DESN. Different choices of $N_{v}$ will therefore give rise to different prediction models. Once an optimal model is chosen, we run the trained network to obtain predicted values for the driving signal on the time interval $[(N-N_{v})\Delta t,(N+M)\Delta t]$ (see (S2) in Algorithm III.1) and proceed to (S3) in Algorithm III.1. For numerical integration in (S3), we use a Runge-Kutta method with a uniform step size of $\Delta t=0.01$ .

Our prediction and testing procedure is as follows. We perform an ensemble-like prediction by repeating the above training and model selection procedure $N_{ens}$ times, obtaining $N_{ens}$ independently trained and optimized prediction models, whose networks are initialized using different random seeds. The idea of this ensemble algorithm is that independent training and model selection here produce different learners (committee of networks), learning on a diverse set of features. The predictions of several base models built with the same algorithm are then aggregated in order to improve robustness over a single model; there may be predictions that are completely off but these will partially cancel each other out. This increases stability, lowers the variance and optimizes the overall predictive performance. A higher (lower) $N_{ens}$ gives a more (less) confident estimator. The choice of $N_{ens}$ should not be too large to minimize computational expense in training models and to avoid diminishing returns in performance from adding more ensemble members. We choose $N_{ens}=50$ in all examples here. We have repeated the experiments with a larger $N_{ens}$ of 100 but find only slight improvement in performance.

Since some members of the ensemble will make better predictions than others, we expect to reduce the test error further if we assign greater weight to some members than to others. For this reason, we are taking weighted averaged values as the final predicted values as follows. Let $N_{eff}$ be the number of predicted trajectories that lie entirely inside a pre-chosen confidence interval (which should also cover the trajectory obtained from the validation set) around the mean predicted trajectory on the time interval defining the validation set. The weight ascribed to each predicted trajectory is set to $1/N_{eff}$ (getting a vote) if it falls entirely inside the confidence interval, and zero (getting no vote) otherwise. This filters out the outliers, leaving us with $N_{eff}\leq N_{ens}$ selected models that are expected to have reasonably good short-term forecasting capability, at least on the validation set. To compute the ensemble average in our experiments, we choose the confidence interval to be within 8 standard deviations of the mean predicted trajectory on the validation time interval (if none of the predicted trajectories falls inside this region, we increase the number of standard deviations used by one iteratively until at least one falls inside the region).

To evaluate the quality of the predictions of the $N_{eff}$ models, we study the statistics (the mean and standard deviation) of the difference between the predicted values and the target (real) values. Note that alternative measures for evaluating the predictive performance could also be considered. Our choice of measure here, in contrast with coarse-grained metrics such as the percentage of predicted trajectories that display a jump at a time close to the real one, reflects our focus on inferring the characteristics of the whole transition path. Figure 5 shows a schematic of our method.

The following are implementation details specific to each example.

•

Example 1. We apply Algorithm III.1-III.2 using a shallow ESN (i.e., $n_{L}=1$ and $L=0$ ) with $n_{u}=n_{y}=1$ , the bias value of $b=1$ , and $i_{transient}=0$ . The size of the validation set is chosen to be $N_{v}=10$ and $N_{v}=20$ . For the grid search, the range of values that we use for $n_{x_{0}}$ are from $600$ to $740$ , while $\rho^{(0)}_{des}=0.7$ , $r_{0}=0.1$ , and $\nu=0.001$ .

•

Example 2. We apply Algorithm III.1-III.2 using a three-layered DESN (i.e., $n_{L}=3$ , $L=2$ ) with $n_{u}=n_{y}=1$ , the bias value of $b=1$ , and $i_{transient}=0$ . The size of the validation set is chosen to be $N_{v}=4$ and $N_{v}=8$ . For the grid search, the range of values that we use for $n_{x_{i}}$ (for $i=0,1,2$ ) are from $100$ to $300$ , while $\rho_{des}^{(0)}=0.6$ , $\rho_{des}^{(1)}=0.7$ , $\rho_{des}^{(2)}=0.8$ , $r_{0}=r_{1}=r_{2}=0.05$ , and $\nu=0.003$ .

•

Example 3. We apply Algorithm III.1-III.2 using a shallow ESN (i.e., $n_{L}=1$ and $L=0$ ) with $n_{u}=n_{y}=1$ and the bias $b=1$ . The first 200 data points are not used (so $t_{0}=2$ ) and the first $i_{transient}=100$ hidden states of the ESN are discarded during training. The size of the validation set is chosen to be $N_{v}=5$ and $N_{v}=8$ . For the grid search, the range of values that we use for $n_{x_{0}}$ are from $500$ to $700$ , while $\rho^{(0)}_{des}=0.95$ , $r_{0}=0.1$ , and $\nu=0.003$ .

We invite interested reader to experiment with the choice of hyperparameters using the codes provided at the website in Data Availability.

References

(1)

S. Agarwal and J. S. Wettlaufer, Maximal stochastic transport in the Lorenz equations, Physics Letters A, 380 (2016), pp. 142–146.

(2)

L. Arnold, Hasselmann′s program revisited: the analysis of stochasticity in deterministic climate models, in Stochastic Climate Models, Springer, 2001, pp. 141–157.

(3)

P. Ashwin, S. Wieczorek, R. Vitolo, and P. Cox, Tipping points in open systems: bifurcation, noise-induced and rate-dependent examples in the climate system, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 370 (2012), pp. 1166–1184.

(4)

N. Baker, F. Alexander, T. Bremer, A. Hagberg, Y. Kevrekidis, H. Najm, M. Parashar, A. Patra, J. Sethian, S. Wild, et al., Workshop Report on Basic Research Needs for Scientific Machine Learning: Core Technologies for Artificial Intelligence, tech. rep., USDOE Office of Science (SC), Washington, DC (United States), 2019.

(5)

C. Beck, Brownian motion from deterministic dynamics, Physica A: Statistical Mechanics and its Applications, 169 (1990), pp. 324–336.

(6)

Y. Bengio et al., Learning deep architectures for AI, Foundations and Trends® in Machine Learning, 2 (2009), pp. 1–127.

(7)

R. Benzi, G. Parisi, A. Sutera, and A. Vulpiani, Stochastic resonance in climatic change, Tellus, 34 (1982), pp. 10–16.

(8)

R. Benzi, A. Sutera, and A. Vulpiani, The mechanism of stochastic resonance, Journal of Physics A: Mathematical and General, 14 (1981), p. L453.

(9)

N. Berglund and B. Gentz, Noise-Induced Phenomena in Slow-Fast Dynamical Systems: A Sample-Paths Approach, Springer Science & Business Media, 2006.

(10)

J. Berner, U. Achatz, L. Batté, L. Bengtsson, A. d. l. Cámara, H. M. Christensen, M. Colangeli, D. R. Coleman, D. Crommelin, S. I. Dolaptchiev, et al., Stochastic parameterization: toward a new view of weather and climate models, Bulletin of the American Meteorological Society, 98 (2017), pp. 565–588.

(11)

S. Bo, F. Schmidt, R. Eichhorn, and G. Volpe, Measurement of anomalous diffusion using recurrent neural networks, Phys. Rev. E, 100 (2019), p. 010102.

(12)

J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino, Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: a case study with the Lorenz 96 model, Geoscientific Model Development Discussions, 2019 (2019), pp. 1–21.

(13)

S. L. Brunton and J. N. Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press, 2019.

(14)

G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Machine learning and the physical sciences, Reviews of Modern Physics, 91 (2019), p. 045002.

(15)

A. Chattopadhyay, P. Hassanzadeh, and D. Subramanian, Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network, Nonlinear Processes in Geophysics, 27 (2020), pp. 373–389.

(16)

I. Chevyrev, P. K. Friz, A. Korepanov, I. Melbourne, and H. Zhang, Multiscale systems, homogenization, and rough paths, in International Conference in Honor of the 75th Birthday of SRS Varadhan, Springer, 2016, pp. 17–48.

(17)

A. B. Cohen, B. Ravoori, T. E. Murphy, and R. Roy, Using synchronization for prediction of high-dimensional chaotic dynamics, Phys. Rev. Lett., 101 (2008), p. 154102.

(18)

C. Cuchiero, L. Gonon, L. Grigoryeva, J.-P. Ortega, and J. Teichmann, Discrete-time signatures and randomness in reservoir computing, arXiv preprint arXiv:2010.14615, (2020).

(19)

W. E and E. Vanden-Eijnden, Towards a theory of transition paths, Journal of Statistical Physics, 123 (2006), p. 503.

(20)

I. Eisenman and J. S. Wettlaufer, Nonlinear threshold behavior during the loss of Arctic sea ice, Proceedings of the National Academy of Sciences, 106 (2009), pp. 28–32.

(21)

E. Forgoston and R. O. Moore, A primer on noise-induced transitions in applied dynamical systems, SIAM Review, 60 (2018), pp. 969–1009.

(22)

C. Gallicchio and A. Micheli, Deep echo state network (deepesn): a brief survey, arXiv preprint arXiv:1712.04323, (2017).

(23)

C. Gallicchio and S. Scardapane, Deep randomized neural networks, in Recent Trends in Learning From Data, Springer, 2020, pp. 43–68.

(24)

L. Gammaitoni, P. Hänggi, P. Jung, and F. Marchesoni, Stochastic resonance, Reviews of Modern Physics, 70 (1998), p. 223.

(25)

L. P. García, J. D. Pérez, G. Volpe, A. V. Arzola, and G. Volpe, High-performance reconstruction of microscopic force fields from Brownian trajectories, Nature Communications, 9 (2018), p. 5166.

(26)

P. Gaspard, M. Briggs, M. Francis, J. Sengers, R. Gammon, J. R. Dorfman, and R. Calabrese, Experimental evidence for microscopic chaos, Nature, 394 (1998), p. 865.

(27)

L. Giorgini, S. Lim, W. Moon, and J. Wettlaufer, Precursors to rare events in stochastic resonance, EPL (Europhysics Letters), 129 (2020), p. 40003.

(28)

D. Givon, R. Kupferman, and A. Stuart, Extracting macroscopic dynamics: model problems and algorithms, Nonlinearity, 17 (2004), p. R55.

(29)

L. Gonon, L. Grigoryeva, and J.-P. Ortega, Risk bounds for reservoir computing, arXiv preprint arXiv:1910.13886, (2019).

(30)

L. Gonon, L. Grigoryeva, and J.-P. Ortega, Approximation bounds for random neural networks and reservoir systems, arXiv preprint arXiv:2002.05933, (2020).

(31)

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT press, 2016.

(32)

G. A. Gottwald and I. Melbourne, Homogenization for deterministic maps and multiplicative noise, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469 (2013), p. 20130201.

(33)

L. Grigoryeva and J.-P. Ortega, Echo state networks are universal, Neural Networks, 108 (2018), pp. 495–508.

(34)

A. Hart, J. Hook, and J. Dawes, Embedding and approximation theorems for echo state networks, Neural Networks, (2020).

(35)

C. Hartmann, R. Banisch, M. Sarich, T. Badowski, and C. Schütte, Characterization of rare events in molecular dynamics, Entropy, 16 (2014), pp. 350–376.

(36)

K. Hasselmann, Stochastic climate models part I. Theory, Tellus, 28 (1976), pp. 473–485.

(37)

I. Ilies, H. Jaeger, O. Kosuchinas, M. Rincon, V. Sakenas, and N. Vaskevicius, Stepping forward through echoes of the past: forecasting with echo state networks, Short report on the winning entry to the NN3 financial forecasting competition, available online at http://www. neural-forecasting-competition. com/downloads/NN3/methods/27-NN3 Herbert Jaeger report. pdf, 53 (2007), p. 76.

(38)

H. Jaeger, Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, vol. 5, GMD-Forschungszentrum Informationstechnik Bonn, 2002.

(39)

H. Jaeger and H. Haas, Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication, Science, 304 (2004), pp. 78–80.

(40)

D. T. Kaplan and L. Glass, Coarse-grained embeddings of time series: random walks, Gaussian random processes, and deterministic chaos, Physica D: Nonlinear Phenomena, 64 (1993), pp. 431–454.

(41)

D. Kelly and I. Melbourne, Deterministic homogenization for fast–slow systems with chaotic noise, Journal of Functional Analysis, 272 (2017), pp. 4063–4102.

(42)

D. Kelly, I. Melbourne, et al., Smooth approximation of stochastic differential equations, The Annals of Probability, 44 (2016), pp. 479–520.

(43)

C. Kuehn, A mathematical framework for critical transitions: bifurcations, fast–slow systems and stochastic dynamics, Physica D: Nonlinear Phenomena, 240 (2011), pp. 1020–1035.

(44)

Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.

(45)

S. H. Lim, Understanding recurrent neural networks using nonequilibrium response theory, arXiv preprint arXiv:2006.11052, (2020).

(46)

E. N. Lorenz, Deterministic nonperiodic flow, Journal of the Atmospheric Sciences, 20 (1963), pp. 130–141.

(47)

E. N. Lorenz, Predictability: A problem partly solved, in Proc. Seminar on Predictability, vol. 1, 1996.

(48)

Z. Lu, B. R. Hunt, and E. Ott, Attractor reconstruction by machine learning, Chaos: An Interdisciplinary Journal of Nonlinear Science, 28 (2018), p. 061104.

(49)

M. Lukoševičius, A practical guide to applying echo state networks, in Neural Networks: Tricks of the Trade, Springer, 2012, pp. 659–686.

(50)

M. Lukoševičius and H. Jaeger, Reservoir computing approaches to recurrent neural network training, Computer Science Review, 3 (2009), pp. 127–149.

(51)

M. Lukoševičius and A. Uselis, Efficient cross-validation of echo state networks, in Artificial Neural Networks and Machine Learning – ICANN 2019: Workshop and Special Sessions, I. V. Tetko, V. Kůrková, P. Karpov, and F. Theis, eds., Cham, 2019, Springer International Publishing, pp. 121–133.

(52)

B. Lusch, J. N. Kutz, and S. L. Brunton, Deep learning for universal linear embeddings of nonlinear dynamics, Nature Communications, 9 (2018), p. 4950.

(53)

M. C. Mackey and M. Tyran-Kamińska, Deterministic Brownian motion: the effects of perturbing a dynamical system by a chaotic semi-dynamical system, Physics Reports, 422 (2006), pp. 167–222.

(54)

G. Manjunath and H. Jaeger, Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks, Neural Computation, 25 (2013), pp. 671–696.

(55)

P. Mehta, M. Bukov, C.-H. Wang, A. G. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, A high-bias, low-variance introduction to machine learning for physicists, Physics Reports, (2019).

(56)

I. Melbourne and A. Stuart, A note on diffusion limits of chaotic skew-product flows, Nonlinearity, 24 (2011), p. 1361.

(57)

L. Mitchell and G. A. Gottwald, Data assimilation in slow–fast systems using homogenized climate models, Journal of the Atmospheric Sciences, 69 (2012), pp. 1359–1377.

(58)

Doing so will usually, but not always, ensure that the resulting network satisfies the echo state property jaeger2002tutorial .

(59)

R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, How to construct deep recurrent neural networks, arXiv preprint arXiv:1312.6026, (2013).

(60)

J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Model-free prediction of large spatiotemporally chaotic systems from data: a reservoir computing approach, Phys. Rev. Lett., 120 (2018), p. 024102.

(61)

J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data, Chaos: An Interdisciplinary Journal of Nonlinear Science, 27 (2017), p. 121102.

(62)

J. Pathak, A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott, Hybrid forecasting of chaotic processes: using machine learning in conjunction with a knowledge-based model, Chaos: An Interdisciplinary Journal of Nonlinear Science, 28 (2018), p. 041101.

(63)

G. Pavliotis and A. Stuart, Multiscale Methods, vol. 53 of Texts in Applied Mathematics, Springer, New York, 2008.

(64)

L. M. Pecora and T. L. Carroll, Synchronization of chaotic systems, Chaos: An Interdisciplinary Journal of Nonlinear Science, 25 (2015), p. 097611.

(65)

M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part I): data-driven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561, (2017).

(66)

P. Riley, Three pitfalls to avoid in machine learning, Nature, 572 (2019), pp. 27–29.

(67)

C. Rödenbeck, C. Beck, and H. Kantz, Dynamical systems with time scale separation: averaging, stochastic modelling, and central limit theorems, in Stochastic Climate Models, Springer, 2001, pp. 189–209.

(68)

D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. Slavin Ross, N. Milojevic-Dupont, N. Jaques, and A. Waldman-Brown, Tackling climate change with machine learning, arXiv e-prints, (2019), p. arXiv:1906.05433.

(69)

P. Romanczuk, M. Bär, W. Ebeling, B. Lindner, and L. Schimansky-Geier, Active Brownian particles, The European Physical Journal Special Topics, 202 (2012), pp. 1–162.

(70)

S. Scardapane and D. Wang, Randomness in neural networks: an overview, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7 (2017), p. e1200.

(71)

M. Scheffer, J. Bascompte, W. A. Brock, V. Brovkin, S. R. Carpenter, V. Dakos, H. Held, E. H. Van Nes, M. Rietkerk, and G. Sugihara, Early-warning signals for critical transitions, Nature, 461 (2009), pp. 53–59.

(72)

S. Scher, Toward data-driven weather and climate forecasting: approximating a simple general circulation model with deep learning, Geophysical Research Letters, 45 (2018), pp. 12,616–12,622.

(73)

A. Sherstinsky, Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network, Physica D: Nonlinear Phenomena, 404 (2020), p. 132306.

(74)

H. Stone, Calibrating rough volatility models: a convolutional neural network approach, Quantitative Finance, 20 (2020), pp. 379–392.

(75)

G. Tanaka, T. Yamane, J.B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, Recent advances in physical reservoir computing: A review, Neural Networks, 115 (2019), pp. 100–123.

(76)

P. Tino, Dynamical systems as temporal feature spaces., Journal of Machine Learning Research, 21 (2020), pp. 1–42.

(77)

P. Verzelli, C. Alippi, and L. Livi, Learn to synchronize, synchronize to learn, arXiv preprint arXiv:2010.02860, (2020).

(78)

P. Verzelli, C. Alippi, L. Livi, and P. Tino, Input representation in recurrent neural networks dynamics, arXiv preprint arXiv:2003.10585, (2020).

(79)

P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos, Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474 (2018), p. 20170844.

(80)

P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics, Neural Networks, (2020).

(81)

T. Weng, H. Yang, C. Gu, J. Zhang, and M. Small, Synchronization of chaotic systems and their machine-learning models, Phys. Rev. E, 99 (2019), p. 042203.

(82)

G. Yehudai and O. Shamir, On the power and limitations of random features for understanding neural networks, in Advances in Neural Information Processing Systems 32, 2019, pp. 6594–6604.

(83)

J. Yperman and T. Becker, Bayesian optimization of hyper-parameters in reservoir computing, arXiv preprint arXiv:1611.05193, (2016).

(84)

R. S. Zimmermann and U. Parlitz, Observing spatio-temporal dynamics of excitable media using reservoir computing, Chaos: An Interdisciplinary Journal of Nonlinear Science, 28 (2018), p. 043118.

Bibliography84

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1) S. Agarwal and J. S. Wettlaufer , Maximal stochastic transport in the Lorenz equations , Physics Letters A, 380 (2016), pp. 142–146.
2(2) L. Arnold , Hasselmann ′ s program revisited: the analysis of stochasticity in deterministic climate models , in Stochastic Climate Models, Springer, 2001, pp. 141–157.
3(3) P. Ashwin, S. Wieczorek, R. Vitolo, and P. Cox , Tipping points in open systems: bifurcation, noise-induced and rate-dependent examples in the climate system , Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 370 (2012), pp. 1166–1184.
4(4) N. Baker, F. Alexander, T. Bremer, A. Hagberg, Y. Kevrekidis, H. Najm, M. Parashar, A. Patra, J. Sethian, S. Wild, et al. , Workshop Report on Basic Research Needs for Scientific Machine Learning: Core Technologies for Artificial Intelligence , tech. rep., USDOE Office of Science (SC), Washington, DC (United States), 2019.
5(5) C. Beck , Brownian motion from deterministic dynamics , Physica A: Statistical Mechanics and its Applications, 169 (1990), pp. 324–336.
6(6) Y. Bengio et al. , Learning deep architectures for AI , Foundations and Trends® in Machine Learning, 2 (2009), pp. 1–127.
7(7) R. Benzi, G. Parisi, A. Sutera, and A. Vulpiani , Stochastic resonance in climatic change , Tellus, 34 (1982), pp. 10–16.
8(8) R. Benzi, A. Sutera, and A. Vulpiani , The mechanism of stochastic resonance , Journal of Physics A: Mathematical and General, 14 (1981), p. L 453.

Scenarios	$N_{v} = 10$	$N_{v} = 20$
$N = 3700$	(1a)	(1b)
$N = 3690$	(2a)	(2b)
$N = 3680$	(3a)	(3b)
$N = 3670$	(4a)	(4b)

Scenarios	$N_{v} = 5$	$N_{v} = 8$
$N = 8233$	(1a)	(1b)
$N = 8231$	(2a)	(2b)

Scenarios	$N_{v} = 10$	$N_{v} = 20$
Baseline	(0a)	(0b)
$N = 3700$	(1a)	(1b)
$N = 3690$	(2a)	(2b)
$N = 3680$	(3a)	(3b)
$N = 3670$	(4a)	(4b)

(a)	(b)
(c)	(d)

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Taxonomy

Predicting Critical Transitions in Multiscale Dynamical Systems

Abstract

I Introduction

II Multiscale Noisy Systems and the Prediction Problem

III A Machine Learning Based Prediction Method

III.1 The method

Algorithm III.1**.**

III.2 Echo state network (ESN) and its deep version

III.2.1 Architecture of the ESN and its deep version

III.2.2 Algorithm for training the ESN and its deep version

Algorithm III.2**.**

IV Numerical Experiments

IV.1 Example 1: A bi-stable system, driven by a fast Lorenz-63 system

IV.2 Example 2: A tri-stable system, with periodic forcing and a fast Ornstein-Uhlenbeck-like process

IV.3 Example 3: A tri-stable system, subject to periodic forcing and driven by a multiscale Lorenz-96 system

V Conclusions

Data Availability

Acknowledgements.

Appendix: Details on Training, Model Selection and Testing

References

Algorithm III.1.

Algorithm III.2.