Data-driven prediction of a multi-scale Lorenz 96 chaotic system using   deep learning methods: Reservoir computing, ANN, and RNN-LSTM

Ashesh Chattopadhyay; Pedram Hassanzadeh; Devika Subramanian

arXiv:1906.08829·cs.LG·July 7, 2020

Data-driven prediction of a multi-scale Lorenz 96 chaotic system using deep learning methods: Reservoir computing, ANN, and RNN-LSTM

Ashesh Chattopadhyay, Pedram Hassanzadeh, Devika Subramanian

PDF

4 Repos

TL;DR

This study compares deep learning methods for predicting a multi-scale Lorenz 96 chaotic system, finding reservoir computing (RC-ESN) significantly outperforms ANN and RNN-LSTM in short-term forecasting and reproducing long-term statistics.

Contribution

It demonstrates the superior performance of reservoir computing over traditional neural networks in modeling complex chaotic systems with limited observable variables.

Findings

01

RC-ESN accurately forecasts chaotic trajectories for hundreds of time steps.

02

RNN-LSTM outperforms ANN in prediction accuracy.

03

Predicted data from RC-ESN and RNN-LSTM closely match true probability distributions.

Abstract

In this paper, the performance of three deep learning methods for predicting short-term evolution and for reproducing the long-term statistics of a multi-scale spatio-temporal Lorenz 96 system is examined. The methods are: echo state network (a type of reservoir computing, RC-ESN), deep feed-forward artificial neural network (ANN), and recurrent neural network with long short-term memory (RNN-LSTM). This Lorenz 96 system has three tiers of nonlinearly interacting variables representing slow/large-scale ( $X$ ), intermediate ( $Y$ ), and fast/small-scale ( $Z$ ) processes. For training or testing, only $X$ is available; $Y$ and $Z$ are never known or used. We show that RC-ESN substantially outperforms ANN and RNN-LSTM for short-term prediction, e.g., accurately forecasting the chaotic trajectories for hundreds of numerical solver's time steps, equivalent to several Lyapunov timescales. The…

Equations25

\frac{d X _{k}}{d t} = X_{k - 1} (X_{k + 1} - X_{k - 2}) + F - \frac{h c}{b} Σ_{j} Y_{j, k}

\frac{d X _{k}}{d t} = X_{k - 1} (X_{k + 1} - X_{k - 2}) + F - \frac{h c}{b} Σ_{j} Y_{j, k}

\frac{d Y _{j, k}}{d t} = - c b Y_{j + 1, k} (Y_{j + 2, k} - Y_{j - 1, k}) - c Y_{j, k} + \frac{h c}{b} X_{k} - \frac{h e}{d} Σ_{i} Z_{i, j, k}

\frac{d Z _{i, j, k}}{d t} = e d Z_{i - 1, j, k} (Z_{i + 1, j, k} - Z_{i - 2, j, k}) - g e Z_{i, j, k} + \frac{h e}{d} Y_{j, k}

r (t + Δ t) = tanh (A r (t) + W_{in} X (t)),

r (t + Δ t) = tanh (A r (t) + W_{in} X (t)),

W_{out} = ar g W_{out} min ∣∣ W_{out} \tilde{r} (t) - X (t) ∣∣ + α ∣∣ W_{out} ∣∣.

v (t + Δ t) = W_{out} \tilde{r} (t + Δ t),

v (t + Δ t) = W_{out} \tilde{r} (t + Δ t),

X (t + Δ t) = v (t + Δ t) .

\displaystyle X(t+\Delta t)=X(t)+\frac{1}{2}\big{[}3\Delta X(t)-\Delta X(t-\Delta t)\big{]}.

\displaystyle X(t+\Delta t)=X(t)+\frac{1}{2}\big{[}3\Delta X(t)-\Delta X(t-\Delta t)\big{]}.

e (t) = [\frac{∣∣ X _{t r u e} ( t ) - X _{p r e d} ( t ) ∣∣}{⟨ ∣∣ X _{t r u e} ( t ) ∣∣ ⟩}] .

e (t) = [\frac{∣∣ X _{t r u e} ( t ) - X _{p r e d} ( t ) ∣∣}{⟨ ∣∣ X _{t r u e} ( t ) ∣∣ ⟩}] .

E = \frac{1}{100Δ t} i = 0 \sum i = 100 e (i Δ t) .

E = \frac{1}{100Δ t} i = 0 \sum i = 100 e (i Δ t) .

g^{f} (t) = σ_{f} (W_{f} [h (t - 1), I (t)] + b_{f}),

g^{f} (t) = σ_{f} (W_{f} [h (t - 1), I (t)] + b_{f}),

g^{i} (t) = σ_{f} (W_{i} [h (t - 1), I (t)] + b_{i}),

\tilde{C} (t) = tanh (W_{c} [h (t - 1), I (t)] + b_{h}),

C (t) = g^{f} (t) C (t - 1) + g^{i} (t) \tilde{C} (t),

g^{o} (t) = σ_{h} (W_{h} [h (t - 1), I (t)] + b_{h}),

h (t) = g^{o} (t) tanh (C (t)),

Φ (t) = W_{oh} h (t),

X (t + Δ t) \approx Φ (t) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Data-driven prediction of a multi-scale Lorenz 96 chaotic system using deep learning methods: Reservoir computing, ANN, and RNN-LSTM

Ashesh Chattopadhyay

Department of Mechanical Engineering, Rice University

Pedram Hassanzadeh

Department of Mechanical Engineering, Rice University

Department of Earth Environmental and Planetary Sciences, Rice University

Devika Subramanian

Department of Electrical and Computer Engineering, Rice University

Department of Computer Science, Rice University

Abstract

In this paper, the performance of three deep learning methods for predicting short-term evolution and for reproducing the long-term statistics of a multi-scale spatio-temporal Lorenz 96 system is examined. The methods are: echo state network (a type of reservoir computing, RC-ESN), deep feed-forward artificial neural network (ANN), and recurrent neural network with long short-term memory (RNN-LSTM). This Lorenz 96 system has three tiers of nonlinearly interacting variables representing slow/large-scale ( $X$ ), intermediate ( $Y$ ), and fast/small-scale ( $Z$ ) processes. For training or testing, only $X$ is available; $Y$ and $Z$ are never known or used. We show that RC-ESN substantially outperforms ANN and RNN-LSTM for short-term prediction, e.g., accurately forecasting the chaotic trajectories for hundreds of numerical solver’s time steps, equivalent to several Lyapunov timescales. The RNN-LSTM and ANN show some prediction skills as well; RNN-LSTM bests ANN. Furthermore, even after losing the trajectory, data predicted by RC-ESN and RNN-LSTM have probability density functions (PDFs) that closely match the true PDF, even at the tails. The PDF of the data predicted using ANN, however, deviates from the true PDF. Implications, caveats, and applications to data-driven and data-assisted surrogate modeling of complex nonlinear dynamical systems such as weather/climate are discussed.

1 Introduction

Various components of the Earth system involve multi-scale, multi-physics processes and high-dimensional chaotic dynamics. These processes are often modeled using sets of strongly-coupled nonlinear partial differential equations (PDEs), which are solved numerically on supercomputers. As we aim to simulate such systems with increasing levels of fidelity, we need to increase the numerical resolutions and/or incorporate more physical processes from a wide range of spatio-temporal scales into the models. For example, in atmospheric modeling for predicting the weather and climate systems, we need to account for the nonlinear interactions across the scales of cloud microphysics processes, gravity waves, convection, baroclinic waves, synoptic eddies, and large-scale circulation, just to name a few, not to mention the fast/slow processes involved in feedbacks from the ocean, land, and cryosphere [1, 2, 3, 4, 5].

Solving the coupled systems of PDEs representing such multi-scale processes is computationally prohibitive for many practical applications. As a result, to make the simulations tractable, a few strategies have been developed, which mainly involve only solving for slow/large-scale variables and accounting for the fast/small-scale processes through surrogate models. In weather/climate models, for example, semi-empirical physics-based parameterizations are often used to represent the effects of processes such as gravity waves and moist convection in the atmosphere or sub-mesoscale eddies in the ocean [6, 7, 8, 5, 9]. A more advanced approach is “super-parameterization”, which for example, involves solving the PDEs of moist convection on a high-resolution grid inside each grid point of large-scale atmospheric circulation, whose governing PDEs (the Navier-Stokes equations) are solved on a coarse grid [10]. While computationally more expensive, super-parameterized climate models have been shown to outperform parameterized models in simulating some aspects of climate variability and extremes [11, 12, 13].

More recently, “inexact computing” has received attention from the weather/climate community. This approach involves reducing the computational cost of each simulation by decreasing the precision of some of the less-critical calculations [14, 15], thus allowing the saved resources to be reinvested, for example, in more simulations (e.g., for probabilistic predictions) and/or higher resolutions for critical processes [16, 17, 18, 19]. One type of inexact computing involves solving the PDEs of some of the processes with single- or half-precision arithmetic, which requires less computing power and memory compared to the conventional double-precision calculations [14, 20, 21].

In the past few years, the rapid algorithmic advances in artificial intelligence (AI), and in particular data-driven modeling, have been explored for improving the simulation and prediction of nonlinear dynamical systems [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. One appeal of the data-driven approach is that fast/accurate data-driven surrogate models that are trained on data from high-fidelity, computationally demanding simulations can be used to accelerate and/or improve the prediction and simulation of complex dynamical systems. Furthermore, for some poorly understood processes for which observational data are available (e.g., clouds), data-driven surrogate models built using such data might potentially outperform physics-based surrogate models [22, 30]. Recent studies have shown promising results in using AI to build data-driven parameterizations for modeling of some atmospheric and oceanic processes [33, 34, 35, 36, 37, 38, 39, 40]. In the turbulence and dynamical systems communities, similarly encouraging outcomes have been reported [41, 42, 43, 44, 45, 46, 26, 47, 48, 49] .

The objective of the current study is to make progress toward addressing the following question: Which AI-based data-driven technique(s) can best predict the spatio-temporal evolution of a multi-scale chaotic system, when only the slow/large-scale variables are known (during training) and are of interest (to predict during testing)? We emphasize that unlike many other studies, our focus is not on reproducing long-term statistics of the underlying dynamical system (although that will be examined too), but on predicting the short-term trajectory from a given initial condition. Furthermore, we emphasize that, following the problem formulation introduced by [38], the system’s state-vector is only partially known and the fast/small-scale variables are unknown even during training.

Our objective is more clearly demonstrated using the canonical chaotic system that we will use as a testbed for the data-driven methods: A multi-scale Lorenz 96 system:

[TABLE]

This set of coupled nonlinear ODEs is a 3-tier extension of Lorenz’s original model [50] and has been proposed by Thornes et al. [18] as a fitting prototype for multi-scale chaotic variability of the weather/climate system and a useful testbed for novel methods. In these equations, $F=20$ is a large-scale forcing that makes the system highly chaotic; $b=c=e=d=g=10$ and $h=1$ are tuned to produce appropriate spatio-temporal variability in the three variables (see below). The indices $i,j,k={1,2,\dots 8}$ , thus $X$ has $8$ elements while $Y$ and $Z$ have $64$ and $512$ elements, respectively. Figure 1 shows examples of the chaotic temporal evolution of $X$ , $Y$ , and $Z$ obtained from directly solving Eqs. (1)-(3). The examples demonstrate that $X$ has large amplitudes and slow variability; $Y$ has relatively small amplitudes, high-frequency variability, and intermittency; and $Z$ has small amplitudes and high-frequency variability. In the context of atmospheric circulation, the slow variable $X$ can represent the low-frequency variability of the large-scale circulation while the intermediate variable $Y$ and fast variable $Z$ can represent synoptic/baroclinic eddies and small-scale convection, respectively. Similar analogies in the context of ocean circulation and various other natural or engineering systems can be found, making this multi-scale Lorenz 96 system a useful prototype to focus on.

Our objective is to predict the spatio-temporal evolution of $X(t)$ using a data-driven model that is trained on past observations of $X(t)$ (Figure 2). In the conventional approach of solving Eqs. (1)-(3) numerically, the governing equations have to be known, initial conditions for $Y(t)$ and $Z(t)$ have to be available, and the numerical resolution is dictated by the fast/small-scale variable $Z$ , leading to high computational costs. In the fully data-driven approach that is our objective here, the governing equations do not have to be known, $Y(t)$ and $Z(t)$ do not have to be observed at any time, and evolution of $X(t)$ is predicted just from knowledge of the past observations of $X$ , leading to low computational costs. To successfully achieve this objective, a data-driven method should be able to

Accurately predict the evolution of a chaotic system along a trajectory for some time, 2. 2.

Account for the effects of $Y(t)$ and $Z(t)$ on the evolution of $X(t)$ .

Inspired by several recent studies (that are discussed below), we have focused on evaluating the performance of three deep learning techniques in accomplishing (1) and (2). These data-driven methods are

•

RC-ESN: Echo state network (ESN), a specialized type of recurrent neural network (RNN), which belongs to the family of reservoir computing (RC),

•

ANN: A deep feed-forward artificial neural network,

•

RNN-LSTM: An RNN with long short-term memory (LSTM).

We have focused on these three methods because they have either shown promising performance in past studies (RC-ESN and ANN), or they are considered state of the art in learning from sequential data (RNN-LSTM). There is a growing number of studies focused on using deep learning techniques for data-driven modeling of chaotic and turbulent systems, for example to improve weather/climate modeling and prediction. Some of these studies have been referenced above. Below, we briefly describe three sets of studies with closest relevance to our objective and approach. Pathak and co-workers [43, 51, 52] have recently shown promising results with RC-ESN for predicting short-term spatio-temporal evolution (item 1) and in replicating attractors for the Lorenz 63 and Kuramoto-Sivashinsky equations. The objective of our paper (items 1-2) and the employed multi-scale Lorenz 96 system are identical to that of [38], who reported their ANN to have some skills in data-driven prediction of the spatio-temporal evolution of $X$ . Finally, Vlachas et al. [45] found an RNN-LSTM to have skills (though limited) for predicting the short-term spatio-temporal evolution (item 1) of a number of chaotic toy models such as the original Lorenz 96 system.

Here we aim to build on these pioneering studies and examine, side by side, the performance of RC-ESN, ANN, and RNN-LSTM in achieving (1) and (2) for the chaotic multi-scale Lorenz 96 system. We emphasize the need for such a side-by-side comparison that uses the exact same system as the testbed and metrics for assessing performance.

The structure of the paper is as follows: in section 2, the multi-scale Lorenz 96 system and the three deep learning methods are discussed; results on how these methods predict the short-term spatio-temporal evolution of $X$ and reproduce the long-term statistics of $X$ are presented in section 3; key findings and future work are discussed in section 4.

2 Materials and Methods

2.1 The multi-scale Lorenz 96 system

2.1.1 Numerical solution

We have used a 4th-order Runge-Kutta solver with time step $\Delta t=0.005$ to solve the system of Eqs. (1)-(3). The system has been integrated for $100$ million time steps to generate a large dataset for training, testing, and examining the robustness of the results. In the Results section, we often report time in terms of model time units (MTUs), where 1 MTU $=200\Delta t$ . In terms of Lyapunov timescale, 1 MTU in this system is $\approx 4.5/\lambda_{max}$ [18], where $\lambda_{max}$ is the largest positive Lyapunov exponent. In terms of the $e$ -folding decorrelation timescale ( $\tau$ ) of the leading principal component (PC1) of $X$ [53], we estimate 1 MTU $\approx 6.9\tau_{\mathrm{PC1}}$ . Finally, as discussed in Thornes et al. [18], 1 MTU in this system corresponds to $\approx 3.75$ Earth days in the real atmosphere.

2.1.2 Training and testing datasets

To build the training and testing datasets from the numerical solution, we have sampled $X\in\Re^{8}$ at every $\Delta t$ and then standardized the data by subtracting the mean and dividing by the standard deviation. We then construct a training set consisting of $N$ sequential samples and a testing consisting of the next $2000$ sequential samples from point $N+1$ to $N+2000$ . We have randomly chosen $100$ such training/testing sets, each of length $(N+2000)\Delta t$ starting from a random point and separated from the next training/testing set by at least $2000\Delta t$ . There is never any overlap between the training and the testing sequences in any of the $100$ training/testing sets.

2.2 Reservoir computing-echo state network (RC-ESN)

2.2.1 Architecture

The RC-ESN [54, 55] is an RNN that has a reservoir with $D$ neurons, which are sparsely connected in an Erdős-Rényi graph configuration (see Figure 3). The connectivity of the reservoir neurons is represented by the adjacency matrix $\mathbf{A}$ of size $D\times D$ whose values are drawn from a uniform random distribution on the interval $\left[-1,1\right]$ . The state of the reservoir, representing the activations of its constituent neurons, is a vector $r(t)\in\Re^{D}$ . The typical reservoir size used in this study is $D=5000$ (but we have experimented with $D$ as large as $20000$ , as discussed later). The other two components of the RC-ESN are an input-to-reservoir layer with weight matrix $\mathbf{W_{in}}$ , and a reservoir-to-output layer with weight matrix $\mathbf{W_{out}}$ . The inputs for training are $N$ sequential samples of $X(t)\in\Re^{8}$ . At the beginning of the training phase, $\mathbf{A}$ and $\mathbf{W_{in}}$ are chosen randomly and are fixed; i.e., they do not change during training or testing. During training, only the weights of the output-to-reservoir layer ( $\mathbf{W_{out}}$ ) are updated.

$\mathbf{W_{out}}$ is the only trainable matrix in this network. This architecture yields a simple training process that has two key advantages:

•

It does not suffer from the vanishing and the exploding gradient problem, which has been a major difficulty in training RNNs, especially before the advent of LSTMs [56],

•

$\mathbf{W_{out}}$ can be computed in one shot (see below), thus this algorithm is orders of magnitude faster than the backpropagation through time (BPTT) algorithm [57], which is used for training general RNNs and RNN-LSTMs (see section 2.4).

The equations governing the RC-ESN training process are as follows:

[TABLE]

Here $||\cdot||$ is the $L_{2}$ -norm of a vector and $\alpha$ is the $L_{2}$ regularization (ridge regression) constant. Equation (4) maps the observable $X(t)\in\Re^{8}$ to the higher dimensional reservoir’s state $r(t+\Delta t)\in\Re^{D}$ (reminder: $D$ is $O(1000)$ for this problem). Note that Eq. (5) contains $\tilde{r}(t)$ rather than $r(t)$ . As observed by Pathak et al. [43] for the Kuramoto-Sivashinsky equation, and observed by us in this work, the columns of the matrix $\tilde{\mathbf{r}}$ should be chosen as nonlinear combinations of the columns of the reservoir state matrix ${\mathbf{r}}$ (each row of the matrix $\mathbf{r}$ is a snapshot $r(t)$ of size $D$ ). For example, following Pathank et al. [43], we compute $\tilde{\mathbf{r}}$ as having the same even columns as ${\mathbf{r}}$ while its odd columns are the square of the odd columns of ${\mathbf{r}}$ (algorithm $T_{1}$ hereafter). As shown in Appendix A, we have found that a nonlinear transformation (between ${\mathbf{r}}$ and $\tilde{\mathbf{r}}$ ) is essential for skillful predictions, while several other transformation algorithms yield similar results as $T_{1}$ . The choices of these transformations ( $T_{2}$ and $T_{3}$ , see Appendix A), although not based on a rigorous mathematical analysis, are inspired from the nature of the quadratic nonlinearity that is present in Eqs. (1)-(3).

The prediction process is governed by:

[TABLE]

$\tilde{r}(t+\Delta t)$ in Eq. (6) is computed by applying one of the $T_{1}$ , $T_{2}$ or $T_{3}$ algorithms on $r(t+\Delta t)$ , which itself is calculated via Eq. (4) from $X(t)$ that is either known from initial condition or has been previously predicted.

See [54, 55, 58, 59] and references therein for further discussions on RC-ESNs, and [52, 51, 42, 43, 60, 61, 62, 49, 31] for examples of recent applications to dynamical systems.

2.2.2 Training and Prediction

During training ( $-T\leq t\leq 0$ ), $\mathbf{W_{in}}$ and $\mathbf{A}$ are initialized with random numbers, which stay fixed during the training (and testing) process, The state matrix $\mathbf{r}$ is initialized to [math]. During initialization, we ensure that the spectral radius of $\mathbf{A}$ is less than unity by first dividing the matrix $\mathbf{A}$ with its largest eigenvalue $\Lambda$ and further multiplying it by a scalar ( $\rho\leq 1$ ). Then the state matrix $\mathbf{r}$ is computed using Eq. (4) for the training set, and $\mathbf{W_{out}}$ is computed using Eq. (5). During prediction (i.e., testing) corresponding to $t>0$ , the computed $\mathbf{W_{out}}$ is used to march $v(t)$ (and thus $X(t)$ ) forward in time (Eqs. (6)-(7)) while as mentioned earlier, $r(t)$ keeps getting updated using the predicted $X(t)$ (Eq. (4)). A non-linear transformation is used to compute $\tilde{r}(t)$ from $r(t)$ before using Eq. (6).

Our RC-ESN architecture and training/prediction procedure are similar to the ones used in [43]. There is no overfitting in the training phase because the final training and testing accuracies are the same. Our code is developed in Python and is made publicly available (see Code and data availability).

2.3 Feed-forward artificial neural network (ANN)

2.3.1 Architecture

We have developed a deep ANN that has the same architecture as the one used in Duben and Bauer [38] (which they employed to conduct prediction on the same multi-scale Lorenz 96 studied here). The ANN has $4$ hidden layers with $100$ neurons each, and $8$ neurons in the input and output layers (Figure 4). It must be highlighted that unlike RC-ESN (and RNN-LSTM), this ANN is stateless, i.e., there is no hidden variable such as $r(t)$ that tracks temporal evolution. Furthermore, unlike the RC-ESN, for which the input and output are $X(t)$ and $X(t+\Delta t)$ , following Duben and Bauer [38], the ANN’s inputs/outputs are chosen to be pairs of $X(t)$ and $\Delta X(t)=X(t+\Delta t)-X(t)$ (Figure 4). During prediction, using $X(t)$ that is known from initial condition or previously calculated, $\Delta X(t)$ is predicted. $X(t+\Delta t)$ is then computed via Adams-Bashforth integration scheme

[TABLE]

More details on the ANN architecture has been reported in Appendix B.

2.3.2 Training and Prediction

Training is performed on $N$ pairs of sequential $(X(t),\Delta X(t))$ from the training set. During training, the weights of the network are computed using backpropagation optimized by the stochastic gradient descent algorithm. During prediction, as mentioned above, $\Delta X(t)$ is predicted from $X(t)$ that is known from initial condition or has been previously predicted, and $X(t+\Delta t)$ is then calculated using Eq. (8).

Our ANN architecture and training/prediction procedure are similar to the ones used in Duben and Bauer [38] (we have optimized, by trial and error, the hyperparameters for this particular network). There is no overfitting in the training phase because the final training and testing accuracies are the same. Our code is developed in Keras and is made publicly available (see Code and data availability).

2.4 Recurrent neural network with long short-term memory (RNN-LSTM)

2.4.1 Architecture

The RNN-LSTM [63] is a deep learning algorithm most suited for prediction of sequential data such as time series, and has received a lot of attention in recent years [57]. Variants of RNN-LSTMs are the best performing models for time series modeling in areas such as stock pricing [64], supply chain [65], natural language processing [66], and speech recognition [67]. A major improvement over regular RNNs, which have issues with the vanishing and the exploding gradients [56], LSTMs have become the state-of-the-art approach for training RNNs in the deep learning community. Unlike regular RNNs, the RNN-LSTMs have gates that control the information flow into the neural network from previous time steps of the time series. The RNN-LSTM used here, like the RC-ESN but unlike the ANN, is stateful (i.e., actively maintains state). Our RNN-LSTM has $50$ hidden layers in each RNN-LSTM cell. More details on our RNN-LSTM are presented in C. RNN-LSTMs have many tunable parameters and are trained with the expensive BPTT algorithm [57].

2.4.2 Training and Prediction

The input to the RNN-LSTM is a time-delay-embedded matrix of $X(t)$ with embedding dimension $q$ (also known as lookback) [68]. An extensive hyperparameter optimization (by trial and error) is performed to find the optimal value of $q$ for which the network has the largest prediction horizon (exploroing $q=1-22$ , we found $q=3$ to yield the best performance). The RNN-LSTM predicts $X(t+\Delta t)$ from the previous $q$ time steps of $X(t)$ . This is in contrast with RC-ESN, which only uses $X(t)$ and reservoir state $r(t)$ to predict $X(t+\Delta t)$ , and ANN, which only uses $X(t)$ and no state to predict $X(t+\Delta t)$ (via predicting $\Delta X(t)$ ). The weights of the LSTM layers are determined during the training process (see C). During testing, $X(t+\Delta t)$ is predicted using the past $q$ observables $\left[X(t-(q-1)\Delta t)\dots X(t-\Delta t),X(t)\right]$ that are either known from initial condition or have been previously predicted. We have found the best results with a stateless LSTM (which means that the state of the LSTM gets refreshed during the beginning of each batch during training, see C) that outperforms a stateful LSTM (where the states gets carried over to the next batch during training).

The architecture of our RNN-LSTM is similar to the one used in Vlachas et al. [45]. There is no overfitting in the training phase because the final training and testing accuracies are the same. Our code is developed in Keras and is made publicly available (see Code and data availability).

3 Results

3.1 Short-term prediction: Comparison of the RC-ESN, ANN, and RNN-LSTM performances

The short-term prediction skills of the three deep learning methods for the same training/testing sets are compared in Figure 5. Given the chaotic nature of the system, the performance of the methods depends on the initial condition from which the prediction is conducted. To give the readers a comprehensive view of the performance of these methods, Figures 5(A) and (B) show examples of the predicted trajectories (for one element of $X(t)$ ) versus the true trajectory for two specific initial conditions (from the $100$ initial conditions we used): the one for which RC-ESN shows the best performance (Figure 5(A)), and the one for which RC-ESN shows the worst performance (Figure 5(B)).

As seen in Figure 5(A), RC-ESN accurately predicts the time series for over $2.3$ MTU, which is equivalent to $460\Delta t$ and over $10.35$ Lyapunov timescales. Closer examination shows that the RC-RSN prediction follows the true trajectory well even up to $\approx 4$ MTU. The RNN-LSTM has the next best prediction performance (up to around $0.9$ MTU or $180\Delta t$ ). The prediction from ANN is for around $0.6$ MTU or $120\Delta t$ . For the example in Figure 5(B), all methods have shorter prediction horizons, but RC-ESN still has the best performance (accurate prediction up to $\approx 0.7$ MTU), followed by ANN and RNN-LSTM with similar prediction accuracies ( $\approx 0.3$ MTU).

Searching through all $100$ initial conditions, the best prediction with RNN-LSTM is up to $\approx 1.7$ MTU (equivalent to $340\Delta t$ ), and the best prediction with ANN is up to $\approx 1.2$ MTU (equivalent to $240\Delta t$ ),

To compare the results over all $100$ randomly chosen initial conditions, we have defined an averaged relative $L_{2}$ error between the true and predicted trajectories

[TABLE]

Here $\left[\cdot\right]$ and $\langle\cdot\rangle$ indicate, respectively, averaging over $100$ initial conditions and over $2000\Delta t$ . To be clear, $X_{true}(t)$ refers to the data at time $t$ obtained from the numerical solution while $X_{pred}(t)$ refers to the predicted value at $t$ using one of the deep learning methods. Figure 5(C) compares $e(t)$ for the three methods. It is clear that RC-ESN significantly outperforms ANN and RNN-LSTM for short-term prediction in this multi-scale chaotic testbed. Figure 6 shows an example of the spatio-temporal evolution of $X_{pred}$ (from RC-ESN), $X_{true}$ , and their difference, which further demonstrates the capabilities of RC-ESN for short-term spatio-temporal prediction.

3.2 Short-term prediction: Scaling of RC-ESN and ANN performance with training size $N$

How the performance of deep learning techniques scales with the size of the training set is of significant practical importance as the amount of available data is often limited in many problems. Given that currently there is no theoretical understanding of such scaling for these deep learning techniques, we have empirically examined how the quality of short-term predictions scales with $N$ . We have conducted the scaling analysis for $N=10^{4}$ to $N=2\times 10^{6}$ for the three methods. Two metrics for the quality of prediction are used: the prediction horizon, defined as when the averaged $L_{2}$ error $e(t)$ reaches $0.3$ , and the prediction error $E$ , defined as the average of $e(t)$ between $0-0.5$ MTU:

[TABLE]

Figure 7(A) shows that for all methods, the prediction horizon increases monotonically but nonlinearly as we increase $N$ . The prediction horizons of RC-ESN and ANN appear to saturate after $N=10^{6}$ , although the RC-ESN has a more complex step-like scaling curve that needs further examination in future studies. The prediction horizon of RC-ESN exceeds that of ANN by factors ranging from $3$ (for high $N$ ) to $9$ (for low $N$ ). In the case of RNN-LSTM, the factor ranges from $1.2$ (for high $N$ ) to $2$ (for low $N$ ). Figure 7(B) shows that for all methods, the average error $E$ decreases as $N$ increases (as expected), most notably for ANN when $N$ is small.

Overall, compared to both RNN-LSTM and ANN, the prediction horizon and accuracy of RC-ESN have a weaker dependence on the size of the training set, which is a significant advantage for RC-ESN when the dataset available for training is short, which is common in many practical problems.

3.3 Short-term prediction: Scaling of RC-ESN performance with reservoir size $D$

Given the superior performance of RC-ESN for short-term prediction, here we focus on one concern with this method: the need for large reservoirs, which can be computationally demanding. This issue has been suggested as a potential disadvantage of ESNs versus LSTMs for training RNNs [55]. Aside from the observations here that RC-ESN significantly outperforms RNN-LSTM for short-term predictions, the problem of reservoir size can be tackled in at least two ways. First, Pathak et al. [43] have proposed, and shown the effectiveness of, using a set of parallelized reservoirs, which allowed for easily dealing with high-dimensional chaotic toy models.

Second, Figure 8 shows that $E$ rapidly declines by a factor of around $3$ as $D$ is increased from $500$ to $5000$ , decreases slightly as $D$ is further doubled to $10000$ , and then barely changes as $D$ is doubled again to $20000$ . Training the RC-ESN with $D=20000$ versus $D=5000$ comes with a higher computational cost ( $16$ GB memory and $\approx 6$ CPU hours for $D=5000$ and $64$ GB memory and $\approx 18$ CPU hours for $D=20000$ ), while little improvement in accuracy is gained. Thus, concepts from inexact computing can be used to choose $D$ such that precision is traded for large savings in computational resources, which can be then reinvested into more simulations, higher resolutions for critical processes etc. [14, 15, 17, 21, 18].

3.4 Long-term statistics: Comparison of RC-ESN, ANN, and RNN-LSTM performance

All the data-driven predictions discussed earlier eventually diverge from the true trajectory (as would even predictions using the numerical solver). Still, it is interesting to examine whether the freely predicted spatio-temporal data have the same long-term statistical properties as the actual dynamical system (i.e., Eqs. (1)-(3)). Reproducing the actual dynamical system’s long-term statistics (sometimes refer to as the system’s climate) using a data-driven method can be significantly useful. In some problems, a surrogate model does not need to predict the evolution of a specific trajectory, but only the long-term statistics of a system. Furthermore, synthetic long datasets produced using an inexpensive data-driven method (trained on a shorter real dataset) can be used to examine the system’s probability density functions (PDFs), including its tails, which are important for studying the statistics of rare/extreme events.

By examining return maps, Pathak et al. [51] have already shown that RC-ESNs can reproduce the long-term statistics of the Lorenz 63 and Kuramoto-Sivashinsky equations (Jaegeret al. [54] and Pathak et al. [51] have also shown that RC-ESNs can be used to accurately estimate a chaotic system’s Lyapunov spectrum). Here, we focus on comparing the performance of RC-ESN, ANN, and RNN-LSTM in reproducing the system’s long-term statistics by

•

Examining the estimated PDFs and in particular their tails,

•

Investigating whether the quality of the estimated PDFs degrades with time, which can negate the usefulness of long synthetic datasets.

Figure 9 compares the estimated PDFs obtained using the three deep learning methods. The data predicted using RC-ESN and RNN-LSTM are found to have PDFs closely matching the true PDF, even at the tails. Deviations at the tails of the PDF predicted by these methods from the true PDF are comparable to the deviations of the PDFs obtained from true data using the same number of samples. The ANN-predicted data have reasonable PDFs between $\pm 2$ standard deviations, but the tails have substantial deviations from those of the true PDFs. All predicted PDFs are robust and do not differ much (except near the end of the tails) among the quartiles.

The results show that RC-ESN and RNN-LSTM can accurately reproduce the system’s long-term statistics, and can robustly produce long synthetic datasets with PDFs that are close to the PDF of the true data even near the tails. The ability of ANN to accomplish these tasks is limited.

4 Discussion

By examining the true and predicted trajectories (Figures 5-6) and the prediction errors and horizons (Figure 7), we have shown that RC-ESN substantially outperforms ANN and RNN-LSTM in predicting the short-term evolution of a multi-scale Lorenz 96 chaotic system (Eqs. (1)-(3)). Additionally, RC-ESN and RNN-LSTM both work well in reproducing the long-term statistics of this system (Figure 9). We emphasize that following the problem formulation of Duben and Bauer et al. [38], and unlike most other studies, only part of the multi-scale state-vector (the slow/large-scale variable $X$ ) has been available for training the data-driven model and has been of interest for testing. This problem design is more relevant to many practical problems but is more challenging as it requires the data-driven model to not only predict the evolution of $X$ based on its past observations, but also to account for the effects of the intermediate and fast/small-scale variables $Y$ and $Z$ .

We have also found that the prediction horizon, and in particular the prediction accuracy, of RC-ESN to have a weak dependence on the size of the training set (Figure 7). This is an important property, as in many practical problems the data available for training are limited. Furthermore, the prediction error of RC-ESN is shown to have an asymptotic behavior for large reservoir sizes, which suggests that reasonable accuracy can be achieved with moderate reservoir sizes. Note that the skillful predictions with RC-ESN in Figures 5-6 have been obtained with a moderate-sized training set ( $N=500000$ ) and reservoir ( $D=5000$ ). Figures 7 and 8 suggest that slightly better results could have been obtained using larger $N$ and $D$ , although such improvements come with a higher computational cost during training.

The order we have found for the performance of the three deep learning methods (RC-ESN $>$ RNN-LSTM $>$ ANN) is different from the order of complexity of these methods in terms of the number of trainable parameters (RC-ESN $<$ ANN $<$ RNN-LSTM) but aligned with the order in terms of the learnable function space (ANN $<$ RC-ESN and RNN-LSTM, as the latter methods are both RNNs and thus Turning complete, while ANN is a state-less feed-forward network and thus not Turing complete [70]). Whether the superior predictive performance of RC-ESN (especially over RNN-LSTM) is due to its explicit state representation updated at every time step both during training and testing, or lower likelihood of overfitting due to the order-of-magnitude smaller number of trainable parameters, remains to be seen. Also whether we have not fully harnessed the power of RNN-LSTMs (see below) is unclear at this point, particularly because a complete theoretical understanding of how/why these methods work (or do not work) is currently lacking. That said, there have been some progress in understanding the RC-ESNs, in particular by modeling the network itself as a dynamical system [71, 59]. Such efforts, for example those aimed at understanding the echo states that are learned in the RC-ESN’s reservoir, might benefit from recent advances in dynamical systems theory [72, 73, 74, 75, 76, 77].

An important next step in our work is to determine how generalizable our findings are. This investigation is important for the following two reasons. First, here we have only studied one system, a specially designed version of the Lorenz 96 system. The performance of these methods should be examined in a hierarchy of chaotic dynamical systems and high-dimensional turbulent flows. That said, our findings are overall consistent with the recently reported performance of these methods applied to chaotic toy models. [43, 51, 52] demonstrated that RC-ESN can predict, for several Lyapunov timescales, the spatio-temporal evolution, and can reproduce the climate of the Lorenz 63 and Kuramoto-Sivashinsky chaotic systems. Here, we have shown that RC-ESN performs similarly well even when only the slow/large-scale component of the multi-scale state-vector is known. Our results with ANN are consistent with those of Dueben and Bauer [38], who showed examples of trajectories predicted accurately up to $1$ MTU with a large training set $N=2\times 10^{6}$ using ANN for the same Lorenz 96 system (see their Fig. 1). While RNN-LSTM is considered state-of-the-art for sequential data modeling, and has worked well for a number of applications involving time series, to the best of our knowledge, simple RNN-LSTMs, such as the one used here, have not been very successful when applied to chaotic dynamical systems. Vlachas et al. [45] found some prediction skills using RNN-LSTM applied to the original Lorenz 96 system (which does not have the multi-scale coupling); see their Fig. 5.

Second, we have only used a simple RNN-LSTM; there are other variants of this architecture as well as more advanced deep learning RNNs that might potentially yield better results. For our simple RNN-LSTM, we have extensively explored optimization of hyperparameters and tried variant formulations of the problem: predict $\Delta X(t)$ with or without updating the state of the LSTM. We have found that such variants do not lead to better results, consistent with the findings of Vlachas et al. [45]. Our preliminary explorations with more advanced variants of RNN-LSTM (seq2seq and encoder-decoder LSTM [78]) have not resulted in any improvement either. However, just as the skillful predictions of RC-ESN and ANN hinge on one key step (nonlinear transformation for the former and predicting $\Delta X$ for the latter), it is possible that changing/adding one step leads to major improvements in RNN-LSTM (it is worth mentioning that similar nonlinear transformation of the state in LSTMs is not straightforward due to their more complex architecture; see Appendix C). We have shared our codes publicly to help others explore such changes to our RNN-LSTM. Furthermore, there are other more sophisticated RNNs that we have not explored. For example [79] have introduced a tensor-train RNN that outperformed a simple RNN-LSTM in predicting the temporal evolution of the Lorenz 63 chaotic system. Mohan et al.. [46] showed that a compressed convolutional LSTM performs well in capturing the long-term statistics of a turbulent flow. The performance of more advanced deep learning RNNs should be examined and compared, side by side, with the performance of RC-ESNs and ANNs in future studies. The multi-scale Lorenz 96 system provides a suitable starting point for such comparisons.

The results of our work show the promise of deep learning methods such as RC-ESNs (and to some extent RNN-LSTM) for data-driven modeling of chaotic dynamical systems, which has broad applications in geosciences, e.g., in weather/climate modeling. Practical and fundamental issues such as interpretability, scalability to higher dimensional systems [43], presence of measurement noise in the training data and initial conditions [44], non-stationarity of the time series, and dealing with data that have two or three spatial dimensions (e.g., through integration with convolutional neural networks, CNN-LSTM [80] and CNN-ESN [81] should be studied in future work.

Here we have focused on a fully data-driven approach, as opposed to the conventional approach of direct numerical solutions (Figure 2). In practice, for example for large-scale, multi-physics, multi-scale dynamical systems such as weather and climate models, it is likely that a hybrid framework yields the best performance: depending on the application and the spatio-temporal scales of the physical processes involved [18, 82], some of the equations could be solved numerically with double precision, some could be solved numerically with lower precisions, and some could be approximated with a surrogate model learned via a data-driven approach, such as the ones studied in this paper.

Appendix A More details on RC-ESN

Here we show a comparison of RC-ESN forecast skills with nonlinear transformation algorithms $T_{1}$ [43], $T_{2}$ and $T_{3}$ , and without any transformation between $r(t)$ and $\tilde{r}(t)$ . The three algorithms are (for $i=1,2,3\dots N$ and $j=1,2,3\dots D$ )

Algorithm $T_{1}$

$\tilde{r}_{i,j}=r_{i,j}\times r_{i,j}$ if $j$ is odd

$\tilde{r}_{i,j}=r_{i,j}$ if $j$ is even

Algorithm $T_{2}$

$\tilde{r}_{i,j}=r_{i,j-1}\times r_{i,j-2}$ if $j>1$ is odd

$\tilde{r}_{i,j}=r_{i,j}$ if $j$ is $1$ or even

Algorithm $T_{3}$

$\tilde{r}_{i,j}=r_{i,j-1}\times r_{i,j+1}$ if $j>1$ is odd

$\tilde{r}_{i,j}=r_{i,j}$ if $j$ is $1$ or even

Figure 10(A) shows an example of short-term predictions from an initial condition using $T_{1}$ , $T_{2}$ , $T_{3}$ , and no transformation, everything else kept the same. It is clear that the nonlinear transformation is essential for skillful predictions, as the prediction obtained without transformation diverges from the truth before $0.25$ MTU. The three nonlinear transformation algorithms yield similar results, with accurate predictions for more than $2$ MTU. Figure 10(B), which shows the relative prediction error averaged over $100$ initial conditions, further confirms this point.

Why the nonlinear transformation is needed, and the best/optimal choice for the transformation (if it exists) should be studied in future work. We highlight that the nonlinear transformation resembles basis function expansion, which is commonly used to capture nonlinearity in linear regression models [83].

Appendix B More details on ANN

The ANN used in this study (and the one that yields the best performance) is similar to that of Duben and Bauer et al. [38]. The ANN has four hidden layers each of which has $100$ neurons with a $\tanh$ activation function. The input to the ANN has $8$ neurons which takes $X(t)$ as the input and outputs $\Delta X(t)$ which is also a $8$ -neuron layer. The obtained $\Delta X(t)$ output is then used in an Adam-Bashforth integration scheme to obtain $X(t+\Delta t)$ . We found that a simpler Euler scheme yields similar results. However, we found that training/testing on pairs of $X(t)$ and $X(t+\Delta t$ ) (that was used for RC-ESN) leads to no prediction skill with ANN, and that following the procedure used in Dueben and Bauer [38] is essential for skillful predictions. We speculate that this might be due to the stateless nature of the ANN. By relating $X(t)$ to the change in $X(t)$ at the next time step, rather than the raw value $X(t+\Delta t)$ , this ANN training architecture implicitly contains a reference to the previous time step. While the ANN itself is stateless, this particular training approach essentially encodes a first-order temporal dependence between successive states.

Note that our approach here is the same as the global ANN of Duben and Bauer et al. [38]. We also tried their local ANN approach, but consistent with their findings for the Lorenz system, found better performance with the global approach (results not reported for brevity).

The ANN is trained with a stochastic gradient descent algorithm with a learning rate of $0.001$ with a batch size of $100$ and mean absolute error as loss function (mean squared error also gives similar performance).

Appendix C More details on RNN-LSTM

The governing equations for RNN-LSTM are:

[TABLE]

$\sigma_{f}$ is the softmax activation function; $g^{f}(t)$ , $g^{i}(t)$ , and $g^{o}(t)$ $\in\Re^{d_{h}\times\left(d_{h}+d_{i}\right)}$ are the forget gate, input gate, and output gate respectively. $d_{h}$ is the dimension of the hidden layers (chosen as $50$ in our study) and $d_{i}$ is the dimension of the input ( $8\times q$ ). $b_{f}$ , $b_{i}$ , and $b_{h}$ are the biases in the forget gate, input gate, and the hidden layers. $I(t)\in\Re^{d_{i}}$ is the input, which is a column of a time-delay-embedded matrix of $X(t)$ . This matrix has the dimension of $(8\times q)\times N$ . $h(t)\in\Re^{d_{h}}$ is the hidden state and $C(t)\in\Re^{d_{h}}$ is the cell state (the states track the temporal evolution). The weights $\mathbf{W_{o}}$ , $\mathbf{W_{i}}$ , $\mathbf{W_{f}}$ , $\mathbf{W_{c}}$ , and $\mathbf{W_{oh}}$ are learned through the BPTT algorithm [57] using an ADAM optimizer [84]. $\Phi(t)$ is the output from the RNN-LSTM.

The LSTM used in this study is a stateless LSTM, wherein the two hidden states ( $C$ and $h$ ) are refreshed at the beginning of each batch during training (this is not the same as the stateless ANN, where there is no hidden state at all). Here, the training sequences in each batch are shuffled randomly, leading to an unbiased gradient estimator in the stochastic gradient descent algorithm [85].

**Code and data availability:**All data and codes can be accessed at https://github.com/ashesh6810/RCESN_spatio_temporal.

Acknowledgment: This work was supported by NASA grant 80NSSC17K0266 and an Early-Career Research Fellowship from the Gulf Research Program of the National Academies of Science, Engineering, and Medicine (to P.H.). A.C. thanks the Rice University Ken Kennedy Institute for a BP HPC Graduate Fellowship. Computational resources on Bridge Large, Bridge GPU, and Stampede2 clusters were provided by the NSF XSEDE allocation ATM170020. We are grateful to Krishna Palem for many stimulating discussions and helpful comments. We thank Michelle Girvan, Reza Malek-Madani, Rohan Mukherjee, Guido Novati, and Pantelis Vlachas for insightful discussions, and Jaideep Pathak for sharing his Matlab ESN code for the Kuramoto-Sivashinsky system. We are grateful to Mingchao Jiang and Adam Subel for help with running some of the simulations.

Bibliography85

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] W.D. Collins, P.J. Rasch, B.A. Boville, J.J. Hack, J.R. Mc Caa, D.L. Williamson, B.P. Briegleb, C.M. Bitz, S.J. Lin, and M. Zhang. The formulation and atmospheric simulation of the community atmosphere model version 3 (CAM 3). Journal of Climate , 19(11):2144–2161, 2006.
2[2] W.J. Collins et al. Development and evaluation of an Earth-System model–Had GEM 2. Geoscientific Model Development , 4(4):1051–1075, 2011.
3[3] G.M. Flato. Earth system models: an overview. Wiley Interdisciplinary Reviews: Climate Change , 2(6):783–800, 2011.
4[4] P. Bauer, A. Thorpe, and G. Brunet. The quiet revolution of numerical weather prediction. Nature , 525(7567):47, 2015.
5[5] N. Jeevanjee, P. Hassanzadeh, S. Hill, and A. Sheshadri. A perspective on climate model hierarchies. Journal of Advances in Modeling Earth Systems , 9(4):1760–1771, 2017.
6[6] B. Stevens and S. Bony. What are climate models missing? Science , 340(6136):1053–1054, 2013.
7[7] F. Hourdin, Thorsten M., Andrew G., J.C. Golaz, V. Balaji, Q. Duan, D. Folini, D. Ji, D. Klocke, Y. Qian, et al. The art and science of climate model tuning. Bulletin of the American Meteorological Society , 98(3):589–602, 2017.
8[8] R.R. Garcia, A.K. Smith, D.E. Kinnison, Á. la Cámara, and D.J. Murphy. Modification of the gravity wave parameterization in the Whole Atmosphere Community Climate Model: Motivation and results. Journal of the Atmospheric Sciences , 74(1):275–291, 2017.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Data-driven prediction of a multi-scale Lorenz 96 chaotic system using deep learning methods: Reservoir computing, ANN, and RNN-LSTM

Abstract

1 Introduction

2 Materials and Methods

2.1 The multi-scale Lorenz 96 system

2.1.1 Numerical solution

2.1.2 Training and testing datasets

2.2 Reservoir computing-echo state network (RC-ESN)

2.2.1 Architecture

2.2.2 Training and Prediction

2.3 Feed-forward artificial neural network (ANN)

2.3.1 Architecture

2.3.2 Training and Prediction

2.4 Recurrent neural network with long short-term memory (RNN-LSTM)

2.4.1 Architecture

2.4.2 Training and Prediction

3 Results

3.1 Short-term prediction: Comparison of the RC-ESN, ANN, and RNN-LSTM performances

3.2 Short-term prediction: Scaling of RC-ESN and ANN performance with training size NNN

3.3 Short-term prediction: Scaling of RC-ESN performance with reservoir size DDD

3.4 Long-term statistics: Comparison of RC-ESN, ANN, and RNN-LSTM performance

4 Discussion

Appendix A More details on RC-ESN

Appendix B More details on ANN

Appendix C More details on RNN-LSTM

3.2 Short-term prediction: Scaling of RC-ESN and ANN performance with training size $N$

3.3 Short-term prediction: Scaling of RC-ESN performance with reservoir size $D$