Time-series learning of latent-space dynamics for reduced-order model   closure

Romit Maulik; Arvind Mohan; Bethany Lusch; Sandeep Madireddy; Prasanna; Balaprakash; Daniel Livescu

arXiv:1906.07815·physics.comp-ph·February 26, 2020

Time-series learning of latent-space dynamics for reduced-order model closure

Romit Maulik, Arvind Mohan, Bethany Lusch, Sandeep Madireddy, Prasanna, Balaprakash, Daniel Livescu

PDF

TL;DR

This paper compares LSTM and NODE neural networks for learning latent-space dynamics in reduced-order models of the viscous Burgers equation, showing their effectiveness in system closure without intrusive methods.

Contribution

It demonstrates that LSTMs and NODEs can effectively learn system closure in reduced-order models, outperforming traditional Galerkin projection methods.

Findings

01

LSTMs and NODEs accurately reproduce unresolved scale effects.

02

Time-series learning techniques implicitly leverage memory kernels.

03

Neural methods outperform intrusive Galerkin projection in test cases.

Abstract

We study the performance of long short-term memory networks (LSTMs) and neural ordinary differential equations (NODEs) in learning latent-space representations of dynamical equations for an advection-dominated problem given by the viscous Burgers equation. Our formulation is devised in a non-intrusive manner with an equation-free evolution of dynamics in a reduced space with the latter being obtained through a proper orthogonal decomposition. In addition, we leverage the sequential nature of learning for both LSTMs and NODEs to demonstrate their capability for closure in systems which are not completely resolved in the reduced space. We assess our hypothesis for two advection-dominated problems given by the viscous Burgers equation. It is observed that both LSTMs and NODEs are able to reproduce the effects of the absent scales for our test cases more effectively than intrusive dynamics…

Tables2

Table 1. Table 1: Search range for LSTM hyperparameters and their optimal values deployed for the Burgers’ turbulence test case.

Hyperparameter	Type	Starting value	Ending value	Optimal
Sequence size	Integer	5	30	30
Neurons	Integer	5	100	73
Learning rate	Real	0.0001	0.1	0.0005
Momentum	Real	0.99	0.999	0.9988
Epochs	Integer	100	1000	317
Batch Size	Integer	5	30	8

Table 2. Table 2: Search range for NODE hyperparameters and their optimal values deployed for the Burgers’ turbulence test case.

Hyperparameter	Type	Starting value	Ending value	Optimal
Sequence size	Integer	5	30	5
Neurons	Integer	10	100	82
Learning rate	Real	0.0001	0.1	0.0074
Momentum	Real	0.99	0.999	0.9983
Epochs	Integer	200	1200	546
Batch Size	Integer	5	30	21

Equations80

\overset{u}{˙} (x, t, ν) + N [u (x, t, ν)] + L [u (x, t, ν); ν] = 0, (x, t, ν) \in Ω \times T \times P,

\overset{u}{˙} (x, t, ν) + N [u (x, t, ν)] + L [u (x, t, ν); ν] = 0, (x, t, ν) \in Ω \times T \times P,

\dot{u_{h}} (t, ν) + N_{h} [u_{h} (t, ν)] + L_{h} [u_{h} (t, ν); ν] = 0 (t, ν) \in T \times P,

\dot{u_{h}} (t, ν) + N_{h} [u_{h} (t, ν)] + L_{h} [u_{h} (t, ν); ν] = 0 (t, ν) \in T \times P,

\overset{u}{˙} + u \frac{\partial u}{\partial x} = ν \frac{\partial ^{2} u}{\partial x ^{2}}, u (x, 0) = u_{0}, x \in [0, L], u (0, t) = u (L, t) = 0.

\overset{u}{˙} + u \frac{\partial u}{\partial x} = ν \frac{\partial ^{2} u}{\partial x ^{2}}, u (x, 0) = u_{0}, x \in [0, L], u (0, t) = u (L, t) = 0.

X^{f} = span {ϑ^{1}, \dots, ϑ^{f}},

X^{f} = span {ϑ^{1}, \dots, ϑ^{f}},

\displaystyle\mathbf{S}=[\begin{array}[]{c|c|c|c}{\hat{\mathbf{u}}^{1}_{h}}&{\hat{\mathbf{u}}^{2}_{h}}&{\cdots}&{\hat{\mathbf{u}}^{N_{s}}_{h}}\end{array}]\in\mathbb{R}^{N_{h}\times N_{s}},

\displaystyle\mathbf{S}=[\begin{array}[]{c|c|c|c}{\hat{\mathbf{u}}^{1}_{h}}&{\hat{\mathbf{u}}^{2}_{h}}&{\cdots}&{\hat{\mathbf{u}}^{N_{s}}_{h}}\end{array}]\in\mathbb{R}^{N_{h}\times N_{s}},

\hat{u}_{h}^{i} = u_{h}^{i} - \overset{ˉ}{u}_{h}, \overset{ˉ}{u}_{h} = \frac{1}{N _{s}} i = 1 \sum N_{s} u_{h}^{i} .

\hat{u}_{h}^{i} = u_{h}^{i} - \overset{ˉ}{u}_{h}, \overset{ˉ}{u}_{h} = \frac{1}{N _{s}} i = 1 \sum N_{s} u_{h}^{i} .

CW = Λ W, C = S^{T} S \in R^{N_{s} \times N_{s}},

CW = Λ W, C = S^{T} S \in R^{N_{s} \times N_{s}},

ϑ = SW \in R^{N_{h} \times N_{s}} .

ϑ = SW \in R^{N_{h} \times N_{s}} .

X^{r} = span {ψ^{1}, \dots, ψ^{N_{r}}} .

X^{r} = span {ψ^{1}, \dots, ψ^{N_{r}}} .

A = ψ^{T} S \in R^{N_{r} \times N_{s}} .

A = ψ^{T} S \in R^{N_{r} \times N_{s}} .

\displaystyle\hat{\mathbf{S}}=[\begin{array}[]{c|c|c|c}{\tilde{\mathbf{u}}^{1}_{h}}&{\tilde{\mathbf{u}}^{2}_{h}}&{\cdots}&{\tilde{\mathbf{u}}^{N_{s}}_{h}}\end{array}]\approx\boldsymbol{\psi}\mathbf{A}\in\mathbb{R}^{N_{h}\times N_{s}},

\displaystyle\hat{\mathbf{S}}=[\begin{array}[]{c|c|c|c}{\tilde{\mathbf{u}}^{1}_{h}}&{\tilde{\mathbf{u}}^{2}_{h}}&{\cdots}&{\tilde{\mathbf{u}}^{N_{s}}_{h}}\end{array}]\approx\boldsymbol{\psi}\mathbf{A}\in\mathbb{R}^{N_{h}\times N_{s}},

\frac{\sum _{i = 1}^{N_{s}} u ^ _{h}^{i} - u ~ _{h}^{i} _{R^{N_{h}}}^{2}}{\sum _{i = 1}^{N_{s}} u ^ _{h}^{i} _{R^{N_{h}}}^{2}} = \frac{\sum _{i = N_{r} + 1}^{N_{s}} λ _{i}^{2}}{\sum _{i = 1}^{N_{s}} λ _{i}^{2}},

\frac{\sum _{i = 1}^{N_{s}} u ^ _{h}^{i} - u ~ _{h}^{i} _{R^{N_{h}}}^{2}}{\sum _{i = 1}^{N_{s}} u ^ _{h}^{i} _{R^{N_{h}}}^{2}} = \frac{\sum _{i = N_{r} + 1}^{N_{s}} λ _{i}^{2}}{\sum _{i = 1}^{N_{s}} λ _{i}^{2}},

\dot{\hat{u}}_{h} (x, t, ν) + N_{h} [\hat{u}_{h} (x, t, ν)] + L_{h} [\hat{u}_{h} (x, t, ν); ν] = 0,

\dot{\hat{u}}_{h} (x, t, ν) + N_{h} [\hat{u}_{h} (x, t, ν)] + L_{h} [\hat{u}_{h} (x, t, ν); ν] = 0,

ψ \dot{a_{r}} (t, ν) + N_{h} [ψ a_{r} (t, ν)] + L_{h} [ψ a_{r} (t, ν); ν] = 0,

ψ \dot{a_{r}} (t, ν) + N_{h} [ψ a_{r} (t, ν)] + L_{h} [ψ a_{r} (t, ν); ν] = 0,

\dot{a_{r}} (t, ν) + N_{r} [a_{r} (t, ν)] + L_{r} [a_{r} (t, ν); ν] = 0,

\dot{a_{r}} (t, ν) + N_{r} [a_{r} (t, ν)] + L_{r} [a_{r} (t, ν); ν] = 0,

\frac{d a _{k}}{d t} = b_{k}^{1} + b_{k}^{2} + i = 1 \sum N_{r} (L_{ik}^{1} + L_{ik}^{2}) a_{i} + i = 1 \sum N_{r} j = 1 \sum N_{r} N_{ij k} a_{i} a_{j}, for k = 1, 2, \dots, N_{r},

\frac{d a _{k}}{d t} = b_{k}^{1} + b_{k}^{2} + i = 1 \sum N_{r} (L_{ik}^{1} + L_{ik}^{2}) a_{i} + i = 1 \sum N_{r} j = 1 \sum N_{r} N_{ij k} a_{i} a_{j}, for k = 1, 2, \dots, N_{r},

b_{k}^{1} b_{k}^{2} L_{ik}^{1} L_{ik}^{2} N_{ij k} = (ν L [\overline{u}_{h}], ψ^{k}), = (N [\overline{u}_{h}; \overline{u}_{h}], ψ^{k}), = (ν L [ψ^{i}], ψ^{k}), = (N [\overline{u}_{h}; ψ^{i}] + N [ψ^{i}; \overline{u}_{h}], ψ^{k}), = (N [ψ^{i}; ψ^{j}], ψ^{k}),

b_{k}^{1} b_{k}^{2} L_{ik}^{1} L_{ik}^{2} N_{ij k} = (ν L [\overline{u}_{h}], ψ^{k}), = (N [\overline{u}_{h}; \overline{u}_{h}], ψ^{k}), = (ν L [ψ^{i}], ψ^{k}), = (N [\overline{u}_{h}; ψ^{i}] + N [ψ^{i}; \overline{u}_{h}], ψ^{k}), = (N [ψ^{i}; ψ^{j}], ψ^{k}),

(f, g) = \int_{Ω} f g d Ω.

(f, g) = \int_{Ω} f g d Ω.

a_{r} (t = 0) = (ψ^{T} \hat{u}_{h} (t = 0)) .

a_{r} (t = 0) = (ψ^{T} \hat{u}_{h} (t = 0)) .

\frac{d a _{r}}{d t} = f (a_{r}, θ), (θ) \in Θ,

\frac{d a _{r}}{d t} = f (a_{r}, θ), (θ) \in Θ,

\frac{d b _{g}}{d t} = - b_{g}^{T} \frac{\partial f ( a _{r} , θ )}{\partial a _{r}},

\frac{d b _{g}}{d t} = - b_{g}^{T} \frac{\partial f ( a _{r} , θ )}{\partial a _{r}},

b_{g} = [\frac{\partial E}{\partial a _{r}}, \frac{\partial E}{\partial θ}, \frac{\partial E}{\partial t}]^{T}, E \in E,

b_{g} = [\frac{\partial E}{\partial a _{r}}, \frac{\partial E}{\partial θ}, \frac{\partial E}{\partial t}]^{T}, E \in E,

input gate: forget gate: output gate: internal state: output: G_{i} = φ_{S} \circ F_{i}^{N_{c}} (a_{r}), G_{f} = φ_{S} \circ F_{f}^{N_{c}} (a_{r}), G_{o} = φ_{S} \circ F_{o}^{N_{c}} (a_{r}), s_{t} = G_{f} ⊙ s_{t - 1} + G_{i} ⊙ (φ_{T} \circ F_{a_{r}}^{N_{c}} (a_{r})), G_{o} \circ φ_{T} (s_{t}),

input gate: forget gate: output gate: internal state: output: G_{i} = φ_{S} \circ F_{i}^{N_{c}} (a_{r}), G_{f} = φ_{S} \circ F_{f}^{N_{c}} (a_{r}), G_{o} = φ_{S} \circ F_{o}^{N_{c}} (a_{r}), s_{t} = G_{f} ⊙ s_{t - 1} + G_{i} ⊙ (φ_{T} \circ F_{a_{r}}^{N_{c}} (a_{r})), G_{o} \circ φ_{T} (s_{t}),

F^{n} (x) = W x + B,

F^{n} (x) = W x + B,

\frac{d a _{r}}{d t} = e^{W t} W a_{r},

\frac{d a _{r}}{d t} = e^{W t} W a_{r},

W = - N_{r} [.] - L_{r} [.] .

W = - N_{r} [.] - L_{r} [.] .

P a_{h} = \frac{( a _{r}^{0} , a _{h}^{0} )}{( a _{r}^{0} , a _{r}^{0} )} a_{r}, Q = I - P,

P a_{h} = \frac{( a _{r}^{0} , a _{h}^{0} )}{( a _{r}^{0} , a _{r}^{0} )} a_{r}, Q = I - P,

\frac{d a _{r}}{d t} = e^{W t} (Q + P) W a_{r},

\frac{d a _{r}}{d t} = e^{W t} (Q + P) W a_{r},

e^{W t} P W a_{r} = \frac{( a _{r}^{0} , a _{h}^{0} )}{( a _{r}^{0} , a _{r}^{0} )} e^{W t} a_{r} = M a_{r},

e^{W t} P W a_{r} = \frac{( a _{r}^{0} , a _{h}^{0} )}{( a _{r}^{0} , a _{r}^{0} )} e^{W t} a_{r} = M a_{r},

e^{W t} Q W a_{r}^{0} = e^{Q W t} Q W a_{r}^{0} + \int_{0}^{t} e^{W (t - t_{1})} P W e^{Q W t_{1}} Q W a_{r}^{0} d t_{1} = G a_{r},

e^{W t} Q W a_{r}^{0} = e^{Q W t} Q W a_{r}^{0} + \int_{0}^{t} e^{W (t - t_{1})} P W e^{Q W t_{1}} Q W a_{r}^{0} d t_{1} = G a_{r},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Time-series learning of latent-space dynamics for reduced-order model closure

Romit Maulik

Arvind Mohan

Bethany Lusch

Sandeep Madireddy

Prasanna Balaprakash

Daniel Livescu

Argonne Leadership Computing Facility, Argonne National Laboratory, Lemont, IL 60439, USA11footnotemark: 1

Center for Nonlinear Studies/CCS-2 Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA22footnotemark: 2

Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL 60439, USA33footnotemark: 3

Abstract

We study the performance of long short-term memory networks (LSTMs) and neural ordinary differential equations (NODEs) in learning latent-space representations of dynamical equations for an advection-dominated problem given by the viscous Burgers equation. Our formulation is devised in a non-intrusive manner with an equation-free evolution of dynamics in a reduced space with the latter being obtained through a proper orthogonal decomposition. In addition, we leverage the sequential nature of learning for both LSTMs and NODEs to demonstrate their capability for closure in systems which are not completely resolved in the reduced space. We assess our hypothesis for two advection-dominated problems given by the viscous Burgers equation. It is observed that both LSTMs and NODEs are able to reproduce the effects of the absent scales for our test cases more effectively than intrusive dynamics evolution through a Galerkin projection. This result empirically suggests that time-series learning techniques implicitly leverage a memory kernel for coarse-grained system closure as is suggested through the Mori-Zwanzig formalism.

keywords:

ROMs , LSTMs , Neural ODEs , Closures

††journal: Elsevier

1 Introduction

High-fidelity simulations of systems characterized by nonlinear partial differential equations represent immense computational expenditure and are prohibitive for decision-making tasks for applications. To address this issue, there has recently been a significant quantity of research into the reduced-order modeling (ROM) of such systems to reduce the degrees of freedom of the forward problem to manageable magnitudes [1, 2, 3, 4, 5, 6, 7]. As such, this field finds extensive application in control [8], multi-fidelity optimization [9] and uncertainty quantification [10, 11] among others. However, ROMs are limited in how they handle nonlinear dependence and perform poorly for complex physical phenomena which are inherently multiscale in space and time [12, 13, 14, 15]. To address this issue, researchers continue to search for efficient and reliable ROM techniques for transient nonlinear systems.

A common ROM development procedure may be described by the following tasks:

Reduced basis identification. 2. 2.

Nonlinear dynamical system evolution in the reduced basis. 3. 3.

Reconstruction in full-order space for assessments.

The first two items of the aforementioned schema individually constitute areas of extensive investigation, and there are studies which attempt to combine these into one optimization problem as well. In this investigation, we utilize conventional ideas for reduced basis identification with the use of the proper orthogonal decomposition (POD) for finding the optimal global basis. We proceed by considering a parameterized time-dependent partial differential equation given (in the full-order space) by

[TABLE]

where $\Omega\subset\mathbb{R}^{1},\mathcal{T}=[0,T],\mathcal{P}\subset\mathbb{R}^{1}$ and $\mathcal{N}$ , $\mathcal{L}$ are non-linear and linear operators respectively. Our system is characterized by a solution field $u:\Omega\times\mathcal{T}\times\mathcal{P}\rightarrow\mathbb{R}^{1}$ and appropriately chosen initial as well as boundary conditions. We assume that our system of equations can be solved in space-time on a discrete grid resulting in the following systems of parameterized ODEs

[TABLE]

where $\mathbf{u}_{h}:\mathcal{T}\times\mathcal{P}\rightarrow\mathbb{R}^{N_{h}}$ is a discrete solution and $N_{h}$ is the number of spatial degrees of freedom. Specifically, our problem is given by the viscous Burgers’ equation with periodic boundary conditions which can be represented as

[TABLE]

It is well known that the above equations are capable of generating discontinuous solutions even if initial conditions are smooth and $\nu$ is sufficiently small due to advection-dominated behavior. We can then proceed to project our governing equations onto a space of reduced orthonormal bases for inexpensive forward solves of the dynamics.

1.1 Proper orthogonal decomposition

In this section, we review the proper orthogonal decomposition (POD) technique for the construction of a reduced basis [16, 17]. The interested reader may also find an excellent explanation of POD and its relationship with other dimension-reduction techniques in [18]. The POD procedure is tasked with identifying a space

[TABLE]

which approximates snapshots optimally with respect to the $L^{2}-$ norm. The process of $\boldsymbol{\vartheta}$ generation commences with the collections of snapshots in the snapshot matrix

[TABLE]

where $\hat{\mathbf{u}}_{i}:\mathcal{T}\times\mathcal{P}\rightarrow\mathbb{R}^{N_{h}}$ corresponds to an individual snapshot in time (for a total of $N_{s}$ snapshots) of the discrete solution domain with mean value removed i.e.,

[TABLE]

with $\overline{\mathbf{u}}_{i}:\mathcal{P}\rightarrow\mathbb{R}^{N_{h}}$ being the time averaged solution field. Our POD bases can then be extracted efficiently through the method of snapshots where we solve an eigenvalue problem for a correlation matrix

[TABLE]

where $\Lambda=\operatorname{diag}\left\{\lambda_{1},\lambda_{2},\cdots,\lambda_{N_{s}}\right\}\in\mathbb{R}^{N_{s}\times N_{s}}$ is the diagonal matrix of eigenvalues and $\mathbf{W}\in\mathbb{R}^{N_{s}\times N_{s}}$ is the eigenvector matrix. Our POD basis matrix can then be obtained by

[TABLE]

In practice a reduced basis $\boldsymbol{\psi}\in\mathbb{R}^{N_{h}\times N_{r}}$ is built by choosing the first $N_{r}$ columns of $\boldsymbol{\vartheta}$ for the purpose of efficient ROMs where $N_{r}\ll N_{s}$ . This reduced basis spans a space given by

[TABLE]

The coefficients of this reduced basis (which capture the underlying temporal effects) may be extracted as

[TABLE]

The POD approximation of our solution is then obtained via

[TABLE]

where $\tilde{\mathbf{u}}_{h}^{i}:\mathcal{T}\times\mathcal{P}\rightarrow\mathbb{R}^{N_{h}}$ corresponds to the POD approximation to $\hat{\mathbf{u}}_{h}^{i}$ . The optimal nature of reconstruction may be understood by defining the relative projection error

[TABLE]

which exhibits that with increasing retention of POD bases, increasing reconstruction accuracy may be obtained. As shall be explained later, the coefficient matrix $\mathbf{A}$ forms our training data for time-series learning.

1.2 Galerkin-projection onto reduced space

The orthogonal nature of the POD basis may be leveraged for a Galerkin projection onto the reduced basis. We start by revisiting Equation (1) written in the form of an evolution equation for fluctuation components i.e.,

[TABLE]

which can expressed in the reduced basis as

[TABLE]

where $\mathbf{a}_{r}:\mathcal{T}\times\mathcal{P}\rightarrow\mathbb{R}^{N_{r}},\mathbf{a}_{r}\in\alpha$ corresponds to the temporal coefficients at one time instant of the system evolution (i.e., equivalent to a particular column of $\mathbf{A}$ ). The orthogonal nature of the reduced basis can be leveraged to obtain

[TABLE]

which we denote the POD Galerkin-projection formulation (POD-GP). Note that we have assumed that the residual generated by the truncated representation of the full-order model is orthogonal to the reduced basis. We note that it is precisely this assumption that necessitates closure. From the point of view of the Burgers’ equations given in Equation (5), our POD-GP implementation is given as

[TABLE]

where $a_{k}:\mathcal{T}\times\mathcal{P}\rightarrow\mathbb{R}^{1}$ is one component of $\mathbf{a}_{r}$ and where

[TABLE]

are operators which can be computed offline (i.e, $b_{k}^{1},L_{ik}^{1}:\mathcal{P}\rightarrow\mathbb{R}^{1};b_{k}^{2},L_{ik}^{2},N_{ijk}\in\mathbb{R}^{1}$ ) and where we have defined an inner-product by

[TABLE]

with $L[f]=\frac{\partial^{2}f}{\partial x^{2}}$ and $N[f;g]=-f\frac{\partial g}{\partial x}$ , the operators stemming from the Burgers’ equation. It will be discussed and demonstrated later, that the absence of higher-basis nonlinear interactions causes errors in the forward evolution of this system of equations. Note that the POD-GP essentially consists of $N_{r}$ coupled ODEs and is solved by a standard total-variation diminishing third-order Runge-Kutta method. The reduced degrees of freedom lead to very efficient forward solves of the problem. Note that this transformed problem has initial conditions given by

[TABLE]

1.3 Contribution

In this article, we investigate strategies to bypass the POD-GP process with an intrusion-free (or equation-free) learning of dynamics in reduced space. We deploy machine learning strategies devised for sequential data learning on time-series samples extracted from the true dynamics of the problems considered. In recent literature, there has been a considerable interest in the utility of machine learning techniques for effective ROM construction. Data-driven techniques have been used in various phases of ROM techniques such as in reduced basis construction [19], augmented dynamics evolution [20, 21, 14, 5, 22, 15, 23, 6, 24, 25] and system identification [26, 27, 28, 29, 30].

In this article, we study the utility of data-driven techniques to make a posteriori predictions for state evolution in reduced space with assessments made on their ability to reconstruct transient characteristics of the full-order solution. It is also observed that the ability to learn a time series (possible through the in-built design of memory and an assumption of non-i.i.d sequential data in the learning) leads to an implicit closure whereby the effects of uncaptured frequencies are retained, drawing parallels to a Mori-Zwanzig formalism. In related work, the study in [31] utilizes recurrent neural networks to explicitly specify a sub-grid stress model for large eddy simulation with the lower frequency evolution controlled by coarse-grained partial differential equations. Our framework is also similar to [32] where an LSTM is utilized to learn a parametric memory kernel for explicit closure of nonlinear PDEs. The present study can be considered a non-intrusive counterpart of these investigations. In addition, we also detail a formalism for efficient machine learning architecture selection using scaleable Bayesian Optimization. Our test problems are given by the advection-dominated viscous Burgers equation [14] with a moving shock as well as a pseudo-turbulence test case denoted ‘Burgulence’ [33, 34] showing the characteristic $k^{-2}$ scaling in wavenumber ( $k$ ) space.

2 Latent-space learning

In this section, we outline our machine learning techniques for latent-space time-series learning. We study two techniques built around the premise of preserving memory effects within their architecture - neural ordinary differential equations (NODE) and long short-term memory networks (LSTMs). Both frameworks are tasked with predicting the evolution of $\mathbf{a}_{r}$ over time.

2.1 Neural ordinary differential equations

In recent times, there have been several studies which have interpreted residual neural networks using dynamical systems theory [35, 36, 37, 38]. The framework of the NODE [39] envisions the learning of $\mathbf{a}_{r}$ over time as

[TABLE]

where $\Theta\subset\mathbb{R}^{N_{w}}$ is a space of $N_{w}$ user-defined model parameters. The learning can be thought to be through a continuous backpropagation through time, i.e., where there an infinite number of layers in the network with the input layer given by $\mathbf{a}_{r}(t=0)$ and the output layer given by $\mathbf{a}_{r}(t=T)$ . Therefore the NODE approximates the latent-space evolution as an ordinary differential equation which is continuous through time in a manner similar to the Galerkin projection. The function $f:\alpha\times\Theta\rightarrow\mathbb{R}^{N_{r}}$ in this study is represented by a neural network with a single 40-neuron hidden layer and a tan-sigmoid activation where $\alpha\subset\mathbb{R}^{N_{r}},\Theta\subset\mathbb{R}^{N_{w}}$ and $N_{w}$ is the number of parameters of the neural network architecture. Note that the assumption of a single hidden layer architecture for the right-hand side of the latent-space ODE allows for upper-bound guarantees given by the universal approximation theorem (Barron, 1993) although more complicated dynamics may require deeper architectures. Readers are referred to the work of Chen et al. [39], for a detailed discussion of the neural ODE and its utility in learning sequential data.

The forward propagation of information through time (i.e., from $t=0$ to $t=T$ ) is performed through a standard ODE solver (in this case a first-order accurate Euler method) whereas backpropagation of errors is performed through the backward solve of an adjoint system given by

[TABLE]

where $\mathbf{b}_{g}:\mathcal{T}\times\alpha\times\mathcal{E}\rightarrow\mathbb{R}^{N_{r}+N_{w}+1}$ is the augmented state vector given by

[TABLE]

with scalar loss at final time $\mathcal{E}\subset\mathbb{R}^{1}$ obtained at the $t=T$ following forward-propagation. Each calculation of $E$ is followed by the backward solve of Equation (31) (which may be interpreted as continuous backpropagation in time) to calculate $\mathbf{b}_{g}(t=0)$ which can then be used to determine $\frac{\partial E}{\partial\theta}(t=0)$ . This value of the gradient can then be used to update the parameters $\theta$ using an optimization algorithm. In this article, we utilize RMSProp for our loss minimization with a learning rate of 0.01 and a momentum parameter of 0.9. Instead of performing the forward deployment of the NODE and backpropagation of the model errors for the entire domain, we utilize 1000 samples of our total data as our training and validation dataset using the technique detailed in the original article to speed up training. Each sample is given by a sequence of 10 timesteps. The training, for each epoch, is performed using 10 randomly chosen samples (i.e., our batch size is 10) for the calculation of parameter gradients. The final gradient deployed for model training is averaged across this batch. A set of samples (20% of the total 1000), chosen randomly before training, is kept aside from the learning process to assess validation errors. Note that validation errors are also characterized by final timestep loss (i.e., at timestep 10 of each batch) thereby incorporating the degree of error accummulation due to an inaccurately trained model at that epoch. The best model corresponds to the lowest validation loss (averaged across all validation samples). We do not utilize a separate dataset for the purpose of testing.

For the purpose of testing, we note that all assessments for the problems are through forward (or a posteriori) deployment. In other words, the NODE is specified an initial condition and then deployed to obtain state vectors using an ODE forward solve until the final time. The prediction at each time step is obtained by the Euler integration which requires the knowledge of previous state alone. Note that apart from the first prediction by NODE (which utilizes the initial condition), state predictions are recursively utilized for predicting the future. Therefore, testing may be assumed to be a long-term predictive test of the model learning in the presence of deployment error.

2.2 Long short-term memory networks

The long short-term memory (LSTM) network was introduced to consider time-delayed processes where events further back in the past may potentially affect predictions for the current location in the sequence. The basic equations of the LSTM in our context for an input variable $\mathbf{a}_{r}$ are given by

[TABLE]

in which $\boldsymbol{\varphi}_{S}$ and $\boldsymbol{\varphi}_{L}$ refer to tangent sigmoid and tangent hyperbolic activation functions respectively, $N_{c}$ is the number of hidden layer units in the LSTM network. Note that $\mathcal{F}^{n}$ refers to a linear operation given by a matrix multiplication and subsequent bias addition i.e,

[TABLE]

where $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ and $\boldsymbol{B}\in\mathbb{R}^{n}$ for $\mathbf{x}\in\mathbb{R}^{m}$ and where $\mathbf{a}\circ\mathbf{b}$ refers to a Hadamard product of two vectors. The LSTM implementation will be used to advance $\mathbf{a}_{r}$ as a function of time in the reduced space. The LSTM network’s primary utility is the ability to control information flow through time with the use of the gating mechanisms. A greater value of the forget gate (post sigmoidal activation), allows for a greater preservation of past state information through the sequential inference of the LSTM, whereas a smaller value suppresses the influence of the past. Our LSTM deployment utilized 32 neurons in its input, forget, output and state calculation operations each and utilized a learning rate of 0.001. It uses a sequence to sequence prediction utilized as a rolling window for predicting the output at the next timestep. We utilize a batch size of 16 samples with each sample having a sequence of 10 timesteps for all of our LSTM deployments. As in the previous learning approach, a set of data is kept aside for validation. This validation loss is used to make decisions about model selection. Note that the total number of samples (1000) is the same as the NODE deployment.

2.3 Connection with Mori-Zwanzig formalism

In this section, we outline the Mori-Zwanzig formalism [40, 41] for the viscous Burgers equation and connect it to time-series learning in POD space. We frame the (full-order) dynamics evolution in latent space using the following formulation derived from the first step of the Mori-Zwanzig treatment

[TABLE]

where $\mathcal{W}$ is the viscous Burgers operator given by

[TABLE]

We proceed by defining two self-adjoint projection operators into orthogonal subspaces given by

[TABLE]

with $QP=0$ and $\textbf{a}^{0}=\textbf{a}(t=0)$ . Therefore, $P$ may be assumed to be a projection of our full-order representation in POD space ( $\mathbf{a}_{h}$ living in $\textbf{X}^{f}$ ) onto the reduced basis ( $\mathbf{a}_{r}$ living in $\mathbf{X}^{r}$ ). We can further expand our system as

[TABLE]

which may further be decoupled to a Markov-like projection operator $\mathcal{M}$ given by

[TABLE]

and a memory operator $\mathcal{G}$ given by

[TABLE]

for which we have used Dyson’s formula [42] and where $t_{1}$ corresponds to a hyperparameter that specifies the length of memory retention. The second relationship may be assumed to be a combination of memory effects and noise. The final evolution of the system can then be bundled into a linear combination of these two kernels i.e.,

[TABLE]

The reader may compare this expression with that of the internal state update within an LSTM which we revisit here -

[TABLE]

where a linear combination of a nonlinearly transformed input vector at time $t$ with the gated result of a hidden-state at a previous time $t-1$ is utilized for calculating the result vector at the current time. The process of carrying a state through time via gating may be assumed to be a representation of the memory integral (as well as the noise) whereas the utilization of the current input may be assumed to be the Markovian component of the map. In contrast, from the point of view of the NODE implementation, the goal is to learn $e^{\mathcal{W}t}\mathcal{W}\mathbf{a}_{r}$ directly through a neural network.

3 Experiments

In this section, we assess the performance of both NODE and LSTM frameworks in representing latent-space dynamics appropriately. We investigate two problems given by the viscous Burgers’ equation in a periodic domain. Both problems are advection dominated where the first has a moving discontinuity over time (which we shall designate the advecting shock problem) and the second which is characterized by standing shocks of various magnitudes (which we designate Burgulence). Their problem statement and results are shown below.

3.1 Advecting shock

Our first problem is given by the following initial and boundary conditions

[TABLE]

where we specify $L=1$ and maximum time $t_{max}=2$ . An analytical solution for the above set of equations exists and is given by

[TABLE]

where $t_{0}=\text{exp}(Re/8)$ and $Re=1/\nu$ is kept fixed at 1000. We directly utilize this expression to generate our snapshot data for ROM assessments. A visualization of the time evolution of the initial condition is shown in Figure 1. As outlined in a previous assessment of this problem [14], a reduced basis constructed of 20 basis vectors retains 99.93% of the energy of the system. For the purpose of our assessments, we retain solely three modes which results in an unresolved ROM which corresponds to only 86.71 % of the total energy - thus necessitating closure.

We perform an optimization for learning the modal coefficient time series using the LSTM and NODE frameworks. To recap model specifics, we deploy NODE (using 40 neurons) and LSTM (using 32 hidden layer neurons) for learning the sequential nature of the modal coefficient evolution in POD space. We utilize the RMSprop optimizer using a learning rate of 0.01 for the former and 0.001 for the latter and a momentum coefficient of 0.9 for both. Batch sizes of 10 and 16 respectively are also used at each epoch of the learning phase. 1000 randomly chosen sequence lengths of 10 are utilized for learning and validation through time with 20% of the total data kept aside for the latter. We note that the best validation loss (aggregated over all validation samples) is utilized for model selection. Figure 2 shows the progress to convergence for both LSTM and NODE architectures during training for the first three modal coefficients. Both NODE and LSTM trainings are run until validation loss magnitudes hover around a magnitude of 10-4. It is observed that the LSTM framework reaches convergence more quickly although the oscillating losses of the NODE potentially indicate better exploration. We note that the oscillations may also indicate the requirement of a lower learning rate.

The time-series predictions for the trained frameworks are shown in Figure 3 where $a_{0},a_{1}$ and $a_{2}$ correspond to the first three retained modes. For the purpose of comparison, we also show predictions from GP and the true modal coefficients, the latter of which are utilized for training our time-series predictions. We can observe that both LSTM and NODE deployments capture coefficient trends accurately indicating that sequential behavior has been learned. The GP predictions can be seen to show unphysical growth in coefficient amplitude due to the lack of presence of the finer modes. However, LSTM and NODE deployments embed memory into their formulation in the form of a hidden state or through explicit learning of a latent-space ODE respectively. The memory-based nature of their learning leads to excellent agreement with the true behavior of the resolved scales.

The final time reconstructions for the true as well as the GP, LSTM and NODE time-series predictions are shown in Figure 4. One can observe that at this severely truncated state, the discontinuity is not completely resolved. The GP reconstructions show the manifestation of large oscillations (now in the physical domain) whereas NODE and LSTM are able to recover the true solution well. Figures 5 and 6 show a validation of our learning in an ensemble sense, where multiple architectures (with slight differences in the hidden layer neurons) are able to recover similar trends in accuracy as examined through final time reconstruction ability. This reinforces our assumption that an implicit closure is being learned through time-series trends in a statistical manner. The corresponding training losses for the LSTM and NODE architectures are shown in Figures 7 and 8 where it can be seen similar learning trends are obtained with slight variations in the number of trainable parameters.

3.2 Burgers’ turbulence

Our next test case is given by the challenging Burgers’ turbulence or Burgulence test case which leads to multiscale behavior in wavenumber space. Our problem domain is given by a length, $L=2\pi$ , and the initial condition is specified by an initial energy spectrum (in wavenumber space) given by

[TABLE]

where $k$ is the wavenumber and $k_{0}=10$ is the parameter at which the peak value of the energy spectrum is obtained. The constant $A$ is set to the following value

[TABLE]

in order to ensure a total energy of $\int E(k)dk=1/2$ at the initial condition. The initial velocity magnitudes can be expressed in wavenumber space by the following relation with our previously defined spectrum,

[TABLE]

where $\Psi(k)$ is a uniform random number generated between 0 and 1 at each wavenumber. Note that this distribution is constrained by $\Psi(k)=-\Psi(k)$ to ensure that a real initial condition in physical space is obtained. For the purpose of assessment, we use energy spectra given by

[TABLE]

The aforementioned initial conditions are solved for the viscous Burgers equation in wavenumber space by a Runge-Kutta Crank-Nicolson scheme as described in [43]. Note that our $\nu$ is chosen to be $2.5\times 10^{-3}$ to ensure that sharp discontinuities emerge from the smooth initial condition. Our NODE and LSTM hyperparameters are identical to the previous test case. Our investigations here are performed for the initial condition (and its corresponding time evolution) as shown in Figure 9. It may be observed that there is a considerable multiscale element to the nature of the solution - which makes this a challenging problem for POD-ROM techniques.

Figure 10 shows reduced-space time-series evolutions of the three retained modal coefficients for the frameworks we are comparing. It can be observed that the LSTM and NODE techniques are successful in coefficient evolution stabilization in comparison to GP although the LSTM can be seen to add an element of phase error. The NODE, however, captures latent-space trends exceptionally well. The performance of these time-series learning models is further assessed by their reconstruction in physical space as shown in Figure 11 where it can be seen that the LSTM and NODE perform well in preventing spurious oscillations near discontinuities as exhibited by an unclosed GP evolution. A further validation of this hypothesis is observed in Figure 12 where kinetic energy spectra in wavenumber space show that the high residuals of the GP method are controlled effectively by the LSTM and NODE deployments. The LSTM is seen to result in slightly higher residuals for this particular test case and choice of hyperparameters and optimizer.

3.3 Improving performance through hyperparameter search

While the results obtained in the previous sections indicate an acceptable choice for hyperparameters, we utilize Deephyper [44] to improve the test performance of our frameworks. This is motivated by the comparitively poorer performance of the LSTM in the Burgulence experiment. Deephyper relies on an asynchronous model based search (i.e., a dynamically updated surrogate model $\mathcal{S}$ which is inexpensive to evaluate) for obtaining hyperparameters with the lowest validation losses. To ensure an expressive surrogate model which is still computationally tractable, we utilize random forests (RF). This results in a superior search algorithm as compared to both a random-search as well as a genetic algorithm based search. We note that RF also gives us the ability to handle discrete and non-ordinal parameters directly without the need for any encoding. Deephyper is configured for searching the hyperparameter space using a standard Bayesian optimization framework. In other words, for each sampled configuration $s$ , $\mathcal{S}$ predicts a mean value for the validation loss $\mu(s)$ and standard deviation $\sigma(s)$ . This information is utilized recursively to improve the approximation to the loss surface as predicted by $\mathcal{S}$ . In terms of exploring the hyperparameter search space, evaluation points with small values of $\mu(s)$ indicate that $s$ can potentially result in the reduction of validation error subject to the accuracy of $\mathcal{S}$ . Evaluation of points with large values of $\sigma(s)$ improves $\mathcal{S}$ since these locations are areas where $\mathcal{S}$ is least confident about the approximation surface. The choice for the selection of a configuration $s$ is utilized by minimizing an acquisition function given by

[TABLE]

where $\lambda=1.96$ for encouraging exploration.

Deephyper requires a range specification for real variables and a list of possible choices for discrete hyperparameters. Table 1 outlines the range of hyperparameters for the LSTM architecture utilized for the Burgers’ turbulence test case as well as the optimal hyperparameters obtained. A summary of the distribution of sampled hyperparameters and pairwise dependencies is also shown in Figure 13. Note that loss is encoded as negative since the hyperparameter search is based on objective function maximization. Hyperparameter correlations are summarized in Figure 14 where it is observed that most hyperparameters are weakly correlated with each other. However, it must be noted that these results are problem specific. In total, 2151 hyperparameter combinations were evaluated during the process of this search.

For the purpose of comparison, we also show results from a similar hyperparameter search experiment for the NODE but for the advecting shock experiment. The optimal parameters and ranges of this search are shown in Table 2. A summary of the distribution of sampled hyperparameters and pairwise dependencies is also shown in Figure 15. Correlation plots between hyperparameters are shown in Figure 16. In total, 734 hyperparameter combinations were evaluated during the process of this search.

Finally we deploy the optimal hyperparameter configuration for an a posteriori assessment with results as observed in Figure 17. It is observed that an improved performance has been obtained using the LSTM which now matches NODE and true observations. In addition, an analysis of the spectra and residuals in Figure 18 confirms the superior performance as well. Deephyper has successfully led to an improved LSTM architecture.

4 Discussion and conclusions

In this article, we have investigated the utility of using LSTMs and NODEs as non-intrusive learning models for the projections of nonlinear partial differential equations to a latent-space spanned by severely truncated POD modes. We note that the choice of the POD modes (which form a linear subspace) also ensures the applicability of the symmetries of the PDE-governed solution on the machine-learned predictions.

Our ideas are tested on two test cases governed by the viscous Burgers’ equation with the first exhibiting an advecting shock and the second displaying a multiscale nature in full-order space. Both LSTM and NODE formulations are seen to learn the transient nature of our systems in the reduced space since they exploit the sequential nature of data and end up providing an implicit closure. In the second case, we also utilize Deephyper, a scaleable Bayesian optimization package for an improved hyperparameter configuration choice in order to obtain superior performance for the LSTM. The non-i.i.d assumption of the data and associated learning allows for the embedding of a memory effect which provides for accurate coarse-grained evolution of modal coefficients in a manner similar to the Mori-Zwanzig formalism. Our assessments reveal that the machine learning techniques studied here are able to provide stable evolutions of the modal coefficients in comparison to their intrusive and unclosed counterpart (i.e., GP).

We conclude by noting that ROM developments which incorporate history explicitly (such as in the LSTM) or implicitly (such as through a NODE) represent an attractive avenue for exploration for efficient reduced basis dynamics learning of systems which are advection-dominated.

5 Data availability

All the relevant data and codes for this study shall be provided in a public repository at https://github.com/Romit-Maulik/ML_ROM_Closures.

6 Acknowledgements

This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research was funded in part and used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This project was also funded by the Los Alamos National Laboratory, 2019 LDRD grant “Machine Learning for Turbulence”. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. DOE or the United States Government.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Carlberg et al. [2011] K. Carlberg, C. Bou-Mosleh, C. Farhat, Efficient non-linear model reduction via a least-squares Petrov–Galerkin projection and compressive tensor approximations, Int. J. Numer. Meth. Eng. 86 (2011) 155–181.
2Wang et al. [2012] Z. Wang, I. Akhtar, J. Borggaard, T. Iliescu, Proper orthogonal decomposition closure models for turbulent flows: a numerical comparison, Comput. Meth. Appl. M. 237 (2012) 10–26.
3San and Borggaard [2015] O. San, J. Borggaard, Principal interval decomposition framework for POD reduced-order modeling of convective Boussinesq flows, Int. J. Numer. Meth. Fl. 78 (2015) 37–62.
4Ballarin et al. [2015] F. Ballarin, A. Manzoni, A. Quarteroni, G. Rozza, Supremizer stabilization of POD–Galerkin approximation of parametrized steady incompressible Navier–Stokes equations, Int. J. Numer. Meth. Eng. 102 (2015) 1136–1161.
5San and Maulik [2018] O. San, R. Maulik, Extreme learning machine for reduced order modeling of turbulent geophysical flows, Phys. Rev. E 97 (2018) 042322.
6Wang et al. [2019] Q. Wang, J. S. Hesthaven, D. Ray, Non-intrusive reduced order modeling of unsteady flows using artificial neural networks with application to a combustion problem, J. Comp. Phys. 384 (2019) 289–307.
7Choi and Carlberg [2019] Y. Choi, K. Carlberg, Space–time least-squares petrov–galerkin projection for nonlinear model reduction, SIAM J. Sci. Comput. 41 (2019) A 26–A 58.
8Proctor et al. [2016] J. L. Proctor, S. L. Brunton, J. N. Kutz, Dynamic mode decomposition with control, SIAM J. Appl. Dyn. Syst. 15 (2016) 142–161.