Modeling the Dynamics of PDE Systems with Physics-Constrained Deep   Auto-Regressive Networks

Nicholas Geneva; Nicholas Zabaras

arXiv:1906.05747·physics.comp-ph·December 4, 2019

Modeling the Dynamics of PDE Systems with Physics-Constrained Deep Auto-Regressive Networks

Nicholas Geneva, Nicholas Zabaras

PDF

1 Repo

TL;DR

This paper introduces a physics-constrained deep auto-regressive neural network that models nonlinear PDE systems efficiently, requiring minimal training data and providing uncertainty quantification, demonstrated on complex dynamical systems.

Contribution

It presents a novel auto-regressive neural network architecture with a Bayesian framework for modeling PDE systems without training data, reducing computational costs significantly.

Findings

01

Accurately models nonlinear PDE systems like Kuramoto-Sivashinsky and Burgers' equations.

02

Provides uncertainty quantification for predictions at each time step.

03

Outperforms traditional numerical methods in efficiency and data requirements.

Abstract

In recent years, deep learning has proven to be a viable methodology for surrogate modeling and uncertainty quantification for a vast number of physical systems. However, in their traditional form, such models can require a large amount of training data. This is of particular importance for various engineering and scientific applications where data may be extremely expensive to obtain. To overcome this shortcoming, physics-constrained deep learning provides a promising methodology as it only utilizes the governing equations. In this work, we propose a novel auto-regressive dense encoder-decoder convolutional neural network to solve and model non-linear dynamical systems without training data at a computational cost that is potentially magnitudes lower than standard numerical solvers. This model includes a Bayesian framework that allows for uncertainty quantification of the predicted…

Tables6

Table 1. Table 1 : Wall-clock time for both spectral ETDRK4 scheme and AR-DenseED to simulate 5000 5000 5000 time-steps of the Kuramoto-Sivashinsky system. Wall-clock time estimates were obtained by averaging 10 10 10 independent simulation run times.

	Hardware	Backend	$Δ t$	Wall-clock Time (s)
Spectral	Intel Xeon E5-2680	Matlab	$0.1$	$0.185$
AR-DenseED	Intel Xeon E5-2680	PyTorch	$0.1$	$17.042$
AR-DenseED	GeForce GTX 1080 Ti	PyTorch	$0.1$	$12.225$

Table 2. Table 2 : Wall-clock time of finite element, finite difference and AR-DenseED to simulate 400 400 400 time-steps of the 1D-Burgers’ system. Wall-clock time estimates were obtained by averaging 10 10 10 independent simulation run times.

	Hardware	Backend	$Δ t$	Wall-clock (s)
Finite Element	Intel Xeon E5-2680	Fenics	$0.0005$	$43.696$
Finite Element	Intel Xeon E5-2680	Fenics	$0.001$	$22.645$
Finite Element	Intel Xeon E5-2680	Fenics	$0.005$	$7.450$
Finite Difference	Intel Xeon E5-2680	PyTorch	$0.0005$	$4.856$
Finite Difference	GeForce GTX 1080 Ti	PyTorch	$0.0005$	$12.8359$
Finite Difference	Intel Xeon E5-2680	PyTorch	$0.001$	$2.862$
Finite Difference	GeForce GTX 1080 Ti	PyTorch	$0.001$	$6.264$
AR-DenseED	Intel Xeon E5-2680	PyTorch	$0.005$	$1.286$
AR-DenseED	GeForce GTX 1080 Ti	PyTorch	$0.005$	$0.705$

Table 3. Table 3 : Wall-clock time of finite element simulation and AR-DenseED to simulate 200 200 200 time-steps of the 2D coupled Burgers’ system. Wall-clock time estimates were obtained by averaging 10 10 10 independent simulation run times.

	Hardware	Backend	$Δ t$	$Δ x$	Wall-clock (s)
Finite Element	Intel Xeon E5-2680	Fenics	$0.005$	1/128	$2955.38$
Finite Element	Intel Xeon E5-2680	Fenics	$0.005$	1/64	$418.83$
Finite Element	Intel Xeon E5-2680	Fenics	$0.005$	1/32	$133.65$
AR-DenseED	Intel Xeon E5-2680	PyTorch	$0.005$	1/64	$4.691$
AR-DenseED	GeForce GTX 1080 Ti	PyTorch	$0.005$	1/64	$0.841$

Table 4. Table 4 : AR-DenseED and BAR-DenseED training parameters used for the Kuramoto-Sivashinsky system.

Training Parameters		SWAG [55] Parameters
Optimizer	ADAM [48]	Optimizer	ADAM [48]
Weight Decay	$0$	Weight Decay	$0$
Learning Rate	$1 e - 3$	Learning Rate	$1 e - 10$
$β$ Learning Rate	$1 e - 3$	$β$ Learning Rate	$1 e - 6$
Exponential Decay Rate	$0.995$	Collection Rate	$1$ epoch
Training Epochs	$100$	Models Collected	$100$
Training Scenarios	$2560$	Deviation Matrix $H$	10
Mini-batch Size	$256$

Table 5. Table 5 : AR-DenseED and BAR-DenseED training parameters used for the 1D Burgers’ system.

Training Parameters		SWAG [55] Parameters
Optimizer	ADAM [48]	Optimizer	ADAM [48]
Weight Decay	$0$	Weight Decay	$0$
Learning Rate	$1 e - 3$	Learning Rate	$4 e - 8$
$β$ Learning Rate	$1 e - 3$	$β$ Learning Rate	$5 e - 6$
Exponential Decay Rate	$0.99$	Collection Rate	$1$ epoch
Training Epochs	$100$	Models Collected	$90$
Training Scenarios	$2560$	Deviation Matrix $H$	$30$
Mini-batch Size	$256$

Table 6. Table 6 : AR-DenseED and BAR-DenseED training parameters used for the 2D coupled Burgers’ system.

Training Parameters		SWAG [55] Parameters
Optimizer	ADAM [48]	Optimizer	ADAM [48]
Weight Decay	$0$	Weight Decay	$0$
Learning Rate	$1 e - 3$	Learning Rate	$3 e - 8$
$β$ Learning Rate	$1 e - 3$	$β$ Learning Rate	$3 e - 5$
Exponential Decay Rate	$0.995$	Collection Rate	$1$ epoch
Training Epochs	$100$	Models Collected	$90$
Training Scenarios	$5120$	Deviation Matrix $H$	$30$
Mini-batch Size	$128$

Equations97

u (x, t)_{t} + F (x, u (x, t)) = 0, x \in Ω, t \in [0, T], B (u) = b (x, t), x \in Γ, u (x, 0) \sim p (u (x, 0)),

u (x, t)_{t} + F (x, u (x, t)) = 0, x \in Ω, t \in [0, T], B (u) = b (x, t), x \in Γ, u (x, 0) \sim p (u (x, 0)),

u^{n + 1} = f (χ^{n + 1}, w), χ^{n + 1} \equiv {u^{n}, u^{n - 1}, \dots, u^{n - k}},

u^{n + 1} = f (χ^{n + 1}, w), χ^{n + 1} \equiv {u^{n}, u^{n - 1}, \dots, u^{n - k}},

u^{1} = f (χ^{1}, w), u^{2} = f (χ^{2}, w), u^{3} = f (χ^{3}, w), \dots u^{N} = f (χ^{N}, w), χ^{1} = {u_{0}, u_{0}, u_{0}, \dots, u_{0}}, χ^{2} = {u^{1}, u_{0}, u_{0}, \dots, u_{0}}, χ^{3} = {u^{2}, u^{1}, u_{0}, \dots, u_{0}}, χ^{N} = {u^{N - 1}, u^{N - 2}, \dots, u^{N - 1 - k}},

u^{1} = f (χ^{1}, w), u^{2} = f (χ^{2}, w), u^{3} = f (χ^{3}, w), \dots u^{N} = f (χ^{N}, w), χ^{1} = {u_{0}, u_{0}, u_{0}, \dots, u_{0}}, χ^{2} = {u^{1}, u_{0}, u_{0}, \dots, u_{0}}, χ^{3} = {u^{2}, u^{1}, u_{0}, \dots, u_{0}}, χ^{N} = {u^{N - 1}, u^{N - 2}, \dots, u^{N - 1 - k}},

u^{n + 1} = f ({f (χ^{n}, w), f (χ^{n - 1}, w), \dots, f (χ^{n - k}, w)}, w),

u^{n + 1} = f ({f (χ^{n}, w), f (χ^{n - 1}, w), \dots, f (χ^{n - k}, w)}, w),

w arg min f (χ^{n + 1}, w) - T_{Δ t} (U^{n + 1}, F_{Δ x} (\cdot))_{2}^{2},

w arg min f (χ^{n + 1}, w) - T_{Δ t} (U^{n + 1}, F_{Δ x} (\cdot))_{2}^{2},

f (χ^{n}, w) = d eco d er \circ e n co d er (χ^{n}),

f (χ^{n}, w) = d eco d er \circ e n co d er (χ^{n}),

u^{n} = d eco d er (z, w_{d}), z = e n co d er (χ^{n}, w_{e}),

L = \frac{1}{M} j = 1 \sum M i = 1 \sum N \hat{u}_{j}^{i} - u_{j}^{i}_{2}^{2}, \hat{u}_{j}^{i} = T_{Δ t} (U_{j}^{i}, F_{Δ x}),

L = \frac{1}{M} j = 1 \sum M i = 1 \sum N \hat{u}_{j}^{i} - u_{j}^{i}_{2}^{2}, \hat{u}_{j}^{i} = T_{Δ t} (U_{j}^{i}, F_{Δ x}),

\frac{u ^ ^{n + 1} - u ^{n}}{Δ t}

\frac{u ^ ^{n + 1} - u ^{n}}{Δ t}

\hat{u}^{n + 1} = T_{Δ t} (U^{n + 1}, F_{Δ x})

L

L

= \frac{1}{M} i = 1 \sum M u_{i}^{n + 1} - u_{i}^{n} + Δ t F_{Δ x} (x_{i}, u_{i}^{n})_{2}^{2},

= \frac{Δ t ^{2}}{M} i = 1 \sum M \frac{u _{i}^{n + 1} - u _{i}^{n}}{Δ t} + F_{Δ x} (x_{i}, u_{i}^{n})_{2}^{2},

= \frac{Δ t ^{2}}{M} i = 1 \sum M R (u_{i}^{n + 1})_{2}^{2} = \frac{Δ t ^{2}}{M} i = 1 \sum M R (f (χ_{i}^{n + 1}, w))_{2}^{2},

u_{i}^{n + 1} = T_{Δ t} (U^{n + 1}, F_{Δ x}) + E_{Δ t} + E_{Δ x},

u_{i}^{n + 1} = T_{Δ t} (U^{n + 1}, F_{Δ x}) + E_{Δ t} + E_{Δ x},

\hat{u}^{i} = f (χ^{i}, w) + ϵ, p (ϵ) = N (ϵ ∣0, β^{- 1} I_{d}),

\hat{u}^{i} = f (χ^{i}, w) + ϵ, p (ϵ) = N (ϵ ∣0, β^{- 1} I_{d}),

p (\hat{u}^{i} ∣ χ^{i}, w, β)

p (\hat{u}^{i} ∣ χ^{i}, w, β)

= N (T_{Δ t} (U^{i}, F_{Δ x}) ∣ f (χ^{i}, w), β^{- 1} I_{d}) .

p (\hat{u}^{N}, \dots, \hat{u}^{1} ∣ u_{0}, w, β)

p (\hat{u}^{N}, \dots, \hat{u}^{1} ∣ u_{0}, w, β)

= i = 1 \prod N N (T_{Δ t} (U^{i}, F_{Δ x}) ∣ f (χ^{i}, w), β^{- 1} I_{d}) .

p (β) = Γ (β ∣ a_{1}, b_{1}),

p (β) = Γ (β ∣ a_{1}, b_{1}),

\hat{u}^{i}

\hat{u}^{i}

= T_{Δ t} (U^{i}, F_{Δ x}) + c_{0} (Δ t)^{k_{0}} + O ((Δ t)^{k_{1}}),

p (w ∣ α) = N (w ∣0, α^{- 1} I_{K}), p (α) = G amma (α ∣ a_{0}, b_{0}),

p (w ∣ α) = N (w ∣0, α^{- 1} I_{K}), p (α) = G amma (α ∣ a_{0}, b_{0}),

p (w, β ∣ S)

p (w, β ∣ S)

\sim j = 1 \prod M i = 1 \prod N [p (\hat{u}_{j}^{i} ∣ χ_{j}^{i}, w, β)] p (w ∣ α) p (α ∣ a_{0}, b_{0}) p (β ∣ a_{1}, b_{1}),

\displaystyle\begin{multlined}\sim\prod^{M}_{j=1}\prod_{i=1}^{N}\left[\mathcal{N}\left(T_{\Delta t}\left(\bm{\mathcal{U}}_{j}^{i},F_{\Delta x}\right)|f\left(\bm{\chi}_{j}^{i},\textbf{w}\right),\beta^{-1}\bm{I}_{d}\right)\right]\mathcal{N}\left(\mathbf{w}|0,\alpha^{-1}\bm{I}_{K}\right)\\ \qquad\Gamma\left(\alpha|a_{0},b_{0}\right)\Gamma\left(\beta|a_{1},b_{1}\right).\end{multlined}\sim\prod^{M}_{j=1}\prod_{i=1}^{N}\left[\mathcal{N}\left(T_{\Delta t}\left(\bm{\mathcal{U}}_{j}^{i},F_{\Delta x}\right)|f\left(\bm{\chi}_{j}^{i},\textbf{w}\right),\beta^{-1}\bm{I}_{d}\right)\right]\mathcal{N}\left(\mathbf{w}|0,\alpha^{-1}\bm{I}_{K}\right)\\ \qquad\Gamma\left(\alpha|a_{0},b_{0}\right)\Gamma\left(\beta|a_{1},b_{1}\right).

p (θ ∣ S) \sim N (θ_{S W A}, Σ_{S W A}), θ \equiv {w, l n (β)},

p (θ ∣ S) \sim N (θ_{S W A}, Σ_{S W A}), θ \equiv {w, l n (β)},

θ_{S W A} = \frac{1}{S} i = 1 \sum S θ_{i}, Σ_{S W A} = \frac{1}{2} (Σ_{D ia g} + Σ_{l r}),

θ_{S W A} = \frac{1}{S} i = 1 \sum S θ_{i}, Σ_{S W A} = \frac{1}{2} (Σ_{D ia g} + Σ_{l r}),

\overset{ˉ}{θ}^{2} = \frac{1}{S} i = 1 \sum S θ_{i}^{2}, Σ_{D ia g} = Diag (\overset{ˉ}{θ^{2}} - θ_{S W A}^{2}),

Σ_{l r} = \frac{1}{K - 1} D D^{T}, D_{i} = (θ_{i} - θ_{S W A}),

p (\hat{u}^{*} ∣ χ^{*}, S)

p (\hat{u}^{*} ∣ χ^{*}, S)

\approx \frac{1}{P} i = 1 \sum P p (\hat{u}^{*} ∣ f (χ^{*}, w_{i}), β_{i}^{- 1} I), {w_{i}, β_{i}} \sim p (θ ∣ S),

E [\hat{u}^{*} ∣ χ^{*}, S]

E [\hat{u}^{*} ∣ χ^{*}, S]

= E_{p (w ∣ S)} [f (χ^{*}, w)] \approx \frac{1}{P} i = 1 \sum P f (χ^{*}, w_{i}),

Cov [\hat{u}^{*} ∣ χ^{*}, S]

Cov [\hat{u}^{*} ∣ χ^{*}, S]

= E_{p (l n (β) ∣ S)} [β^{- 1} I] + Cov_{p (w ∣ S)} (f (χ^{*}, w)),

\approx \frac{1}{P} i = 1 \sum P [β_{i}^{- 1} I + f (χ^{*}, w_{i}) f (χ^{*}, w_{i})^{T}] - E [\hat{u}^{*} ∣ χ^{*}, S] E [\hat{u}^{*} ∣ χ^{*}, S]^{T},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

cics-nd/ar-pde-cnn
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Modeling the Dynamics of PDE Systems with Physics-Constrained Deep Auto-Regressive Networks

Nicholas Geneva

[email protected]

Nicholas Zabaras

[email protected]

[

Center for Informatics and Computational Science, University of Notre Dame, 311 Cushing Hall, Notre Dame, IN 46556, USA

Abstract

In recent years, deep learning has proven to be a viable methodology for surrogate modeling and uncertainty quantification for a vast number of physical systems. However, in their traditional form, such models can require a large amount of training data. This is of particular importance for various engineering and scientific applications where data may be extremely expensive to obtain. To overcome this shortcoming, physics-constrained deep learning provides a promising methodology as it only utilizes the governing equations. In this work, we propose a novel auto-regressive dense encoder-decoder convolutional neural network to solve and model non-linear dynamical systems without training data at a computational cost that is potentially magnitudes lower than standard numerical solvers. This model includes a Bayesian framework that allows for uncertainty quantification of the predicted quantities of interest at each time-step. We rigorously test this model on several non-linear transient partial differential equation systems including the turbulence of the Kuramoto-Sivashinsky equation, multi-shock formation and interaction with 1D Burgers’ equation and 2D wave dynamics with coupled Burgers’ equations. For each system, the predictive results and uncertainty are presented and discussed together with comparisons to the results obtained from traditional numerical analysis methods.

keywords:

Physics-Informed Machine Learning , Auto-regressive Model , Deep Neural Networks , Convolutional Encoder-Decoder , Uncertainty Quantification , Dynamics , Partial Differential Equations

††journal: Journal of Computational Physics

url]https://cics.nd.edu/

1 Introduction

In almost all scientific domains, simulating systems of partial differential equations (PDEs) is of great importance and research interest. Given that many physical phenomena including heat diffusion, fluid dynamics, and elasticity are formalized with PDEs, numerically or analytically solving these governing equations is a core foundation for a vast spectrum of scientific and engineering disciplines. In recent decades exponential growth in computational power has made such numerical methods for solving PDEs even more accessible. However, in most modern-day applications, obtaining the desired resolution or accuracy with such simulations is still computationally expensive. Hence, many seek to strike an ideal balance between predictive accuracy and computational efficiency. In many situations, such as optimization or inverse problems, a large number of repeated simulations are required prioritizing the computational efficiency of the numerical simulator. Often surrogate models are used to ease this computational burden by providing a fast approximate model that can imitate a standard numerical solver at a significantly reduced computational cost.

In recent years, machine learning and deep learning have entered a renaissance in which groundbreaking findings have made deep learning models widely successful for a vast number of applications [1]. One such application is surrogate modeling in which a deep learning model can be used as a black box method to approximate a physical system. Among the most popular deep learning models is deep neural networks (DNNs), which have proven to be an extremely effective method for modeling a wide spectrum of physical systems such as flow through porous media [2, 3, 4], Navier-Stokes equations [5], turbulence modeling [6], molecular dynamics [7] and more. Traditional DNNs are not probabilistic in nature resulting in Bayesian extensions of these models [8, 9] to quantify the underlying uncertainty in these black box algorithms. While DNNs have been proven to be both accurate and computationally efficient for modeling and uncertainty quantification (UQ), it is commonly known that training such models may require a significant amount of data. Depending on the system of interest, training data may either be sparse, extremely expensive to obtain or not available at all. Considering that the underlying governing equations are known, in this work, we are particularly interested in the surrogate modeling of physical systems using physics-constrained loss functions. Such loss functions allow a surrogate model to be trained in the absence of data (e.g. without having to solve the equations governing the system of interest).

The philosophy of learning ordinary or partial differential equations through constraint based loss functions is far from a new idea with related works reaching back over two decades ago [10, 11, 12, 13]. These early works focused on solving initial/boundary value problems in which the solution is parameterized by a fully-connected network which allows for a fully differentiable and closed analytic form [13]. With the resurgence of interest in neural networks, such techniques have been rediscovered by multiple works in recent years where this core idea has been expanded upon. As discussed in the work of Lagaris et al. [13, 14] and later revisited by Raissi et al. [15], the use of fully connected networks with physics-constrained learning allows for a mesh free solution that can be evaluated anywhere on the domain while being trained on only a few points. Additionally, Lagaris et al. [14] and more recently Berg and Nyström [16] showed fully connected networks can be used to learn PDE solutions on even complex domains. Recently, several investigators have examined the use of variational formulations of the governing equations as loss functions to solve various PDEs [3, 17, 18, 19] which has been proven to be effective. Sirignano et al. [20] show that the use of a fully connected network can be used for efficiently solving PDEs of high dimensionality where traditional discretization techniques become unfeasible. Several have also investigated the use of fully connected networks to solve high-dimensional stochastic PDEs with good success [21, 22]. While fully connected networks could be optimized to compute a single solution of a PDE, several challenges remain in extending these ideas to surrogate model construction. For example, if the initial condition, boundary conditions, material properties, etc. are changed the model must be retrained. This means that from a computational aspect, such methods are difficult to justify if a numerical simulator can be used that is computationally less expensive than training the fully connected network. Clearly, with decades of numerical analysis progress this issue is applicable to an overwhelmingly large amount of PDE systems.

To the authors best knowledge only the works of Zhu et al. [3] and Karumuri et al. [19] seek to build surrogate models using physics-constrained, data free learning. In [3], a deep convolutional neural network was used to formulate a surrogate model for an elliptic PDE with a stochastic, high-dimensional permeability field. Additionally, Zhu et al. proposed a probabilistic framework based on a conditional flow-based generative model [23] to quantify the potential error arising from the model itself. It was also found that the data-less physics-constrained learning yielded a model with much better generalization capabilities than traditional data-driven learning. Karumuri et al. [19] used a deep fully connected ResNet [24] to build a surrogate also for elliptic PDEs with reasonable success. Note that both of these aforementioned works have been focused entirely on PDEs which are not time-dependent in nature.

In this work, we generalize these physics-constrained deep learning surrogate models to dynamical PDEs. The novel contributions of this paper are as follows: (a) We propose a deep auto-regressive dense encoder-decoder for predicting transient PDEs and the physics-constrained training algorithm; (b) Extend this model to a Bayesian framework using the recently proposed stochastic weight averaging Gaussian algorithm to quantify both epistemic and aleatoric uncertainty; (c) Implement this model for a chaotic/turbulent system, a system with multiple shock wave interactions and a 2D system of coupled non-linear PDEs far surpassing the complexity of other test cases shown in past literature; (d) Present and discuss the accuracy of the predictions as well as the associated uncertainty for each of the previously discussed PDEs; and (e) Compare the computational efficiency of the proposed surrogate model against other state-of-the-art numerical methods.

This paper is organized as follows: First, in Section 2, we briefly define and discuss the problem of interest. Section 3 discusses the auto-regressive dense encoder-decoder model, its training and use as a surrogate model. In Section 4, we extend this deep learning model to the Bayesian paradigm where we discuss the formulation of the posterior as well as the approximation of the predictive distribution. Following, in Section 5, the proposed model is implemented for the chaotic Kuramoto-Sivashinsky system and a study is presented of the turbulent statistics that the model produces. In Section 6, we also explore the use of the auto-regressive model for the prediction of shocks in the 1D Burgers’ equation. Later in Section 7, we further extend this to the 2D coupled Burgers’ system. Lastly, conclusions and discussion can be found in Section 8. All code, trained models and data used in this work is open-sourced for full reproducibility.111Code available at: https://github.com/cics-nd/ar-pde-cnn.

2 Problem Definition

In this work, we are interested in using deep learning architectures for developing surrogate models of non-linear dynamical systems that evolve in both space and time using physics-constrained learning. Specifically, we wish to use the governing equations to formulate loss functions for training surrogate models without the need of (output) training data. Our goal is to develop surrogate models for a class of arbitrary transient PDE systems with an unknown, variable or stochastic initial state. Consider a transient system of PDEs that models a physical system:

[TABLE]

where we have denoted this $n$ -dimensional PDE by the temporal derivative $\bm{u}(\bm{x},t)_{t}$ and the remaining terms by $F(\cdot)$ which includes spatial derivatives and non-linear terms. $\bm{u}(\bm{x},t)$ are the system’s state variables in the domain $\Omega$ with a boundary $\Gamma$ . $\mathcal{B}$ is the boundary operator that enforces the desired boundary conditions. Lastly, the initial state $\bm{u}(\bm{x},0)$ is a real valued random field with a probability density, $p\left(\bm{u}(\bm{x},0)\right)$ , that may or may not be known.

Our goal is to expand on the work in [3] in which PDE solutions were represented as an optimization problem by either minimizing an energy functional or alternatively the square of the PDE residual [25]. The objective in [3] was to predict quantities of interest for an elliptic PDE (defining Darcy’s flow) in an image-to-image regression approach using a convolutional encoder-decoder architecture with an input being a property field (permeability) and the output being the quantities of interest (pressure and velocity). The use of a convolutional neural network proved to have some significant benefits over the more commonly used fully-connected networks including faster convergence and better accuracy. While successful for elliptic PDEs, the strategies in this past work cannot directly generalize to a dynamical system.

If one were developing a numerical algorithm to solve a dynamical system, the first step would be to discretize the time derivative, which is commonly referred to as a time-stepping or time integration method [26]. For time integration there are a vast number of options including standard explicit or implicit methods, Runge-Kutta methods, linear multi-step methods, implicit-explicit methods and more. However, the goal of all these techniques is the same: evolve the system from time $t$ to time $t+\Delta t$ . Using this philosophy of discrete time integration, we propose building a surrogate model that performs time integration at a specified $\Delta t$ in an image-to-image regression algorithm using a convolutional encoder-decoder neural network. Let us consider $\bm{u}^{n}$ as the solution of the PDE with $d_{0}$ state variables on a given structured Euclidean discretization of $\Omega$ at time-step $n$ . Namely, given $\Omega$ discretized with $D_{i}$ points in the $i$ -th dimension, $\bm{u}^{n}\in\mathbb{R}^{d_{0}\times D_{1}}$ for a 1D system, $\bm{u}^{n}\in\mathbb{R}^{d_{0}\times D_{1}\times D_{2}}$ for a 2D system and $\bm{u}^{n}\in\mathbb{R}^{d_{0}\times D_{1}\times D_{2}\times D_{3}}$ for a 3D system. Our convolutional encoder-decoder model for simulating time integration at time-step $n$ is parameterized as follows:

[TABLE]

where $f$ represents the function learned by the deep learning model, w are the learnable parameters in this convolutional neural network. $\bm{\chi}^{n+1}$ is the model’s input, for the prediction $\bm{u}^{n+1}$ , consisting of the $k+1$ previous states of the system. By this model definition, we are interested in learning the dynamics or evolution of the system invariant to the current time $t$ . The use of a convolutional neural network allows for a light-weight model that can evolve the system of interest by a discrete time-step efficiently without any matrix inversions, iterative relaxations or multi-step processes. Similar to the convolutional model in Zhu et al. [3], this model can be used/extended for tasks such as solving PDEs, surrogate modeling and performing uncertainty quantification.

To predict a given system’s response for $N$ time-steps, the convolutional neural network is executed as an auto-regressive model. Given just a discretized initial state of the system $\bm{u}_{0}$ , one can predict the system response as:

[TABLE]

where the model must be executed $N$ times to obtain the prediction of the system at time-step $N$ . Note, that the initial input to the model $\bm{\chi}^{1}$ is comprised of just the initial state, which is an approximation needed to “kick-start” the time series. The prediction for a particular time-step can be formulated as a recursive function of the model:

[TABLE]

where the input $\bm{\chi}^{n+1}$ is formulated in terms of the model itself, with inputs that can be described in a similar manner. Thus only the initial state is needed for predicting a systems’ response up to an arbitrary number of time-steps. For the prediction of an entire time series $\left[\bm{u}^{N},\bm{u}^{N-1},\ldots\bm{u}^{1}\right]$ , the model can be represented as a set of functions $\hat{f}(\bm{u}_{0},\textbf{w})=\left\{f\left(\bm{\chi}^{N},\textbf{w}\right),f\left(\bm{\chi}^{N-1},\textbf{w}\right),\ldots,f\left(\bm{\chi}^{1},\textbf{w}\right)\right\}$ in which each can be expressed recursively as a function of the initial state.

As discussed previously, we would like to formulate a methodology of physics-constrained learning such that the model can be trained with only a set of initial states $\bm{u}_{0}$ . Although the same model is used to predict every time-step of a given time series as shown in Eq. (3), the core building block for training this DNN model is learning how to predict the transition of states from $t$ to $t+\Delta t$ regardless of the reference time $t$ . For physics-constrained learning, we will pose the optimization problem for a single time-step, $n\rightarrow n+1$ , as the minimization of the discrepancy between the model’s prediction and the prediction of a discrete numerical time integration method $T_{\Delta t}$ of time-step $\Delta t$ for the governing equation of interest:

[TABLE]

in which the systems states $\bm{u}^{n+1}$ are parameterized by the DNN. The time-integrator predicts numerically the state at the next time-step given some system states, $\bm{\mathcal{U}}^{n+1}$ , and a discretized form of the additional terms of the PDE $F_{\Delta x}$ . The exact definition of $\bm{\mathcal{U}}^{n+1}$ depends on the time integration method used. For example, for the explicit forward Euler scheme, $\bm{\mathcal{U}}^{n+1}=\left\{\bm{u}^{n}\right\}$ depends only on the previous time-step state while for the Crank-Nicolson scheme, $\bm{\mathcal{U}}^{n+1}=\left\{\bm{u}^{n+1},\bm{u}^{n}\right\}$ , depends on both the model’s current prediction as well as the state at the previous time-step. As it will be discussed in greater detail in the following sections, given that $T_{\Delta t}$ is consistent with the governing equation, this minimization can be interpreted as the minimization of the residual of the discretized PDE.

2.1 Surrogate Modeling of Dynamical Systems

In the context of this work, we are focused on developing a surrogate model that can efficiently predict a dynamical system’s response $\bm{y}=\left[\bm{u}^{N},\bm{u}^{N-1},\ldots\bm{u}^{1}\right]$ for time-steps $[1,N]$ for a given initial state realization $\bm{u}_{0,i}\sim p\left(\bm{u}_{0}\right)$ and a set of boundary conditions. We pose the following definition for this surrogate model:

Definition 2.1.

(Deterministic Surrogate Model) For a transient PDE system with specified boundary conditions, as in Eq. (1), and a finite set of initial conditions $\mathcal{S}=\left\{\bm{u}_{0,i}\right\}_{i=1}^{M}$ , $\bm{u}_{0,i}\sim p\left(\bm{u}_{0}\right)$ , train a surrogate to predict the dynamical response $\bm{y}=\hat{f}\left(\bm{u}_{0}^{*},\textbf{w}\right)$ for any initial condition, $\bm{u}_{0}^{*}\sim p\left(\bm{u}_{0}\right)$ , such that the predicted response is the solution of the governing PDEs for the respective initial state.

The true density of $p\left(\bm{u}_{0}\right)$ may not be known. For example, $p\left(\bm{u}_{0}\right)$ may represent a set of states collected from an experiment or simulation. When this density is not known or samples are not available, one may need to approximate it for the sake of assembling a set of initial states for training. In the context of this work, we will pose this initial condition as a random function from which we can sample from to illustrate the applicability of our model. As discussed in Zhu et al. [3], surrogate modeling using physics-constrained learning can be interpreted as unsupervised learning since training takes place without any labeled training data. Rather it is up to the model to discover the dynamics of the system.

In the majority of problems of interest, the number of (initial) training data used will only express a portion of all the inputs that can be drawn from the density function $p\left(\bm{u}_{0}\right)$ . Thus the surrogate model will only be trained to predict a part of all potential responses of the dynamical system. To account for this limited expressibility of both the input data used for training as well as the trained model itself, we also wish to formulate a probabilistic surrogate than can produce distributions over possible solutions, rather than a single point estimate. Hence, we pose the following definition for the probabilistic extension of this dynamical surrogate model:

Definition 2.2.

(Probabilistic Surrogate Model) For a transient PDE system with specified boundary conditions, such as Eq. (1), and a finite set of initial conditions $\mathcal{S}=\left\{\bm{u}_{0,i}\right\}_{i=1}^{M}$ , $\bm{u}_{0,i}\sim p\left(\bm{u}_{0}\right)$ , train a surrogate to predict the dynamical response density, $p\left(\bm{y}|\bm{u}_{0}^{*},\mathcal{S}\right)$ , such that the samples drawn from the predictive distribution satisfy the governing PDEs for the respective initial state.

3 Auto-regressive Dense Encoder-Decoder

The prediction of time series is a classical machine learning problem with many models specifically designed for such tasks most predominately seen in the field of neural language processing. Among the most classical methods are standard recurrent neural networks which tend to be difficult to train [27], as well as long-short term memory (LSTM) architectures [28]. Recent advances of modeling time series include hierarchical networks [29], attention networks [30] and transformer networks [31]. For modeling dynamical systems, we propose the following auto-regressive dense encoder-decoder model (AR-DenseED) illustrated in Fig. 1 which does not rely on any latent variable recurrent connections or specific gate design as seen in LSTMs [28]. The key philosophy of AR-DenseED is to efficiently model time integration by learning how to evolve the system forward in time given $k+1$ previous states. In a true auto-regressive nature, the model predicts the dynamics of the system through sequential forward passes using the previous predictions as inputs as outlined in Algorithm 1. This is shown in Fig. 2 where an AR-DenseED using three previous time-steps as inputs predicts five time-steps into the future.

*Remark 1**.*

The number of previous time-steps used in the model’s input, $\bm{\chi}^{n+1}\equiv\left\{\bm{u}^{n},\bm{u}^{n-1},\ldots,\bm{u}^{n-k}\right\}$ , is a tunable hyper-parameter that can be adjusted depending on the system of interest. For all of the numerical examples tested, we found that including multiple time-steps in the input (i.e. $k\geq 1$ ) was essential for improving training stability. However, using too many past time-steps in $\bm{\chi}$ slows training and does not yield considerable predictive improvements.

AR-DenseED is built using successive layers of encoding/decoding convolutions and dense blocks originally proposed for solving elliptic systems in [2, 3]. The convolutional encoder-decoder can be parameterized as the function composition:

[TABLE]

where $\left\{\textbf{w}_{e},\textbf{w}_{d}\right\}=\textbf{w}$ are the encoder and decoder parameters, respectively. $\bm{z}$ are latent variables that are of lower dimensionality than $\bm{u}^{n}$ . The input channel dimensionality of the convolutional model is the product of the number of state-variables $d_{0}$ and the number of previous time-steps used in $\bm{\chi}^{n}$ . The number of output channels is equivalent to the number of state-variables that are predicted at a given time-step. While we use $f\left(\bm{\chi}^{n},\textbf{w}\right)$ to encapsulate this process, one can interpret this model as learning two processes: encoding data from previous time-steps to a latent variable $\bm{z}$ and the prediction of the next state as a function of $\bm{z}$ . We choose to use convolutional neural networks largely because they have been shown to yield faster convergence and better predictive capability for physics-constrained training compared to fully connected networks [3]. Additionally, convolutional neural networks require significantly less learnable weights than fully connected networks due to parameter sharing which allows for faster predictions [32]. Computational efficiency during test time is imperative for surrogate modeling where prediction speed of the surrogate needs to outperform a numerical solver to justify its use. Extensions of convolutional neural networks to unstructured and non-Euclidean domains can be achieved through geometric deep learning [33], an emerging field that focuses on extending convolutional operators past structured data. This will clearly require approximating the spatial gradients of the PDE on unstructured grids.

3.1 Physics-Constrained Loss Function

To train this model, we will be extending the previous work of Zhu et al. [3] where governing PDEs are used to formulate a loss function. The model is trained such that its predictions satisfy the governing equations of the system requiring only initial states, $\bm{u}_{0}$ . For clarity, we will refer to these initial states as training scenarios. Unlike works that have used fully connected networks for learning PDEs solutions in a similar manner (e.g. [11, 13, 15, 19]), our convolutional neural network surrogate requires the gradients in the governing PDEs to be numerically approximated. In past works regarding the surrogate modeling of elliptic PDEs [3], finite-difference based approximations were used to compute spatial gradients. These approximations were found to be very effective and computationally efficient. Thus a similar approach will be used for spatial derivatives in this work.

In dynamical systems this leaves one more critical component: the time-integrator or time-stepping method [26]. As previously outlined in Eq. (5), we will pose the optimization of this model for a given time-step in terms of minimizing the difference between the model’s predictions and a discrete numerical time-integrator $T_{\Delta t}$ of the governing PDE. The standard $L_{2}$ loss for a series of $N$ time-steps and a mini-batch of $M$ training scenarios, $\mathcal{S}=\left\{\bm{u}_{0,i}\right\}_{i=1}^{M}$ , can be posed as follows:

[TABLE]

where $\bm{u}^{i}$ is the prediction from the neural network $f\left(\bm{\chi}^{i},\textbf{w}\right)$ making the loss implicitly dependent on all states within $\bm{\chi}^{i}$ and $\hat{\bm{u}}^{i}$ is the “target” calculated using the numerical time-integrator. The $L_{2}$ norm corresponds to a finite integral over the entire domain $\Omega$ . It is important to recognize that the discretization of both the time and spatial derivatives has introduced numerical truncation error into our loss function. These errors in the deterministic case will ultimately be neglected, however we will expand on the idea of numerical error in the Bayesian model in Section 4.

This formulation allows for any time integration algorithm to be used, making it very versatile. Thus one can select a numerical integrator that has the desired properties regarding stability, computational cost and accuracy. In this work, we are interested in predicting large time-steps or when the model represents a time-step, $\Delta t$ , resulting in a Courant-Friedrichs-Lewy (CFL) number greater than one. In this regime, explicit methods are fundamentally unstable for hyperbolic PDEs thus one must resort to costly implicit methods which require expensive matrix inversions [26, 34]. Rather than performing a matrix inversion as is the case for an implicit time integration scheme, the neural network’s predictions for the next time-step are used when evaluating the spatial gradients in $T_{\Delta t}$ . This allows for an implicit like time integration without the need for matrix inversions during optimization. Alternatively, $T_{\Delta t}$ could encapsulate multiple explicit calculations each at smaller time-steps. This strategy was not investigated in this work but could allow for greater accuracy at the cost of additional computation.

*Remark 2**.*

The use of discretization methods makes the model vulnerable to the same numerical issues that plague each numerical approximation. Specifically with regards to the time-integrator, while an implicit scheme may be stable for very large time-steps this comes with the implications of reduced accuracy which should be considered. However, the parametrization of the solution as a neural network potentially relaxes the traditional numerical analysis constraints regarding stability and accuracy. Thus this model could allow for large time-step predictions at a greater precision than vanilla numerical methods.

A point that remains to be seen is if this optimization function will lead to the solution of the PDE. By substituting the time-integrator $T_{\Delta t}$ into the loss, one can arrive at the following proposition:

*Proposition 1**.*

The $L_{2}$ minimization between the model’s prediction and a consistent numerical time integration method is analogous to the $L_{2}$ minimization of the discretized PDE residual.

Intuitively, this is a logical statement since time integration methods are formulated on discretizing the temporal derivative. We can easily show this with a simple example for a single time-step from $n\rightarrow n+1$ . Consider the standard forward Euler time integration scheme:

[TABLE]

where $\bm{\mathcal{U}}^{n+1}=\left\{\bm{u}^{n}\right\}$ . Substituting this into the loss function in Eq. (8):

[TABLE]

we arrive at the minimization of the residual, $\mathcal{R}$ , for the discretized PDE given the neural network’s prediction. The same result can be obtained for implicit time integration methods as well. For example, this can easily be seen for the implicit backward Euler scheme by replacing $F_{\Delta x}\left(\bm{x}_{i},\bm{u}^{n}_{i}\right)$ in Eq. (11) with $F_{\Delta x}\left(\bm{x}_{i},\bm{u}^{n+1}_{i}\right)$ where $\bm{u}^{n+1}_{i}$ is the model’s prediction. Thus, if we use a time integration method that is consistent with the governing PDE, this optimization function minimizes the residual of the PDE across the entire domain leading to the solution of the discretized PDE. After all, the minimization of the residual is the foundation for many iterative numerical methods (e.g. [35, 36]) and has been empirically shown to be very effective for learning PDEs with deep learning models in past works [3, 13, 15].

To enforce boundary conditions, one can introduce an additional loss term that enforces the desired response at the domain boundary. This has been successfully implemented in past work for elliptic PDEs [3] and can easily be generalized to dynamical problems. The systems in this work all have periodic boundaries which are enforced by using circular padding in PyTorch [37] for all model convolutions as well as when evaluating the loss. This is the equivalent to the use of “ghost nodes” in numerical simulations for enforcing periodicity. Additional loss terms can be added to impose other prior physical constraints or conservation laws (e.g. solenoidality or mass conservation). Alternatively, one can modify the network architecture directly to impose constraints such as positivity and invariances [6, 38].

3.2 AR-DenseED Training

This model has several important advantages that we will leverage. The first is that time is not an explicit input, thus this model has no fundamental limitation on its predictive range or initial conditions making it easier to train and better at extrapolation. This is a key advantage of this model compared to the standard fully-connected networks that use time as a discrete input which fundamentally limits their predictive capabilities [13, 15, 39]. During training, the model is allowed to progressively explore the system and learn the dynamics by slowly “unrolling” the number of time-steps it predicts as training progresses. For example, at the beginning of training, the model only predicts a few time-steps from its initial state, however this may increase to hundreds of time-steps as training continues. This allows the model to discover and learn dynamics that may be absent from the provided initial states. Additionally, since the model’s output prediction is then the input for the next time-step, we can back-propagate through multiple time-steps as illustrated in Fig. 3 without latent variable recurrent connections. Allowing the model to auto-regress itself forward in time and compute back-propagation through multiple time-steps promotes the learning of continuous time series. In practice, we only back-propagate through a small number of time-steps to avoid vanishing gradient issues. The training process is outlined in Algorithm 2.

4 Bayesian AR-DenseED

A challenge of physics-constrained learning with no output training data is developing a meaningful probabilistic framework. In past works, a probabilistic surrogate was proposed using the Boltzmann distribution as a reference density and minimizing the Kullback-Leibler divergence for a generative model [3, 40]. While such methodologies represent some built in uncertainty in the model and can yield reasonable error bars, the true interpretation of the resulting uncertainty is much less concrete. Thus in this work, we propose a novel Bayesian framework for physics-constrained models that allow for interpretable uncertainty measures to be produced in the absence of training data.

To formulate the Bayesian AR-DenseED (BAR-DenseED) model, we wish to account for the two major uncertainty components: aleatoric uncertainty which quantifies noise in the observations and epistemic uncertainty which captures inherit uncertainty in the model [41]. Epistemic uncertainty is associated with the confidence of the model’s predictions which is influenced by factors such as limited training scenarios, limited expressibility of the model, etc. For DNNs, epistemic uncertainty is most commonly captured by placing priors on the parameters of the model often being formulated as a Bayesian neural network [9]. Aleatoric uncertainty involves the noise that potentially exists in the data on which the model is trained on, and is often captured by placing a distribution over the model’s outputs [42]. In a data-driven sense, aleatoric uncertainty arises from the simulators or sensors used to collect the training data. In the physics-constrained learning paradigm, we will interpret aleatoric uncertainty as the quantification of error associated with the truncation error introduced when formulating the physics-constrained loss function. As discussed in Section 3.1, the physics-constrained optimization of the auto-regressive model is posed as the minimization of the error between the model’s predictions and a numerical time-integrator of the same time-step size. However, this numerical time-integrator introduces truncation error:

[TABLE]

where $E_{\Delta t}$ and $E_{\Delta x}$ denote the error associated with the discretization of the temporal and spatial derivatives, respectively. In the deterministic case, such errors are neglected, however, depending on the resolution of the spatial discretization or the time-step size these errors can impact a numerical solver’s accuracy. For most numerical solvers using explicit time integration schemes, the bulk of this error arises from the discretization of the spatial derivatives. Alternatively, when predicting large time-steps with implicit methods the discrete time integration can become the primary source of error due to numerical diffusion [43].

4.1 Posterior Formulation

With an idea of the sources of uncertainty we wish to account for, let us start with defining a posterior over the model parameters. In a data-driven model, the likelihood captures the probability of the observations for a given model. For physics-constrained learning where no observations are available, we take as “target data” the prediction of the numerical time-integrator $\hat{\bm{u}}$ . Similar to data-driven probabilistic models [2, 6], we account for the potential discretization error that may arise from $T_{\Delta t}$ through additive output-wise noise, $\epsilon$ , for a single arbitrary time-step:

[TABLE]

where for mathematical convenience, we represent the discretized state variables $\bm{u}$ and $\hat{\bm{u}}$ as vectors in $\mathbb{R}^{d}$ . $\bm{I}_{d}$ denotes the identity matrix in $\mathbb{R}^{d\times d}$ . The additive noise is taken as Gaussian with a learnable precision $\beta$ . Thus the likelihood for a single time-step, $i$ , becomes the following:

[TABLE]

Both the inputs to the model $\bm{\chi}^{i}$ and time-integrator $\bm{\mathcal{U}}^{i}$ are found from the deterministic evolution of the model until time-step $i$ . Under the Markov assumption, we can formulate the likelihood of an entire time-sequence as the product of individual steps:

[TABLE]

Note that the number of past time-steps contained in the model’s input, $\bm{\chi}^{i}$ , is analogous to the order of the Markov chain. In this likelihood, the numerical time-integrator, $T_{\Delta t}$ , can be interpreted as calculating the target $\hat{\bm{u}}^{i}$ on-the-fly. Thus the evaluation of this likelihood function is in fact still data-less.

*Remark 3**.*

To find the maximum likelihood estimate (MLE), minimization of the negative log likelihood is often taken as the optimization objective [44]. In this likelihood formulation, when minimizing the negative log likelihood in Eq. (15), one recovers the standard $L_{2}$ loss as previously shown in Eq. (8) for the deterministic model. Thus the MLE is equivalent to the minimization of the strong residual of the discretized PDE for both implicit and explicit time integration schemes indicating the appropriateness of the selected likelihood for our physics-constrained model.

A gamma prior is assigned to the noise precision $\beta$ :

[TABLE]

where $a_{1}$ and $b_{1}$ are the shape and rate parameters, respectively. The hyper-parameters of the $\beta$ prior are set based on the a priori estimate of the magnitude of the discretization error of $T_{\Delta t}$ and $F_{\Delta x}$ . Since we intend to build large time-step surrogate models with $CFL>1$ , we will assume that the majority error arises from the time integration method. However, the following procedure can easily be extended to account for spatial discretization error as well.

Given an arbitrary system, we can express the temporal truncation error in the following formula which is standard in Richardson extrapolation [45]:

[TABLE]

where $c_{i}$ are unknown constants, and $k_{i}$ are constants denoting the “order” of the error term such that $\left(\Delta t\right)^{k_{i}}>\left(\Delta t\right)^{k_{i+1}}$ . Under the assumption that higher-order terms are negligible, we take the expected value of our prior to be $\mathbb{E}\left(\beta^{-1}\right)=c_{0}\left(\Delta t\right)^{k_{0}}$ . Additionally this prior is given a large variance to reduce its strength. The parameters $c_{0}$ and $k_{0}$ can be estimated based on the magnitude of the state variables and the order of accuracy of the temporal discretization, respectively [46]. In this work, $c_{0}$ is set to be approximately $20\%$ the maximum value of the quantity of interest and $k_{0}$ is set to three which over estimates the accuracy of the second-order accurate time integration method used. This encourages the model to be more accurate at the start of training rather than attributing initial prediction discrepancies to noise.

As is standard for Bayesian neural networks, the network’s $K$ learnable parameters, w, are treated as random variables. Due to the large number of weights in our model, we propose a fully factorizable zero mean Gaussian with a Gamma-distributed precision scalar $\alpha$ :

[TABLE]

where the rate parameter $a_{0}$ and the shape parameter $b_{0}$ are $0.5$ and $10$ , respectively. This results in a prior with a Student’s $\mathcal{T}$ density centered at zero that has a wider support region than a standard Gaussian. In our past works [2, 6], a narrow Student’s $\mathcal{T}$ -distribution was used to more strongly promote sparsity [47], however it was found that a narrow prior would damage the predictive capability of BAR-DenseED. Thus the sparsity requirement is relaxed while still regulating the magnitude of the model’s weights. We note that when one uses an optimizer with momentum and weight decay, such as ADAM [48], an implicit prior on the weights is enforced which is largely ambiguous [49]. Since we will ultimately approximate the posterior, this prior does not need to be accounted for in the formulation of the joint posterior used for optimization. As a result for a batch of $M$ i.i.d. training scenarios, $\mathcal{S}=\left\{\bm{u}_{0,i}\right\}_{i=1}^{M}$ , the posterior of the network is as follows:

[TABLE]

*Remark 4**.*

The task of computing the maximum a posteriori probability (MAP) estimate is closely related to maximizing the likelihood with the addition of appropriate weight regularization that arises from the use of priors on the model’s parameters [44]. Thus, the MAP estimation minimizes a regularized form of the previously considered $L_{2}$ deterministic loss function that was defined based on the discretized PDE residual.

4.2 Posterior Approximation

The Bayesian paradigm seeks to represent model uncertainty by marginalizing out the model’s parameters which results in the predictive distribution. This marginalization is often not analytically tractable and is usually approximated with Monte Carlo sampling of the posterior. In earlier works, Bayesian DNNs focused on the use of Monte Carlo or ensemble based methodologies [50, 51, 9]. However, with the number of parameters in such models growing exponentially larger over recent years, such traditional methods are computationally intractable. As a result, many recent Bayesian deep learning frameworks focus on variational methods that fit a proposal distribution over the true posterior of the model’s parameters. Variational ideas have led to multiple developments including: Bayes by back-prop [52], Bayesian dropout approximation [53] and Stein variational gradient descent [54].

For sampling the posterior of BAR-DenseED, we will use a recently proposed Stochastic Weight Averaging Gaussian (SWAG) [55]. SWAG is an approximate Bayesian method that builds upon the Stochastic Weight Averaging (SWA), an optimization methodology where running averages of model parameters are kept during the stochastic gradient descent (SGD) procedure [56, 57, 58]. SWAG approximates the posterior in two phases:

The model of interest is first trained using traditional machine learning methods to minimize the negative log of the posterior defined in Eq. (21) (equivalent to solving for the MAP estimate). Specifically in this work, we optimize the model using the ADAM [48], an extension of SGD, with exponential learning rate decay. 2. 2.

Once the model has been trained, SGD is ran again at a constant learning rate. During this process, samples of the model’s parameters are collected. The core idea is to use SGD to explore the local support region of the MAP estimate. These SGD iterations can provide useful information about the form of the posterior which can then be used to approximate the posterior density function for full Bayesian inference.

In SWAG, the posterior over the BAR-DenseED parameters is approximated as a Gaussian distribution with $S$ samples of the model’s parameters:

[TABLE]

which is standard when using the Laplace approximation. We note that the noise precision, $\beta$ , has a log-normal posterior approximation to ensure that it is positive. The mean and the covariance are approximated using the model parameters proposed by SGD:

[TABLE]

where $\bm{\theta}_{i}$ are the model parameters at epoch $i$ , and $\bm{D}\in\mathbb{R}^{K\times H}$ is a deviation matrix consisting of the $H$ most recent parameter samples forming a low-rank approximation. $\bm{D}_{i}\in\mathbb{R}^{K}$ is a column of this deviation matrix. This results in a simple sampling method outlined in Algorithm 3 approximating the posterior of the model parameters for a time series prediction. While many other Bayesian approaches exist, we selected to use SWAG for three main reasons: its simplicity in both formulation and implementation, its non-invasive nature to the learning of the neural network and its low computational overhead compared to other Bayesian approaches. These factors are important for the learning of time series problems since learning generally becomes significantly more expensive and difficult.

Of significant importance is the sampling learning rate, which is directly related to both the shape as well as the convergence rate of the posterior. As discussed in [55], this learning rate should be large enough to sufficiently explore the support region of the minima the model has converged to. However, $\eta_{swag}$ should not be too large such that the model potentially jumps to other local minima during the collection of samples to approximate the posterior. This is due to the use of a single mode Gaussian as the approximate density for which multimodal data cannot be handled robustly.

*Remark 5**.*

While vanilla stochastic gradient descent is shown here for simplicity of illustrating the SWAG algorithm, fundamentally, one can use other optimization methods as well to collect SWAG samples such as gradient descent with momentum [55]. Similar to the use of various Markov chain Monte Carlo methods for approximating density functions, different stochastic optimization methods can be used to approximate the density of the posterior of the neural network.

4.3 Predictive Statistics

As previously discussed, predictive statistics in a Bayesian framework are obtained by marginalizing or integrating out the parameters of the model. Given the previous Markov assumption, the predictive distribution can be posed in terms of a single arbitrary time-step. We will approximate the marginalization of the model parameters, often referred to as Bayesian model averaging, using Monte Carlo with $P$ parameter samples:

[TABLE]

where $\bm{\chi}^{*}$ and $\hat{\bm{u}}^{*}$ are the predictive inputs and outputs, respectively. The posterior, $p(\bm{\theta}|\mathcal{S})\equiv p(\textbf{w},ln(\beta)|\mathcal{S})$ , has been approximate by SWAG and is easily sampled from. The predictive expectation can be obtained as follows:

[TABLE]

where the additive output noise is not present due to its zero-mean Gaussian density. The predictive conditional covariance can also be obtained in a similar fashion:

[TABLE]

where $\mathbb{E}\left[\bm{u}^{*}|\bm{\chi}^{*},\mathcal{S}\right]$ has been defined in Eq. (27). To predict an entire time series, each model is sampled at the first time-step and auto-regressed forward in time independently. Each set of parameters sampled from the posterior can be interpreted as an individual particle that is propagated forward in time. Thus as we predict further in time, we should expect the predictive variance of the model to gradually increase.

5 Kuramoto-Sivashinsky Equation

The first physical system that we are interested in is the 1D Kuramoto-Sivashinsky (K-S) equation which is a fourth-order, nonlinear partial differential equation:

[TABLE]

where $\nu$ is referred to as the “hyper-viscosity” and is set to $\nu=1$ for the remainder of this section. The K-S equation is widely known for its chaotic behavior when the size of the periodic domain is sufficiently large (generally $L\geq 50$ ) in which the system becomes a spatio-temporally chaotic attractor [59]. The K-S PDE has attracted great interest as it serves as a prototypical problem for studying complex dynamics with its chaotic regime being weakly turbulent (as opposed to strong turbulence seen in the Navier-Stokes equations) [60, 61]. Several physical systems such as chemical phase turbulence, plasma ion instabilities and flame front instabilities have all seen the K-S equation arise within them [62, 63, 64]. For our problem of interest, we take the domain $L$ to be $[0,22\pi]$ putting the system well within its chaotic regime. The domain is discretized by $96$ uniform cells and time-step is $\Delta t=0.1$ . Two sample responses of the K-S equation for two different initial conditions are illustrated in Fig. 4. During training and testing, we will ignore the initial transient state thus our initial conditions will be already fully developed “turbulence” ( $T\geq 100$ ).

Our goal is for AR-DenseED to predict the chaotic response of the system accurately thus illustrating the potential of this model to predict physical dynamics. In the past, others have attempted to model this system by machine learning methods. Recently, in Pathak et al. [66] reservoir computing was used to predict the K-S system, however the model is trained on the past history of a specific state. Thus the model learns only for a specific initial condition. The recent formulations of physics-informed neural networks in Raissi et al. [15, 39] have been able to work for learning a specific initial condition without training data, however these models have yet to be shown effective as a predictive surrogate.

The AR-DenseED used for the K-S equation consists of a single convolutional block, followed by a dense block, followed by a deconvolutional block resulting in a model with just over $4800$ learnable weights. The two previous states of the system are used as inputs, $\bm{\chi}^{n+1}=\left\{\bm{u}^{n},\bm{u}^{n-1}\right\}$ . Similar to the numerical solver, the time-step size of the model is set to $\Delta t=0.1$ with a spatial discretization of $96$ points. As previously discussed, the negative log of the joint posterior in Eq. (21) is the loss function. For the physics-constrained loss function, the implicit Crank-Nicolson time integration is used and the remaining spatial gradients are discretized as follows:

[TABLE]

where the spatial gradients are approximated using fourth-order accurate finite difference discretizations that are implemented efficiently using convolutional operators. The model was trained for $100$ epochs using $2560$ training scenarios that were generated using a truncated Fourier series with random coefficients discussed in A.1. This Fourier series serves to approximate the physical turbulence of the system, and thus estimating the true distribution of the possible initial states $p(\bm{u}_{0})$ . During test time, we use $200$ test cases of true turbulent initial states of the system obtained from a numerical simulator. This will demonstrate how one can use approximated training scenarios and physics-constrained learning to train a model that can be used on a true realization of the system. The training scenarios were mini-batched with a batch size of $256$ . During training the model was allowed to unroll itself in time up to $1000$ time-steps to allow AR-DenseED to thoroughly explore the turbulent dynamics. Training on a single 1080Ti GPU took approximately $1.5$ wall-clock hours. Additional details on the model and training parameters are discussed in A.

5.1 AR-DenseED Deterministic Predictions

We start with the prediction of the deterministic AR-DenseED model. Predictive results are shown in Fig. 5 for three test initial conditions compared against a numerical solver. All predictions are obtained by only providing the initial state and evolving the system with $1000$ consecutive iterations of the neural network. Overall the results are very impressive and are significant improvements compared to past literature despite only using a single initial state to predict. The model is able to maintain consistency with the numerical solver for between $t=[0,30]$ , but then diverges due to small prediction error that causes the model to shift its response as a result of the systems’ chaotic nature. However, the predicted system remains qualitatively reasonable and stable for even extended times which is a significant advantage of our auto-regressive formulation.

For each test case, we calculate the spatial mean square error (MSE) at each time-step defined as:

[TABLE]

where $N$ , $u$ and $u^{*}$ are the total number of points used to discretize the domain, target value from the numerical simulator and the AR-DenseED prediction, respectively. The mean of this error value is shown in Fig. 6, where we can see the decay in the model accuracy at around $t=20$ until AR-DenseED has fully diverged from the numerical simulator by $t>40$ . However, the predictions in Fig. 5 still appear to be physical despite not matching the numerical simulator. To illustrate that the model predicts physical turbulence, we compare the average energy spectral density in Fig. 7 for a randomly selected test case. Since this is a turbulent statistic, the energy density profile is the same regardless of the particular initial condition. This statistic was obtained by averaging time-steps between $t=[0,500]$ . This means our AR-DenseED is stable in its predictions for at least $5000$ time-steps, far beyond its training time-range. This would not be possible with the traditional fully connected neural network approach for solving PDEs. The AR-DenseED is accurate with the simulation results for the larger wavelengths where the majority of the energy is concentrated. Additionally the AR-DenseED is able to correctly generate turbulence with the greatest energy at a similar wavelength as the simulation (Hz $\in[0.1,0.15]$ ). While the model and simulation results begin to deviate from the numerical solution for smaller wavelengths, the energy decays at these higher frequencies meaning that the absolute error between the model and simulation is significantly less. Thus we confidently conclude that the AR-DenseED has truly learned to predict physical turbulence of the K-S equation.

The average prediction wall-clock time of both the spectral ETDRK4 scheme versus the AR-DenseED are given in Table 1. In this situation, the AR-DenseED fails to be a computational effective surrogate model compared to the highly-efficient spectral method which is able to take advantage of fast Fourier transform. However, the K-S system was able to provide an excellent illustration of how AR-DenseED can model complex, non-linear, chaotic systems. We will show in the following sections how AR-DenseED can be computationally more efficient than traditional numerical solvers.

5.2 BAR-DenseED Probabilistic Predictions

To approximate the posterior with SWAG, $100$ samples of the model’s parameters were collected. This yielded reasonably diverse but accurate models and was found to be enough samples for $\bm{\theta}_{SWA}$ to converge. During this period the learning rate was lowered to $1e-10$ for the neural network weights and $1e-6$ for the additive output noise. While these learning rates may appear too small to sufficiently explore the local loss surface, for an auto-regressive model this was discovered to be a necessity as even very small changes to the parameters can have profound response changes during test time. Larger learning rates for SWAG sampling were found to produce models that were unstable.

We plot eight samples from the approximate posterior in Fig. 8 for a single test case with the target result in the top left. Due to the chaotic nature of the K-S system or the so called “butterfly effect”, samples from the posterior start with a similar response up for $t<40$ and deviate for larger time values producing completely unique responses. Similar to the deterministic case, we calculate the mean squared error defined in Eq. (37) using the expected predictive response using $30$ model samples for $200$ test cases and plot the mean error value at each time-step in Fig. 6. Although, it appears BAR-DenseED performs much better than AR-DenseED for later time-steps this is due to the field being averaged out to around zero. Thus the predictive performance between the deterministic and Bayesian model is essentially equivalent in this case. We can propagate this uncertainty to the averaged spectral energy density illustrated in Fig. 7 where $30$ model samples are used to calculate the spectral density. At the largest wave lengths ( $Hz<0.3$ ), we can see that the model has reasonable error bars that are able to capture the true solution. For smaller wave-lengths, the predicted energy density appears to be consistent between the samples.

6 1D Viscous Burgers’ Equation

Let us now consider the 1D viscous Burgers’ equation in a periodic domain:

[TABLE]

where $u$ is the velocity and $\nu$ is the viscosity. The Burgers’ equation is a fundamental PDE that arises in multiple areas ranging from fluid dynamics to traffic flow. It is most recognized for its characteristic shock formations [68]. While cases of the 1D Burgers’ equation have been recovered by machine learning models in the past [15], ultimately these have been for relatively simple initial conditions consisting of a single shock. Here, we would like to model much more complex dynamics by having a variable initial condition that contains multiple waves. Consider a domain $x\in[0,1]$ with a constant viscosity of $\nu=0.0025$ and the random initial condition given by a Fourier series with random coefficients:

[TABLE]

where $a_{l},b_{l}\sim\mathcal{N}(0,1)$ , $L=4$ and $c\sim\mathcal{U}(-1,1)$ . In [69], the authors modeled a simpler random initial condition for the 1D viscous Burgers’ system using a LSTM based model with some success. However, this work reduced the complexity of the problem system by learning a reduced-order model rather than the true system. Simulated system responses for several initial conditions are shown in Fig. 9. We can see that the underlying dynamics of the system are fairly complex due to multiple shocks forming and then combining at later time-steps. Shocks that intersect then combine and move in a different trajectory, making this a difficult system to predict accurately.

The auto-regressive model used for the 1D Burgers’ system is similar to the one used for the K-S system in Section 5 with a few modifications. For this system the five previous time-steps are used as inputs, $\bm{\chi}^{n+1}=\left\{\bm{u}^{n},\bm{u}^{n-1},\ldots,\bm{u}^{n-4}\right\}$ resulting in about $13000$ learnable weights. The time-step value of the model is $\Delta t=0.005$ with a spatial discretization of $512$ points. This places the CFL number of the model well above one based on the maximum potential velocity for the specified random initial condition. Again the negative log of the joint posterior in Eq. (21) is the loss function with the implicit Crank-Nicolson time integration. The remaining spatial gradients are discretized as follows:

[TABLE]

where the spatial gradients are approximated using second-order accurate approximations that are implemented efficiently using convolutional operators. The model was trained for $100$ epochs with $2560$ training scenarios randomly sampled from Eq. (40) and allowed to unroll a maximum of $200$ time-steps from its initial state. Another $200$ samples from Eq. (40) are used as a test set for assessing the models performance. Additional details on the model and training can be found in B.

6.1 AR-DenseED Deterministic Predictions

During testing we use the trained AR-DenseED to predict $400$ time-steps from its initial state. This means that half of its prediction ( $t>1.0$ ) is extrapolation beyond the time range used during training. Four test cases are plotted in Fig. 10, from which we can see that the AR-DenseED is able to predict this system accurately without any training data. The target response is a high-fidelity finite element method (FEM) simulation at time-step size $\Delta t=0.001$ , which is five times smaller than our surrogate model. Overall, the AR-DenseED is able to accurately predict the shock formations and intersections with very distinct shock discontinuities. In our past work [3], we have found that convolutional neural networks like AR-DenseED can predict very sharp features much better than fully-connected models. Additionally, we can see the reason why this system is difficult for surrogate modeling as one slight miscalculation in the shock intersection can result in compounding error. This is illustrated in Fig. 10 where in some of the test cases the model gives a good prediction but a slight miscalculation in the shock intersection results in a growing error. The excellent extrapolation capabilities of AR-DenseED are also shown for which the model is able to yield accurate predictions far beyond its initial training range.

Now we consider the full $200$ test cases with target solutions provided by the high-fidelity FEM simulation. For each test case, we calculate the spatial mean square error (MSE) defined in Eq. (37). In Fig. 11, we plot the mean and median of the MSE for the entire test set. We can see an initial spike in the error during the initial shock formations/intersections that then decays as the system also decays. However, the MSE can be slightly misleading to the actual quality of the prediction for this system since a small deviation in shock trajectory can potentially yield a growing error. Thus, we also compute the energy square error (ESE) for a 1D domain:

[TABLE]

which instead captures the discrepancy of the total energy, $u^{2}/2$ , within the domain, making this metric invariant to shock location. Similarly, we plot the mean and median of the ESE for the $200$ test cases in Fig. 11. For the ESE, we also see an initial spike followed by stable performance until the extrapolation region where the error begins to grow. This behavior is to be expected as the model moves further from the initial training range. Additionally, for both plots, we note that the mean error is consistently higher than the median which is a clear indication of outlier test cases that perform extremely poorly compared to the majority. Unfortunately, this is a core drawback of the auto-regressive approach; if the initial prediction is poor this error will only grow as time progresses. Finally, we compare the prediction computational cost of this surrogate model with both FEM and finite difference method (FDM) using fourth-order Runge-Kutta time integration in Table 2. AR-DenseED is significantly computationally cheaper than the traditional methods, making it a potentially useful surrogate.

6.2 BAR-DenseED Probabilistic Predictions

For the posterior approximation, $90$ samples of the networks parameters were collected using a learning rate of $4e-8$ . Several samples from the posterior are illustrated in Fig. 12 where slight differences in the predictions in earlier times change the final location of the wave at later times. The predictive expectation and variance computed using $30$ model samples is plotted for four test cases in Fig. 13. We can clearly see that the bulk of the variance in the predictions occurs precisely where the shocks form/intersect as expected. Similar to the deterministic model, we also plot the mean squared error and energy squared error of a test set of $200$ cases using the expected prediction of BAR-DenseED in Fig. 11. The Bayesian framework is able to generally out perform the deterministic case. One of the main reasons for this increase in accuracy is the reduction of outliers where AR-DenseED would yield an unsatisfactory prediction for only a small percentage of test cases. Due to the model averaging in the predictive expectation, BAR-DenseED is able to more robustly handle these test cases where some individual predictions may be very poor.

To better understand the uncertainty of BAR-DenseED and how it changes as the time series progresses, we plot several instantaneous profiles for two randomly selected test cases in Figs. 14 and 15. The first three profiles at $t=0.10$ , $t=0.25$ and $t=0.75$ are within the time-range that was used during training. The last profile at $t=1.50$ is considered extrapolation as it lies outside the training time-range. In addition to the BAR-DenseED profile, the predicted solutions of the numerical solvers discussed in Table 2 are also shown. We note that there is a clear prediction discrepancy between the FEM and FDM solutions, largely due to different numerical discretizations. For this system, we hold the FEM solution as the higher accuracy method.

From both test cases, we can observe several important trends: the first is that for the earlier times the model compares well with the numerical solvers. Second, as the shocks form, we can notice large spikes in the standard deviation at these locations which corresponds to the behavior seen in Fig. 13. Finally, we see the range of the error bars increases as the model begins to extrapolate indicating the model is less confident as it moves farther from its training range. However, it is clear that in the extrapolated regions the model is still aware of the correct structure with the error bars capturing the true solution.

7 2D Coupled Burgers’ Equation

Lastly, we will consider the 2D coupled Burgers’ system:

[TABLE]

which when expanded into its components takes the following form:

[TABLE]

where $\nu$ is the viscosity of the system which will be held at $\nu=0.005$ and the domain size set to $\left\{x,y\right\}\in[0,1]$ . $u$ and $v$ are the $x$ and $y$ velocity components, respectively. The 2D coupled Burgers’ equation is an excellent benchmark PDE due to both its non-linear term as well as diffusion operator, making it much more complex than the standard advection or diffusion equations. The 2D coupled Burgers’ belongs to a much broader class of PDEs that are related to various physical problems including shock wave propagation in viscous fluids, turbulence, super-sonic flows, acoustics, sedimentation and airfoil theory. Given its similar form, the coupled Burgers’ equation is often regarded as an essential stepping-stone to the full Navier-Stokes equations [71, 72].

As in our previous examples, we are interested in surrogate modeling for various initial conditions. We will initialize the field using a truncated Fourier series with random coefficients:

[TABLE]

where $\bm{a}_{ij},\bm{b}_{ij}\sim\mathcal{N}(0,\bm{I}_{2})$ , $L=4$ and $\bm{c}\sim\mathcal{U}(-1,1)\in\mathbb{R}^{2}$ . Several of these randomly generated initial conditions are illustrated in Fig. 16. While these initial conditions may appear similar, the evolution of the systems results in the forming of many complex and unique structures. To illustrate this, we plot several time-steps two FEM simulations of different initial conditions in Fig. 17. As the system develops, we can see very distinct structures forming as waves form, mix, interact and dissipate with each other. This is why the 2D coupled Burgers’ system is a difficult system to model and serves as an excellent benchmark for our proposed surrogate.

The AR-DenseED model used for the 2D coupled Burgers’ equations is the largest of the examples shown in this work with about $72000$ learnable parameters. However, in the scope of the deep learning field over the past several years this network is still light weight. Both the $x$ and $y$ velocity components are predicted by the same model in the form of two output channels. Similarly both velocity components from three previous time-steps are used as inputs, $\bm{\chi}^{n+1}=\left\{\bm{u}^{n},\bm{v}^{n},\bm{u}^{n-1},\bm{v}^{n-1},\bm{u}^{n-2},\bm{v}^{n-2}\right\}$ , resulting in six input channels. The time-step value of the model is $\Delta t=0.005$ with a spatial discretization of $64\times 64$ points. Again the negative log of the joint posterior in Eq. (21) is the objective function with the implicit Crank-Nicolson time-integrator and other spatial gradients of the physics-constrained loss being discretized as:

[TABLE]

where the spatial gradients are approximated using Sobel filter 2D convolutions which are analogous to smoothed second-order accurate finite difference approximations [73]. Using the Sobel filter was found to increase the stability of training and significantly reduce spurious oscillations in the model’s predictions. The model was trained on $5120$ training scenarios sampled from Eq. (50) with a mini-batch size of $128$ . AR-DenseED was optimized for $100$ epochs and allowed to unroll a maximum of $100$ time-steps from its initial state. Training on a single 1080Ti GPU required $9$ wall-clock hours. To assess the performance of the model, another set of $200$ samples from Eq. (50) was used as testing scenarios. Additional details on the model and training can be found in C.

7.1 AR-DenseED Deterministic Predictions

For the 2D couple Burgers’ system, the trained AR-DenseED is used to predict $200$ time-steps from the initial state at $t=0$ . Similar to the 1D Burgers’ test case, this means that half of this predicted region ( $t>0.5$ ) is extrapolation beyond the time range that the model was trained on. The target response is a high-fidelity FEM simulation with a discretization of $128\times 128$ interpolated to a $64\times 64$ grid. The AR-DenseED’s predictions are shown for two test cases in Figs. 18 and 19 in which several instantaneous time-steps are plotted. Overall, the model does a remarkable job accurately predicting the complex structures and discontinuities of this system even into the extrapolation region. As expected the bulk of the error is concentrated on shock interfaces and wave fronts, however, the model’s extremely good predictive capability is clear.

The mean squared error defined in Eq. (37) as well as the energy squared error defined in Eq. (45), both generalized to two dimensions, are evaluated for each time-step for the $200$ test scenarios. The mean and the median of these error values for the entire test set are plotted in Fig. 20. AR-DenseED is shown to have very stable prediction error within the training time region. When AR-DenseED performs extrapolatory predictions at $t>0.5$ , we can see that the mean error begins to grow faster than the median suggesting that several outlier test cases have very poor predictions. We also compared the prediction computational cost of AR-DenseED for this 2D system again a set of FEM solutions of various spatial discretizations in Table 3. Here, we can see that the AR-DenseED is able to far outperform the FEM simulations with a wall-clock time that is significantly less than all simulations by several orders of magnitude. When one takes full advantages of a GPU hardware accelerator, this speed up becomes even larger. Additionally this neglects the ability of the neural network to easily compute multiple simulations in a single batch, making its predictive efficiency even greater. This illustrates the true potential power of these deep convolutional surrogate models.

7.2 BAR-DenseED Probabilistic Predictions

To approximate the posterior with SWAG, $90$ samples of the network’s parameters were collected with a learning rate of $3e-8$ for the neural network parameters and $3e-5$ for the output noise $\beta$ . Predictions of several posterior samples at times $t=0.1$ and $t=0.5$ are shown in Figs. 21 and 22, respectively. As expected, the variance of the samples increases significantly as time progresses. This is also reflected in the BAR-DenseED prediction contours for two test cases in Figs. 23 and 24 in which the predictive expectation and variance computed using $30$ model samples are shown for several time-steps. Again we see that the majority of the uncertainty is concentrated on the leading face of the shocks/waves which is precisely where we would expect it for a well calibrated model. To more clearly illustrate the uncertainty estimates of the probabilistic model, velocity profiles of both velocity components are plotted in Fig. 25 for a randomly selected test case. Overall, we can see that the predictive standard deviation is able to capture the true solution for almost all times. Finally, the mean squared error and energy squared error are also plotted using the predictive expectation of BAR-DenseED in Fig. 20. The Bayesian model is able to have smaller and more stable prediction error even when extrapolating beyond the training time range.

8 Conclusion

In this work, we have presented a deep auto-regressive convolutional neural network model that can be used to learn and surrogate model the dynamics of transient PDEs. To train this model, physics-constrained deep learning is used where the governing equations of the system of interest are used to formulate a loss function. This allows the model to be trained with zero (output) training data. Additionally, we proposed a Bayesian probabilistic framework built on top of this deep learning model to allow for uncertainty quantification (including both epistemic and aleatoric uncertainty). This model was implemented for three PDE systems: the first is the chaotic Kuramoto-Sivashinsky equation for which the model was used to accurately reproduce physical turbulent statistics. The second is the 1D Burgers’ equation at a low viscosity where the model was able to successfully predict multi-shock wave formation and intersections. At last is the 2D coupled Burgers’ equations for which the model was able to accurately predict the complex wave dynamics of this system. Overall, the proposed model showed exceptional predictive accuracy and was able to successfully extrapolate to predict outside the time-range used when training.

Although fully connected networks are frequently used to solve PDE systems due to their analytical and mesh-less benefits, the performance of convolutional neural networks for solving and surrogate modeling of PDEs is exceptional. In this work, we have further shown that convolutional neural networks can be used effectively in physics-constrained learning and build surrogate models that are order of magnitudes faster than state-of-the-art numerical solvers. A particular draw-back of convolutional neural networks is the requirement that both spatial and temporal derivatives be discretized, which opens the model up to the challenges that are faced in traditional numerical algorithms such as truncation error, oscillations, convergence criterion and more. However, one can also use the deep repository of techniques and tricks developed by the numerical analysis community to address these potential issues. This would be an interesting avenue to investigate as one could incorporate methods such as flux limiters or non-oscillatory schemes to yield predictions that have similar numerical benefits. One could also consider higher-order derivatives and their impact on accuracy versus training stability.

The most obvious path to further develop this model is to implement it for more complex and larger systems. This could include systems such as the Navier-Stokes equations, coupled transport through porous media, combustion and more. However, there are still significant challenges that will need to be addressed. The most important is training cost; with any time series problem training a deep learning model becomes exponentially more difficult and more costly. Although our model is able to be trained in a very reasonable amount of time given the complexity of the physical systems modeled as well as the hardware used, improving the training of the model will still be an important area of study. This may involve the use of network architectures considered in recent neural language processing literature such as self-attention mechanisms. Another potential extension is the incorporation of data and physics-constrained learning to create this hybrid learning framework. Specifically for time series, one may not have the system state at every time interval that is desired. Physics-constrained learning could be an answer to help bridge this challenge of predicting at fine resolutions with sparse data.

Acknowledgements

The authors acknowledge support from the Defense Advanced Research Projects Agency (DARPA) under the Physics of Artificial Intelligence (PAI) program (contract HR $00111890034$ ). The work of NG is also supported by the National Science Foundation (NSF) Graduate Research Fellowship Program grant No. DGE- $1313583$ . Additional computing resources were provided by the University of Notre Dame’s Center for Research Computing (CRC), NSF supported “Extreme Science and Engineering Discovery Environment” (XSEDE) on the Bridges and Bridges-GPU cluster through research allocation No. TG-CTS $180038$ and by the AFOSR Office of Scientific Research through the DURIP program.

Appendix A Kuramoto-Sivashinsky

The following appendix discusses details related to the model used to predict in the Kuramoto-Sivashinsky equation in Section 5. For this system, a small dense encoder-decoder model was used as depicted in Fig. 26. The three components of this model are the encoding convolution, the dense block and decoding block. The resulting model encodes a given 1D input $\left\{\bm{u}^{n},\bm{u}^{n-1}\right\}\in\mathbb{R}^{d}$ to a set of latent variables that are of dimensionality $\bm{z}_{i}\in\mathbb{R}^{d/2}$ . These latent variables are then decoded to the prediction $\bm{u}^{n+1}\in\mathbb{R}^{d}$ . Examples of a dense block and decoding block are shown in Fig. 27 and Fig. 28 originally proposed in Zhu and Zabaras [2].

While this model is relatively small, we found that smaller models were more stable in training and had the additional benefit of being faster. To help with learning periodic boundaries, circular padding was used for all convolutions within the model. The model was optimized with ADAM [48] with an initial learning rate of $1e-3$ . It was found that decaying the learning rate exponentially yielded the most stable, consistent and accurate results. Additional model training parameters can be reference in Table 4.

A.1 Training Initial States

To train the model, we need to provide it a set of initial states or training scenarios to start at. Simulator data could be used for this, however to stay consistent with the zero training data philosophy, we chose to use a truncated Fourier series with random coefficients. We propose the use of:

[TABLE]

where $a_{0}$ is the a priori mean amplitude estimate (set to $2.5$ ), $L$ is the domain size and $k_{0}$ is the number of unstable modes which can be estimated given the domain length such that $k_{0}=L/(2\sqrt{2}\pi)$ [76]. This function is designed to provide a physically realizable initial condition for AR-DenseED to explore the physics of the K-S system.

Appendix B 1D Viscous Burgers’ System

The following appendix discusses details related to the model used to predict the 1D Burgers’ equation in Section 6. Of the three systems present in this paper, the 1D viscous Burgers’ system was found to be the most difficult to train. This is most likely due to the sharp gradients in this system which carries over to the loss function resulting in exploding gradients during optimization. The model for this system, depicted in Fig. 29, is approximately three times as large as the model used for the K-S system. Similar to the K-S model, the 1D-Burgers model encodes a given 1D input $\left\{\bm{u}^{n},\bm{u}^{n-1}...,\bm{u}^{n-4}\right\}\in\mathbb{R}^{d}$ to a set of latent variables that are of dimensionality $\bm{z}_{i}\in\mathbb{R}^{d/2}$ . These latent variables are then decoded to the prediction $\bm{u}^{n+1}\in\mathbb{R}^{d}$ . More time-steps were used in the model input compared to the K-S system to increase learning stability. Interestingly, if the model was too large, SWAG would fail to approximate a good posterior regardless of the learning rate resulting in sampled models being unstable during prediction. Additional training parameters are listed in Table 5.

Appendix C 2D Coupled Burgers’ System

The following appendix discusses details related to the model used to predict the 2D coupled Burgers’ equation in Section 7. As shown in Figure 30, the model used for the 2D Burgers’ system is very similar to the other two test cases. The key difference is that the model now predicts two values: the $x$ and $y$ velocity components. Similarly the model uses both velocity components as inputs. Thus when three previous time-steps are used in $\bm{\chi}^{n+1}$ the model has six input channels. This model encodes a given 2D input $\left\{\bm{u}^{n},\bm{v}^{n},\bm{u}^{n-1},\bm{v}^{n-1},\bm{u}^{n-2},\bm{v}^{n-2}\right\}\in\mathbb{R}^{d\times d}$ to a set of latent variables that are of dimensionality $\bm{z}_{i}\in\mathbb{R}^{d/2\times d/2}$ . These latent variables are then decoded to the prediction $\left\{\bm{u}^{n+1},\bm{v}^{n+1}\right\}\in\mathbb{R}^{d\times d}$ . Although this system has some of the most complex dynamics, to our surprise we found that it was the easiest to train with significantly better stability and consistency than the other 1D systems. We largely attribute this to the use of Sobel filters to approximate the gradients which help dampen higher frequencies preventing oscillations. Additional training parameters are listed in Table 6.

Bibliography76

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Y. Le Cun, Y. Bengio, G. Hinton, Deep learning, nature 521 (7553) (2015) 436. doi:10.1038/nature 14539 . · doi ↗
2[2] Y. Zhu, N. Zabaras, Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification , Journal of Computational Physics 366 (2018) 415 – 447. doi:10.1016/j.jcp.2018.04.018 . URL http://www.sciencedirect.com/science/article/pii/S 0021999118302341 · doi ↗
3[3] Y. Zhu, N. Zabaras, P.-S. Koutsourelakis, P. Perdikaris, Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data , Journal of Computational Physics 394 (2019) 56 – 81. doi:10.1016/j.jcp.2019.05.024 . URL http://www.sciencedirect.com/science/article/pii/S 0021999119303559 · doi ↗
4[4] R. K. Tripathy, I. Bilionis, Deep UQ: Learning deep neural network surrogate models for high dimensional uncertainty quantification , Journal of Computational Physics 375 (2018) 565 – 588. doi:10.1016/j.jcp.2018.08.036 . URL http://www.sciencedirect.com/science/article/pii/S 0021999118305655 · doi ↗
5[5] C. Yang, X. Yang, X. Xiao, Data-driven projection method in fluid simulation , Computer Animation and Virtual Worlds 27 (3-4) (2016) 415–424. ar Xiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cav.1695 , doi:10.1002/cav.1695 . URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cav.1695 · doi ↗
6[6] N. Geneva, N. Zabaras, Quantifying model form uncertainty in Reynolds-averaged turbulence models with Bayesian deep neural networks , Journal of Computational Physics 383 (2019) 125 – 147. doi:10.1016/j.jcp.2019.01.021 . URL http://www.sciencedirect.com/science/article/pii/S 0021999119300464 · doi ↗
7[7] M. Schöberl, N. Zabaras, P.-S. Koutsourelakis, Predictive collective variable discovery with deep Bayesian models , The Journal of Chemical Physics 150 (2) (2019) 024109. doi:10.1063/1.5058063 . URL https://doi.org/10.1063/1.5058063 · doi ↗
8[8] D. J. Mac Kay, A practical Bayesian framework for backpropagation networks , Neural computation 4 (3) (1992) 448–472. doi:10.1162/neco.1992.4.3.448 . URL https://doi.org/10.1162/neco.1992.4.3.448 · doi ↗

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Modeling the Dynamics of PDE Systems with Physics-Constrained Deep Auto-Regressive Networks

Abstract

keywords:

1 Introduction

2 Problem Definition

2.1 Surrogate Modeling of Dynamical Systems

Definition 2.1**.**

Definition 2.2**.**

3 Auto-regressive Dense Encoder-Decoder

Remark 1*.*

3.1 Physics-Constrained Loss Function

Remark 2*.*

Proposition 1*.*

3.2 AR-DenseED Training

4 Bayesian AR-DenseED

4.1 Posterior Formulation

Remark 3*.*

Remark 4*.*

4.2 Posterior Approximation

Remark 5*.*

4.3 Predictive Statistics

5 Kuramoto-Sivashinsky Equation

5.1 AR-DenseED Deterministic Predictions

5.2 BAR-DenseED Probabilistic Predictions

6 1D Viscous Burgers’ Equation

6.1 AR-DenseED Deterministic Predictions

6.2 BAR-DenseED Probabilistic Predictions

7 2D Coupled Burgers’ Equation

7.1 AR-DenseED Deterministic Predictions

7.2 BAR-DenseED Probabilistic Predictions

8 Conclusion

Acknowledgements

Appendix A Kuramoto-Sivashinsky

A.1 Training Initial States

Appendix B 1D Viscous Burgers’ System

Appendix C 2D Coupled Burgers’ System

Definition 2.1.

Definition 2.2.

*Remark 1**.*

*Remark 2**.*

*Proposition 1**.*

*Remark 3**.*

*Remark 4**.*

*Remark 5**.*