Modeling of Missing Dynamical Systems: Deriving Parametric Models using   a Nonparametric Framework

Shixiao W. Jiang; John Harlim

arXiv:1905.08082·stat.ME·July 10, 2020

Modeling of Missing Dynamical Systems: Deriving Parametric Models using a Nonparametric Framework

Shixiao W. Jiang, John Harlim

PDF

TL;DR

This paper introduces a nonparametric, kernel-based framework for modeling missing dynamics in systems, which can produce parametric models and accurately estimate autocovariance functions, applicable to complex nonlinear dynamics.

Contribution

It develops a nonparametric modeling approach using RKHS that can derive parametric models and handle systems with varying time scales.

Findings

01

Accurately estimates autocovariance in linear Gaussian systems.

02

Replicates missing nonlinear dynamical terms in high-dimensional systems.

03

Aligns with classical averaging theory for fast-evolving dynamics.

Abstract

In this paper, we consider modeling missing dynamics with a nonparametric non-Markovian model, constructed using the theory of kernel embedding of conditional distributions on appropriate Reproducing Kernel Hilbert Spaces (RKHS), equipped with orthonormal basis functions. Depending on the choice of the basis functions, the resulting closure model from this nonparametric modeling formulation is in the form of parametric model. This suggests that the success of various parametric modeling approaches that were proposed in various domains of applications can be understood through the RKHS representations. When the missing dynamical terms evolve faster than the relevant observable of interest, the proposed approach is consistent with the effective dynamics derived from the classical averaging theory. In the linear Gaussian case without the time-scale gap, we will show that the proposed…

Equations145

d x

d x

d y

\overset{a}{^} (x, z_{t}) := E [a (x, Y) ∣ z_{t}] \hat{B} (x, z_{t}) := E [b (x, Y) b (x, Y)^{⊤} ∣ z_{t}],

\overset{a}{^} (x, z_{t}) := E [a (x, Y) ∣ z_{t}] \hat{B} (x, z_{t}) := E [b (x, Y) b (x, Y)^{⊤} ∣ z_{t}],

\overset{x}{^}_{t + 1} = \overset{x}{^}_{t} + \int_{t τ}^{(t + 1) τ} \overset{a}{^} (\overset{x}{^} (s), \hat{z}_{t}) d s + \int_{t τ}^{(t + 1) τ} \hat{B} (\overset{x}{^} (s), \hat{z}_{t})^{1/2} d W_{s},

\overset{x}{^}_{t + 1} = \overset{x}{^}_{t} + \int_{t τ}^{(t + 1) τ} \overset{a}{^} (\overset{x}{^} (s), \hat{z}_{t}) d s + \int_{t τ}^{(t + 1) τ} \hat{B} (\overset{x}{^} (s), \hat{z}_{t})^{1/2} d W_{s},

K f (y) = \int_{Y} K (y, y^{'}) f (y^{'}) q^{- 1} (y^{'}) d y^{'}, f \in L^{2} (Y, q^{- 1}) .

K f (y) = \int_{Y} K (y, y^{'}) f (y^{'}) q^{- 1} (y^{'}) d y^{'}, f \in L^{2} (Y, q^{- 1}) .

K (y, y^{'}) = k = 1 \sum \infty λ_{k} ψ_{k} (y) q (y) ψ_{k} (y^{'}) q (y^{'}) .

K (y, y^{'}) = k = 1 \sum \infty λ_{k} ψ_{k} (y) q (y) ψ_{k} (y^{'}) q (y^{'}) .

f (y) = ⟨ f, K (y, \cdot) ⟩_{H} = k = 1 \sum \infty \frac{f _{k} λ _{k} ψ _{k} ( y ) q ( y )}{λ _{k}} = k = 1 \sum \infty f_{k} ψ_{k} (y) q (y),

f (y) = ⟨ f, K (y, \cdot) ⟩_{H} = k = 1 \sum \infty \frac{f _{k} λ _{k} ψ _{k} ( y ) q ( y )}{λ _{k}} = k = 1 \sum \infty f_{k} ψ_{k} (y) q (y),

p (y ∣ z) = k = 1 \sum \infty ⟨ p (\cdot ∣ z), ψ_{k} q ⟩_{L^{2} (Y, q^{- 1})} ψ_{k} (y) q (y) = k = 1 \sum \infty ⟨ p (\cdot ∣ z), ψ_{k} ⟩_{L^{2} (Y)} ψ_{k} (y) q (y),

p (y ∣ z) = k = 1 \sum \infty ⟨ p (\cdot ∣ z), ψ_{k} q ⟩_{L^{2} (Y, q^{- 1})} ψ_{k} (y) q (y) = k = 1 \sum \infty ⟨ p (\cdot ∣ z), ψ_{k} ⟩_{L^{2} (Y)} ψ_{k} (y) q (y),

⟨ p (\cdot ∣ z), ψ_{k} ⟩_{L^{2} (Y)} = E_{Y ∣ z} [ψ_{k} (Y)] = l = 1 \sum \infty [C_{Y Z} C_{Z Z}^{- 1}]_{k l} φ_{l} (z),

⟨ p (\cdot ∣ z), ψ_{k} ⟩_{L^{2} (Y)} = E_{Y ∣ z} [ψ_{k} (Y)] = l = 1 \sum \infty [C_{Y Z} C_{Z Z}^{- 1}]_{k l} φ_{l} (z),

[C_{Y Z}]_{k s} = E_{Y Z} [ψ_{k} (Y) \otimes φ_{s} (Z)], [C_{Z Z}]_{s l} = E_{Z Z} [φ_{s} (Z) \otimes φ_{l} (Z)] .

[C_{Y Z}]_{k s} = E_{Y Z} [ψ_{k} (Y) \otimes φ_{s} (Z)], [C_{Z Z}]_{s l} = E_{Z Z} [φ_{s} (Z) \otimes φ_{l} (Z)] .

p (y ∣ z) = k, l = 1 \sum \infty ψ_{k} (y) q (y) [C_{Y Z} C_{Z Z}^{- 1}]_{k l} φ_{l} (z) .

p (y ∣ z) = k, l = 1 \sum \infty ψ_{k} (y) q (y) [C_{Y Z} C_{Z Z}^{- 1}]_{k l} φ_{l} (z) .

[C_{Y Z}]_{k s} \approx \frac{1}{N} i = 1 \sum N ψ_{k} (y_{i}) φ_{s} (z_{i}), [C_{Z Z}]_{s l} \approx \frac{1}{N} i = 1 \sum N φ_{s} (z_{i}) φ_{l} (z_{i}) .

[C_{Y Z}]_{k s} \approx \frac{1}{N} i = 1 \sum N ψ_{k} (y_{i}) φ_{s} (z_{i}), [C_{Z Z}]_{s l} \approx \frac{1}{N} i = 1 \sum N φ_{s} (z_{i}) φ_{l} (z_{i}) .

\overset{a}{^} (\overset{x}{^}, \hat{z}_{t})

\overset{a}{^} (\overset{x}{^}, \hat{z}_{t})

\hat{B} (\overset{x}{^}, \hat{z}_{t})

E [a ∣ \hat{z}_{t}] = k, l = 1 \sum \infty \int_{Y} a (\overset{x}{^}, y) ψ_{k} (y) q (y) d y [C_{Y Z} C_{Z Z}^{- 1}]_{k l} φ_{l} (\hat{z}_{t}) = l = 1 \sum \infty A_{l} (\overset{x}{^}) φ_{l} (\hat{z}_{t}),

E [a ∣ \hat{z}_{t}] = k, l = 1 \sum \infty \int_{Y} a (\overset{x}{^}, y) ψ_{k} (y) q (y) d y [C_{Y Z} C_{Z Z}^{- 1}]_{k l} φ_{l} (\hat{z}_{t}) = l = 1 \sum \infty A_{l} (\overset{x}{^}) φ_{l} (\hat{z}_{t}),

A_{l} (\overset{x}{^})

A_{l} (\overset{x}{^})

\frac{1}{N} k = 1 \sum N ψ_{k} (y_{i}) ψ_{k} (y_{j}) = δ_{ij},

\frac{1}{N} k = 1 \sum N ψ_{k} (y_{i}) ψ_{k} (y_{j}) = δ_{ij},

U = ZV Σ^{- 1},

U = ZV Σ^{- 1},

E (a ∣ \hat{z}_{t}) = i = 1 \sum N s = 1 \sum L a (\overset{x}{^}, y_{i}) φ_{s} (z_{i}) φ_{s} (\hat{z}_{t}),

E (a ∣ \hat{z}_{t}) = i = 1 \sum N s = 1 \sum L a (\overset{x}{^}, y_{i}) φ_{s} (z_{i}) φ_{s} (\hat{z}_{t}),

E (Y ∣ \hat{z}_{t}) = Y^{⊤} U U_{new}^{⊤},

E (Y ∣ \hat{z}_{t}) = Y^{⊤} U U_{new}^{⊤},

E (Y ∣ \hat{z}_{t}) = Y^{⊤} Z V Σ^{- 1} Σ^{- 1⊤} V^{⊤} (\hat{z}_{t} - \overset{ˉ}{z})^{⊤} = (Y^{⊤} Z) (Z^{⊤} Z)^{- 1} (\hat{z}_{t} - \overset{ˉ}{z})^{⊤} .

E (Y ∣ \hat{z}_{t}) = Y^{⊤} Z V Σ^{- 1} Σ^{- 1⊤} V^{⊤} (\hat{z}_{t} - \overset{ˉ}{z})^{⊤} = (Y^{⊤} Z) (Z^{⊤} Z)^{- 1} (\hat{z}_{t} - \overset{ˉ}{z})^{⊤} .

d x

d x

d y

A_{\epsilon}=\left(\begin{array}[]{cc}a_{11}&a_{12}\\ \frac{1}{\epsilon}a_{21}&\frac{1}{\epsilon}a_{22}\end{array}\right)

A_{\epsilon}=\left(\begin{array}[]{cc}a_{11}&a_{12}\\ \frac{1}{\epsilon}a_{21}&\frac{1}{\epsilon}a_{22}\end{array}\right)

a = a_{11} - a_{12} a_{22}^{- 1} a_{21} < 0, a_{22} < 0,

a = a_{11} - a_{12} a_{22}^{- 1} a_{21} < 0, a_{22} < 0,

d \overset{x}{^}_{t} = a \overset{x}{^}_{t} d t + σ_{x} d W_{x},

d \overset{x}{^}_{t} = a \overset{x}{^}_{t} d t + σ_{x} d W_{x},

p (x, y, t) = \overset{ρ}{^} (x, t) ρ_{\infty} (y; x) + O (ϵ), t \geq 0,

p (x, y, t) = \overset{ρ}{^} (x, t) ρ_{\infty} (y; x) + O (ϵ), t \geq 0,

p (y ∣ x) := \frac{p _{e q} ( x , y )}{\int _{Y} p _{e q} ( x , y ) d y} = \frac{ρ ^ _{e q} ( x ) ρ _{\infty} ( y ; x ) + O ( ϵ )}{ρ ^ _{e q} ( x ) + O ( ϵ )} = ρ_{\infty} (y; x) + O (ϵ) .

p (y ∣ x) := \frac{p _{e q} ( x , y )}{\int _{Y} p _{e q} ( x , y ) d y} = \frac{ρ ^ _{e q} ( x ) ρ _{\infty} ( y ; x ) + O ( ϵ )}{ρ ^ _{e q} ( x ) + O ( ϵ )} = ρ_{\infty} (y; x) + O (ϵ) .

E [Y ∣ x]

E [Y ∣ x]

E [Y^{2} ∣ x]

a \overset{x}{^}_{t} = \overline{a} \overset{x}{^}_{t} = \int_{Y} (a_{11} \overset{x}{^}_{t} + a_{12} y) p (y ∣ \overset{x}{^}_{t}) d y = (a_{11} + a_{12} s_{21} s_{11}^{- 1}) \overset{x}{^}_{t} = \tilde{a} \overset{x}{^}_{t} + O (ϵ),

a \overset{x}{^}_{t} = \overline{a} \overset{x}{^}_{t} = \int_{Y} (a_{11} \overset{x}{^}_{t} + a_{12} y) p (y ∣ \overset{x}{^}_{t}) d y = (a_{11} + a_{12} s_{21} s_{11}^{- 1}) \overset{x}{^}_{t} = \tilde{a} \overset{x}{^}_{t} + O (ϵ),

d \overset{x}{^}_{t} = (a_{11} \overset{x}{^}_{t} + a_{12} E [Y ∣ \hat{x}_{t - m : t}]) d t + σ_{x} d W_{x} (t),

d \overset{x}{^}_{t} = (a_{11} \overset{x}{^}_{t} + a_{12} E [Y ∣ \hat{x}_{t - m : t}]) d t + σ_{x} d W_{x} (t),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Modeling of Missing Dynamical Systems: Deriving Parametric Models using a Nonparametric Framework

Shixiao W. Jiang

Department of Mathematics, the Pennsylvania State University, 109 McAllister Building, University Park, PA 16802-6400, USA

John Harlim Corresponding author: [email protected] Department of Mathematics, the Pennsylvania State University, 109 McAllister Building, University Park, PA 16802-6400, USA

Department of Meteorology and Atmospheric Science, the Pennsylvania State University, 503 Walker Building, University Park, PA 16802-5013, USA

Institute for Computational and Data Sciences, the Pennsylvania State University, 224B Computer Building, University Park, PA 16802, USA

Abstract

In this paper, we consider modeling missing dynamics with a nonparametric non-Markovian model, constructed using the theory of kernel embedding of conditional distributions on appropriate Reproducing Kernel Hilbert Spaces (RKHS), equipped with orthonormal basis functions. Depending on the choice of the basis functions, the resulting closure model from this nonparametric modeling formulation is in the form of parametric model. This suggests that the success of various parametric modeling approaches that were proposed in various domains of applications can be understood through the RKHS representations. When the missing dynamical terms evolve faster than the relevant observable of interest, the proposed approach is consistent with the effective dynamics derived from the classical averaging theory. In the linear Gaussian case without the time-scale gap, we will show that the proposed non-Markovian model with a very long memory yields an accurate estimation of the nontrivial autocovariance function for the relevant variable of the full dynamics. Supporting numerical results on instructive nonlinear dynamics show that the proposed approach is able to replicate high-dimensional missing dynamical terms on problems with and without the separation of temporal scales.

Keywords. Missing dynamical systems, closure model, nonparametric non-Markovian model, kernel embedding

Mathematics Subject Classification. 37M10,37M25,41A10,41A45,60J22

1 Introduction

One of the long-standing issues in modeling dynamical systems is model errors arising from incomplete understanding of the physics. The progress in tackling this problem goes under different names depending on the scientific fields. In applied mathematics and engineering sciences, some of these approaches are known as the reduced-order modeling, which ultimate goal is to derive an effective model with low computational complexity from the first principle, assuming that the full dynamics is known. They include the Mori-Zwanzig formalism [44, 53, 54] and its approximations [4, 5, 12, 17, 25]; the averaging/homogenization when there is an apparent scale separation between the relevant and irrelevant variables [10, 42, 43, 50]. In domain sciences, various methods for subgrid-scale parameterization were proposed to handle the same problem that arises in applications such as material science, molecular dynamics, climate dynamics, just to name a few. They include the Markov chain type modeling [7, 23]; stochastic parameterization [1, 7, 18, 29, 32, 34, 37, 51]; superparameterization in cloud modeling [13, 24, 36] and in combustion problems [20, 21]; Direct-Interaction Approximation (DIA) for parameterizing sub-grid scale processes in isotropic turbulence [26] and its extensions [9], for modeling non-Markovian memory in inhomogeneous turbulence over topography. We should point out that this list is incomplete and these approaches share some commonality despite being developed independently and having different implementation details. Namely, the key unifying theme in these aforementioned methods is the parametric modeling assumption with specific choices of class of functions/distributions and typically having finite number of parameters.

In this paper, we consider a nonparametric modeling framework to compensate for the missing dynamical components. One of the main goals in this paper is to show that parametric modeling approaches can be understood and systematically derived from a nonparametric framework as opposed to the empirical choices of parametric models. In our setup, suppose that the underlying full dynamics is an ergodic system of Itô diffusion with relevant components $x\in\mathcal{X}$ and irrelevant components $y\in\mathcal{Y}$ . The objective is to predict the evolution of $x\in\mathcal{X}$ and its statistics, given only the $x-$ component of the full dynamics,

[TABLE]

and a historical data set $\{x_{i}:=x(t_{i}),y_{i}=y(t_{i})\}_{i=1,\ldots,N}$ . In (1), $a$ and $b$ denote, respectively, the $x-$ component of the drift and diffusion terms that are known, while $c$ and $d$ denote, respectively, the $y-$ component of the drift and diffusion terms that are not known. Also, $W_{t}$ and $V_{t}$ denote the standard (uncorrelated) Wiener processes.

While the core of the problem is similar to that in the reduced-order modeling framework, the fact that we have no knowledge of the full dynamics prohibits us to derive an effective equation from the first principle as in the standard averaging theory or Mori-Zwanzig formalism. Motivated by the practical applications where the underlying dynamics are not fully understood, instead, we will use the available historical data to reconstruct the missing dynamical components. We should point out that the restriction of knowing historical measurement of the irrelevant component, $y_{i}\in\mathcal{Y}$ , can be relaxed in some cases. When $\{x_{i}\}_{i=1\ldots,N}$ is the only available measurement, one can use, for example, likelihood maximum estimate [27, 33] in the deterministic case or an adaptive Bayesian filtering [2] (when $b$ is constant and the training data is noisy) to extract the “identifiable” components of $y_{i}$ . By identifiable components, we refer to variables that depend on $y$ that appear in $a$ and $b$ , as we shall see in our numerical examples. Abusing the notation, we will denote $y$ as the identifiable components. We will clarify this notion in our numerical examples.

Given the pair of historical time series $\{x_{i},y_{i}\}_{i=1,\ldots,N}$ with time lag $\tau=t_{i+1}-t_{i}$ , let us define $\bm{z}_{t}:=(\bm{x}_{t-m:t},\bm{y}_{t-n:t-1})\in\mathcal{Z}$ with $\bm{x}_{t-m:t}:=(x_{t-m},x_{t-m+1},\ldots,x_{t})$ and $\bm{y}_{t-n:t-1}:=(y_{t-n},y_{t-m+1},\ldots,y_{t-1})$ for some integers $m,n\in\{-1,0,\ldots\}$ . When $m=-1$ , $\bm{z}_{t}$ has only $\bm{y}$ components (similarly for $n=0$ , $\bm{z}_{t}$ has only $\bm{x}$ components). We should point out that we have reserved the index $i$ for the training data and used a different index $t$ for an arbitrary prediction time with the same lag $\tau$ . Given these time series, our modeling approach is to approximate the conditional expectations,

[TABLE]

where the expectations are defined with respect to the equilibrium conditional density $p(y|\bm{z}_{t})$ of the random variable $Y|\bm{z}_{t}$ , a short hand for $Y|\bm{z}=\bm{z}_{t}$ . Here, $Y|\bm{z}$ is nothing but the stationary random variable $Y_{t}|\bm{z}_{t}$ . Throughout this paper, we will not use the notation $Y_{t}|\bm{z}_{t}$ to avoid a potential confusion with the non-stationary time-dependent distributions.

Given these conditional statistics, the closure model is given by

[TABLE]

where $\bm{\hat{z}}_{t}:=(\bm{\hat{x}}_{t-m:t},\bm{\hat{y}}_{t-n:t-1})$ . To proceed the forecast at the next time step- $(t+1)$ , one needs to update $\bm{\hat{z}}_{t+1}$ . This variable is obtained by concatenating the components from previous time steps, $(\bm{\hat{x}}_{t+1-m:t+1},\bm{\hat{y}}_{t+1-n:t-1})$ and $\hat{y}_{t}=\mathbb{E}[Y|\bm{\hat{z}}_{t}]$ that is estimated at time $t$ .

Notice that if the $x-$ component is slow and the missing $y-$ component is fast with a scale gap denoted by a small parameter $\epsilon$ , the closure model in (3) is identical to the effective dynamics deduced by the averaging theory [22, 28, 46] when the conditional expectation in (2) is defined with respect to the invariant density of the fast dynamics $\rho_{\infty}(y;\hat{x}_{t})$ for a fixed $\hat{x}_{t}$ , if such density exists. In this specific situation (fast-slow system), by setting $\bm{\hat{z}}_{t}=\hat{x}_{t}$ , that is $m=0,n=0$ , our framework effectively closes the dynamics by averaging over $p(y|\hat{x}_{t})=p_{eq}(\hat{x}_{t},y)/\int p_{eq}(\hat{x}_{t},y)dy$ , where $p_{eq}$ denotes the invariant density of the full dynamics. We will show that averaging over $p(y|\hat{x}_{t})$ is consistent with averaging over $\rho_{\infty}(y;\hat{x}_{t})$ up to order $\epsilon$ . In general case where there is no separation of scales, the choice of $m,n$ will be problem dependent. In this case, the predictive skill of certain statistics will depend on the specific choices of $\bm{z}_{t}$ . For example, in the linear Gaussian case without a time-scale gap, we will show the existence of a conditional density $p(y|\bm{z})$ which allows for Eq. (3) to accurately estimate one-point and two-point statistics of the $x$ -components of the full dynamics.

The main idea in this paper is to consider a nonparametric representation for $p(y|\bm{z})$ using the theory of kernel embedding of conditional distributions, which was introduced in the machine learning community [48, 49]. The kernel embedding of conditional distributions [48, 49] suggests that one can represent probability distribution as an element of a reproducing kernel Hilbert space (RKHS). In this paper, we will show that if $\mathcal{H}$ is an RKHS induced by an orthonormal basis $\{\phi_{k}:\mathcal{Y}\to\mathbb{R}\}$ of an appropriate $L^{2}-$ space, then any $p(\cdot|\bm{z})\in\mathcal{H}$ for any fixed $\bm{z}\in\mathcal{Z}$ can be represented as $p(\cdot|\bm{z})=\sum_{k=1}^{\infty}c_{k}(\bm{z})\phi_{k}(\cdot)$ , where the coefficients in $c_{k}$ will be pre-computed using the historical data set and the kernel embedding of conditional distributions formula. Here, the convergence of the series representation is in uniform sense. In this paper, we will consider parametric orthonormal basis functions such as the Hermite polynomials for low-dimensional $\mathcal{Z}$ as well as the proper orthogonal decomposition (POD) modes for high-dimensional $\mathcal{Z}$ . In the latter case, we shall see that the resulting closure model in (3) is a parametric model that is well-known, namely the linear non-autonomous autoregressive model. In general, the form of parametric closure models depends on the ansatz of $\phi_{k}$ as a function of $\bm{z}$ . We should point out that one can also leave it entirely nonparametric with the data-driven basis functions constructed by the diffusion maps algorithm as in [3, 19]. While this is theoretically sound, the construction of data-driven basis functions requires an elaborate computational effort and is limited to problems with intrinsically low-dimensional $\mathcal{Y}$ . In addition to constructing the basis, the main computational cost arises when evaluating the estimated basis functions on new points $\hat{z}_{t}$ for future-time prediction. Given these constraints, we will not explore the data-driven nonparametric basis in this paper.

The remaining of the paper is organized as follows. In Section 2, we briefly review the theory of kernel embedding of conditional distributions for estimating $p(y|\bm{z})$ using an orthonormal basis representation and discuss the proposed closure models in detail. In Section 3, we provide an intuition for choosing the density $p(y|\bm{z})$ by discussing missing dynamics in a linear Gaussian dynamics with and without temporal scale gaps. In Section 4, we numerically demonstrate the proposed approach on two nonlinear high-dimensional test problems, where $m,n$ are small in the first example and large in the second example. In Section 5, we conclude the paper with a brief summary and discussion. We supplement the paper with two Appendices: Appendix A supplements Section 2 with a more detailed derivation of the kernel embedding of conditional distributions; Appendix B shows the consistency of the proposed approach in estimating autocovariance functions in the linear Gaussian case without the time-scale gap.

2 A nonparametric formulation of modeling missing dynamics

In this section, we first give a brief review on the kernel embedding of conditional distributions introduced in [48, 49], formulated using an orthonormal basis of appropriate square-integrable function spaces as in [3, 19, 52]. Subsequently, we present the proposed nonparametric modeling approach for missing dynamics.

2.1 Kernel embedding of conditional distributions

Let $\mathcal{Y}$ be a compact set and define $K:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ to be a kernel, which means it is symmetric positive definite and let it be bounded. By Moore-Aronszajn theorem, there exists a unique Hilbert space $\mathcal{H}=\overline{\mbox{span}\{K(y,\cdot),\forall y\in\mathcal{Y}\}}$ . Let $q:\mathcal{Y}\to\mathbb{R}$ be a positive weight function and $\{\psi_{k}q\}_{k\geq 1}$ be a set of eigenfunctions corresponding to eigenvalues $\{\lambda_{k}\}$ of the following integral operator,

[TABLE]

We should note that $\{\psi_{k}q\}$ forms an orthonormal basis of $L^{2}(\mathcal{Y},q^{-1})$ and by Mercer’s theorem, the kernel $K$ has the following representation,

[TABLE]

We should point out that if $\mathcal{Y}$ is not a compact domain such as $\mathbb{R}^{n}$ , with an exponentially decaying $q$ , one can construct a bounded Mercer-type kernel as in (5) with an appropriate choice of decreasing sequence $\{\lambda_{k}\}$ (see Lemma 3.2 in [52]) and it is a reproducing kernel corresponding to the RKHS $\mathcal{H}$ (see Proposition 3.4 in [52]). This result provides a justification for the use of Hermite polynomials $\{\psi_{k}\}$ with Gaussian weight $q$ in one of our numerical examples.

We should point out that the RKHS $\mathcal{H}$ induced by the Mercer-type kernel in (5) is a subspace of $L^{2}(\mathcal{Y},q^{-1})$ with the reproducing property corresponding to the inner product defined as $\langle f,g\rangle_{\mathcal{H}}=\sum_{k=1}^{\infty}\frac{f_{k}g_{k}}{\lambda_{k}}$ , where $f_{k}=\langle f,\psi_{k}q\rangle_{L^{2}(\mathcal{Y},q^{-1})}$ and $g_{k}=\langle g,\psi_{k}q\rangle_{L^{2}(\mathcal{Y},q^{-1})}$ . Then for any $f\in\mathcal{H}$ and $y\in\mathcal{Y}$ , we can represent,

[TABLE]

with the orthonormal basis of $L^{2}(\mathcal{Y},q^{-1})$ where the convergence of the series holds uniformly (or in $C_{0}(\mathbb{R}^{n})$ for non-compact $\mathcal{Y}=\mathbb{R}^{n}$ ).

Let ${Y}$ and ${Z}$ be random variables on $\mathcal{Y}$ and $\mathcal{Z}$ , respectively, with distribution $P(Y,Z)$ . Assuming that the conditional density $p(\cdot|\bm{z})\in\mathcal{H}$ for any fixed $\bm{z}\in\mathcal{Z}$ , we have the representation for $p(\cdot|\bm{z})$ in the RKHS $\mathcal{H}$ of real-valued functions on $\mathcal{Y}$ :

[TABLE]

where the convergence of the series is in the uniform sense and the coefficients are to be determined. The theory of kernel embedding of conditional distributions [48, 49], implemented also in [3, 19], suggests that the coefficients can be expressed as,

[TABLE]

where $\{\varphi_{l}\}_{l\geq 1}$ forms an orthonormal basis of $L^{2}(\mathcal{Z},\hat{q})$ and

[TABLE]

See the detailed derivation of (7) in Appendix A. Substituting (7) to (6), we obtain,

[TABLE]

Notice that this representation can be understood as a linear regression in infinite-dimensional spaces with respect to the basis functions $\psi_{k}q$ and $\varphi_{l}$ . Connecting to the notation in the introduction, $c_{k}(\bm{z})=\sum_{l=1}^{\infty}\left[\bm{C}_{{Y}{Z}}\bm{C}_{{ZZ}}^{-1}\right]_{kl}\varphi_{l}\left(\bm{z}\right)$ and $\phi_{k}=\psi_{k}q$ . The representation in (8) is nonparametric in the sense that we do not assume any particular distribution for the density.

Given pairs of data $\{y_{i},\bm{z}_{i}\}_{i=1,\ldots,N}$ , where $\bm{z}_{i}:=(\bm{x}_{i-m:i},\bm{y}_{i-n:i-1})\in\mathcal{Z}$ , distributed according to $P(Y,Z)$ , we can estimate these coefficients via Monte-Carlo averages:

[TABLE]

We should point out that if the weight $\hat{q}$ in $L^{2}(\mathcal{Z},\hat{q})$ is the sampling density of the data in $\mathcal{Z}$ , since $\{\varphi_{s}\}$ is orthonormal under the corresponding inner product, then $\bm{C}_{ZZ}$ is an identity matrix. While a representation on this Hilbert space is desirable, finding the corresponding orthonormal basis for high-dimensional $\mathcal{Z}$ is computationally challenging. In addition to constructing the basis, the main computational cost arises when evaluating the estimated basis functions on new points $\hat{z}_{t}$ for future-time prediction as shown in the next section. To avoid these expensive computations, we will adopt simpler basis functions, namely the Hermite polynomial basis for low-dimensional $\mathcal{Z}$ and the proper orthogonal decomposition (POD) basis for high-dimensional $\mathcal{Z}$ .

2.2 Modeling the missing dynamics

Given the pre-computed conditional density in (8), the closure modeling approach proposed in (3) requires estimating the following statistical quantities,

[TABLE]

In the discussion below, we will just focus on the expectation of $a$ (the calculation of the expectation of $bb^{\top}$ will be similar). In our formulation, we set the weight $q$ in the Hilbert space $L^{2}(\mathcal{Y},q^{-1})$ to be the sampling density of the data in $\mathcal{Y}$ . In particular, substituting (8) into (10), we obtain,

[TABLE]

where

[TABLE]

can be pre-computed. In this derivation, the second line is due to the Monte-Carlo average using data $y_{i}\sim q$ , the fourth line above used (9), and the last line is due to the truncation in the summation of the index $-k$ up to order $N$ , and the fact that,

[TABLE]

whenever $\{\psi_{k}\}$ is orthonormal in $L^{2}(\mathcal{Y},q)$ , where the weight $q$ is exactly the sampling density of $\{y_{i}\}$ . To see (13), define an $N\times N$ matrix with components $A_{ij}=\psi_{j}(y_{i})$ , then the orthonormality condition means that $A^{\top}A=I$ , where $I$ denotes an $N\times N$ identity matrix. Thus, (13) is the $(i,j)$ th component of $AA^{\top}=I$ . Since the resulting coefficients in (12) are independent to $\psi_{k}(y)$ , in practice, we only need to choose the basis $\varphi_{l}(\bm{z})$ .

Notice that the resulting representation in (11) arising from the proposed nonparametric formulation in (8) is a parametric model when the summation term is truncated, where the parametric ansatz is determined by how $\varphi_{l}$ depends on $\bm{z}$ . For example, when $\mathcal{Z}$ is low-dimensional, we will consider Hermite polynomial basis functions for $\{\varphi_{l}\left(\bm{z}\right)\}_{l=1,\ldots,L}$ in a numerical example in Section 4.1. In this case, the resulting parametric model is a polynomial of degree $-L$ and the coefficients in $A_{l}(\hat{x})$ are directly estimated via the kernel embedding formula.

We should point out that when we use the Hermite polynomial basis, we set the weight $\hat{q}$ to be Gaussian with mean and covariance determined empirically from the training data $\{\bm{z}_{i}\}_{i=1,\ldots,N}$ . In our numerics, we also employ a regularization $\left(\bm{C}_{{ZZ}}+\lambda\bm{I}\right)^{-1}$ replacing $\bm{C}_{{ZZ}}^{-1}$ in (12), with a small parameter $\lambda$ to compensate for the conditional density that is not in $\mathcal{H}$ (as suggested in [48, 49]). Basically, this regularization is the penalty of not building the appropriate RKHSs that respect the sampling distribution and geometry of the data.

For high-dimensional $\mathcal{Z}$ , we will consider using the proper orthogonal decomposition (POD) as a basis for $\varphi_{l}\left(\bm{z}\right)$ . Conceptually, this choice of basis corresponds to using an empirical covariance as the kernel in (4) (see e.g., Chapter 5 of [15] for more detailed discussion). Computationally, define a matrix $\bm{Z}\in\mathbb{R}^{N\times n_{z}}$ , where the $i$ th row consists of the training data, $\bm{z}_{i}-\bar{\bm{z}}\in\mathbb{R}^{n_{z}}$ , centered about its empirical mean, $\bar{\bm{z}}=\frac{1}{N}\sum_{i=1}^{N}\bm{z}_{i}$ , such that its row sum is zero. In this case, the function value $\varphi_{j}(\bm{z}_{i})$ will be determined by the $(i,j)$ th component of the orthonormal matrix $\bm{U}$ defined as,

[TABLE]

where $\bm{Z}=\bm{U\Sigma V}^{\top}$ is the singular value decomposition (SVD). These basis functions are called the Proper Orthogonal Decomposition (POD) modes or a discrete version of the Karhunen-Loève basis expansion (see e.g., Chapter 5 of [15]).

From the orthonormality of $\bm{U}$ , we have $\bm{C}_{{ZZ}}=\bm{I}/N$ such that Eq. (11) can be further simplified to,

[TABLE]

where we used $L$ basis functions. Suppose that $a(\hat{x},y)=y$ , then Eq. (15) can be equivalently rewritten in a matrix form as,

[TABLE]

where the matrix $\bm{Y}=[y_{1},\ldots,y_{N}]^{\top}$ is $N\times n_{y}$ with $\left\{y_{i}\right\}_{i=1}^{N}$ denoting the training data with dimension $n_{y}$ . Here, the matrix $\bm{U}_{\text{new}}:=(\bm{\hat{z}}_{t}-\bm{\bar{z}})\bm{V\Sigma}^{-1}\in\mathbb{R}^{1\times L}$ is the Nyström extension for SVD [45], whose components approximate the basis function values at a new point $\bm{\hat{z}}_{t}$ , that is, $\bm{U}_{\text{new}}\approx\left[\varphi_{1}\left(\hat{\bm{z}}_{t}\right),\ldots,\varphi_{L}\left(\hat{\bm{z}}_{t}\right)\right]$ . Substituting Eq. (14) into the conditional expectation (16), we obtain

[TABLE]

The formula in (17) is exactly a linear regression between observations $\left\{y_{i}\right\}_{i=1}^{N}$ and $\left\{\bm{z}_{i}\right\}_{i=1}^{N}$ . This means that the nonparametric RKHS representation reduces to the parametric linear regression when POD bases are used to represent functions defined on the $\mathcal{Z}$ space. In the case where $\bm{z}_{i}:=(\bm{x}_{i-m:i},\bm{y}_{i-n:i-1})\in\mathcal{Z}$ , $n_{z}=m+n+1$ , the resulting closure model in (17) is nothing but a linear autoregressive model for variable $x$ with a linear non-autonomous variable $y$ .

While the POD representation is convenient for high dimensional problems, we should point out these basis functions may not be adequate for systems with nonlinear and/or non-Gaussian nature. In fact, we will show in Section 4.2 that the POD basis representation is not sufficient to recover the missing terms in a nonlinear system even when the invariant density is close to Gaussian. In this case, we will find that an additional Gaussian white noise term can be used to compensate for the residual space (orthogonal to POD).

3 A linear Gaussian example

In this section, we provide an intuitive argument for the choice of conditional density function $p(y_{t}|\bm{z}_{t})$ in compensating the missing dynamical terms as proposed in (3). Specifically, we will build our intuition for choosing variables $\bm{z}_{t}$ by studying the missing dynamics of an analytically tractable linear Gaussian problem with and without temporal scale gaps. That is, we consider a linear multi-scale dynamical model,

[TABLE]

for a slow variable $x\in\mathbb{R}$ and a fast variable $y\in\mathbb{R}$ [11]. Here, $W_{x}$ and $W_{y}$ are independent Wiener processes. The parameters $\sigma_{x},\sigma_{y}\neq 0$ and the eigenvalues of the matrix

[TABLE]

are strictly negative, to assure the existence of a unique invariant joint density $p_{eq}\left(x,y\right)$ . The parameter $\epsilon>0$ characterizes the time-scale separation between variables $x$ and $y$ . Moreover, we assume the coefficient

[TABLE]

to assure that the leading-order slow dynamics supports an invariant measure $\hat{\rho}_{eq}(x)$ .

When there is a time-scale gap, in the limit of $\epsilon\to 0$ , the leading-order dynamics,

[TABLE]

with $a=\widetilde{a}$ as defined in (20), can be obtained by averaging the slow component of the vector field, $(a_{11}x+a_{12}y)$ , with respect to the invariant density $\rho_{\infty}(y;\hat{x}_{t})$ of the fast dynamics in (19) for a fixed $\hat{x}_{t}:=\hat{x}(t\tau)$ . For this simple example, it is clear that $\rho_{\infty}(y;x)=\mathcal{N}(-a_{21}a_{22}^{-1}x,-.5\sigma_{y}^{2}a_{22}^{-1})$ . The effective equation in (21) is deduced using the averaging theory [22, 28, 46], which approximates the density of the full dynamics as,

[TABLE]

where $\hat{\rho}(x,t)$ denotes the evolution density corresponding to the leading-order dynamics. First, we should point out that when the fast dynamics for $y$ in (19) is not available, we have no information about the invariant density $\rho_{\infty}(y;x)$ and we also cannot generate samples of this density. Thus, $\tilde{a}$ is not computable since $a_{21}$ and $a_{22}$ are unknown.

Our proposed model in (3) for the closure is motivated by the following observation. Here, we first provide the theoretical validity of our closure model. Taking $t\to\infty$ in (22), the invariant density of the full dynamics can be approximated by that of the leading-order dynamics up to order- $\epsilon$ , that is, $p_{eq}(x,y)=\hat{\rho}_{eq}(x)\rho_{\infty}(y;x)+\mathcal{O}(\epsilon)$ . Therefore,

[TABLE]

This equation basically suggests that one can approximate $\rho_{\infty}(y;x)$ with the following conditional density $p(y|x)$ . For this linear Gaussian example, one can solve the Lyapunov equation of the full system in (18)-(19) for the equilibrium covariance matrix $S=(s_{ij})_{i,j=1,2}$ and deduce that $p(y|x)=\mathcal{N}(s_{21}s_{11}^{-1}x,s_{22}-s_{21}s_{11}^{-1}s_{12})$ . Expanding the mean and variance statistics in terms of $\epsilon$ , we obtain

[TABLE]

which means that the order- $\epsilon$ expansion error in (23) is in the sense of the mean and variance.

Averaging the slow Eq. (18) with respect to this conditional density, $p(y|\hat{x}_{t})$ , we obtain a closure model of the form (21) with

[TABLE]

which means that the proposed closure obtained by averaging over $p(y_{t}|x_{t})$ is consistent (up to an order- $\epsilon$ error) with the reduced model obtained from the classical averaging theory. However, in general, such an analytical expression in (24) will not be available since we have no access to $s_{21}$ and $s_{11}$ . Numerically, we will approximate the conditional density, $p(y|x)$ , by applying the kernel embedding of the conditional distributions discussed in the previous section on the training data set $\{x_{i},y_{i}\}_{i=1}^{N}$ . In this case, it is clear that $\bm{z}_{t}=x_{t}$ is the natural choice. In the remainder of this section, we will refer to this closure model as the “RKHS $p(y_{t}|x_{t})$ ".

Now we turn to the discussion of our closure model for large $\epsilon$ . When there is no time-scale gap, i.e., $\epsilon=\mathcal{O}(1)$ is large, the approximation via the averaging theory is not valid, and thus, averaging over $p(y_{t}|x_{t})$ will not work. In this case, let us consider $\bm{z}_{t}=\bm{{x}}_{t-m:t}$ such that our closure model is an average over a non-Markovian conditional density function $p(y_{t}|\bm{{x}}_{t-m:t})$ . That is,

[TABLE]

where the conditional average is evaluated at a new data point $\bm{\hat{x}}_{t-m:t}:=\Big{(}\hat{x}((t-m)\tau),\hat{x}((t-m+1)\tau),\ldots,\hat{x}(t\tau)\Big{)}$ for the time lag interval $\tau>0$ , resulting from the integration of (25) at previous time steps. Since the random variables $Y$ of $y_{t}$ and $X$ of $\bm{x}_{t-m:t}$ are both Gaussian with mean zero and covariance,

[TABLE]

we can deduce that

[TABLE]

When the covariance components $\Sigma_{12}$ and $\Sigma_{22}$ are empirically estimated from the training data, notice that (27) is identical to the conditional expectation with respect to the kernel embedding of the conditional distributions formulated using the POD basis in (17). More importantly, one can analytically show that the autocovariance function (ACV) of the proposed non-Markovian model in (25) with $m\to\infty$ agrees with the ACV of the $x-$ component of the full model (see Appendix B for the detailed proof of this statement). The consistency of the ACV prediction as well as the closure in (27) with the RKHS formulation in (17) justifies the choice of $\bm{z}_{t}=\bm{{x}}_{t-m:t}$ when $\epsilon$ is large. In the numerics below, we will verify the robustness of the non-Markovian closure model resulted from this choice of $\bm{z}_{t}$ in terms of the short-time prediction skill and the long-time statistics of ACVs for any $\epsilon>0$ .

In Figure 1, we compare our proposed closure model in (27) , which we will refer to as “RKHS $p(y_{t}|\bm{x}_{t-m:t})$ ”, with the standard averaging model in (21) with $a=\widetilde{a}$ and the RKHS $p(y_{t}|x_{t})$ as well. In this numerical simulation, we build the closure models using the simulated data at discrete time step $\tau=0.01$ . When $\epsilon$ is small, one can observe the pathwise convergence of the solutions of the closure models to those of the full model (18)-(19) [Fig. 1(a)]. For small $\epsilon=0.01$ , the ACVs of all the closure models are in good agreement with the ACV of the full model [Fig. 1(b)]. These results agree with the invariant manifold theory for small $\epsilon$ [47]. However, when $\epsilon$ is large, the short-time predictions and the long-time ACVs become quite different among the three closure models [Figs. 1(c) and (d)]. In term of short-time predictions, the closure model (25) with $m=500$ memory terms provides a slightly better RMSE than the other two closure models [Fig. 1(c)]. In term of long-time statistics, only the closure model (25) with long memory terms produces an accurate approximation of the ACV, whereas the other two closure models do not [Fig. 1(d)]. This consistency of ACVs can be verified explicitly as we mentioned before (see Appendix B).

The analysis over this simple example shows that the proposed modeling framework using the kernel embedding of the conditional density formulation provides accurate short-time predictions and consistent long-term statistical recoveries in the limit of the memory length $m\to\infty$ . This consistency is robust whether the underlying full system has or does not have any temporal scale gap. Using this result as a guideline, a natural extension for compensating missing components in nonlinear systems is to consider $\bm{z}_{t}:=(\bm{x}_{t-m:t},\bm{y}_{t-n:t-1})$ , that allows for the missing dynamical components to also depend on the history of $y$ in addition to that of $x$ . In practice, the key parameters which will be determined case-by-case are the memory length, $m$ and $n$ , as we shall see in the nonlinear examples in the next section.

4 Nonlinear examples

In this section, we study the short-time prediction and long-time statistical properties of two nonlinear examples: the Lorenz-96 (L96) model [31] possessing a short memory effect and the truncated Burgers-Hopf (TBH) model [39, 35, 38, 41] possessing a long memory effect.

4.1 Two-layer Lorenz-96 model

Consider the two-layer Lorenz-96 (L96) model [31],

[TABLE]

for $k=1,\ldots,K,$ and $j=1,\ldots,J$ , where each relevant variable $X^{k}$ is coupled to $J$ irrelevant variables $Y^{j,k}$ , and

[TABLE]

The indices of the variables $X^{k}$ and $Y^{j,k}$ are cyclic, $X^{k}=X^{k+K},Y^{j,k}=Y^{j,k+K},Y^{j+J,k}=Y^{j,k+1}$ . The parameters are taken to be $K=18$ , $J=20$ , $F=10$ , $h_{x}=-1$ , and $h_{y}=1$ [7]. The parameter $\varepsilon$ characterizes the time scale separation between the relevant component $X^{k}$ and the irrelevant component $Y^{j,k}$ . In this example, we will show the results for a small $\varepsilon=1/128$ and a large $\varepsilon=0.5$ (the large $\varepsilon=0.5$ regime was studied in [7, 8, 29, 34]). We integrate the full L96 model using a 4th-order Runge-Kutta method for $10^{3}$ time units with a time step $\delta t=0.001$ . We observe the trajectories of the variables $(X^{k},B^{k})$ every 10 time steps, so that the observation time step is $\tau=0.01$ and the dataset contains $N=10^{5}$ observation points.

In the following numerical simulations, we compare our proposed closure RKHS models with the deterministic parametric formulation suggested by Wilk’s method [51]. In particular, the Wilk’s deterministic parameterization scheme is a closure model obtained by fitting the data $\{(X^{k}_{i},B^{k}_{i})\}_{i=1}^{N}$ with the following polynomial,

[TABLE]

We should point out that if we are restricted to only observing $\{X^{k}_{i}\}$ , then $\{B^{k}_{i}\}$ are the identifiable components that can be extracted, for example, using a likelihood maximum estimate [27, 33, 51] or an adaptive Bayesian filtering [2], as we pointed out in the introduction. The key point is that we cannot extract the detailed components $Y^{j,k}$ if the fast dynamical components in (28) are unknown and, in fact, we are not interested in constructing a closure model by averaging over conditional density that depends directly on $Y^{j,k}$ since this can be very expensive. Instead, we will consider a closure model based on averaging over the conditional density $p\left(B_{t}^{k}|X_{t}^{k}\right)$ for small $\epsilon$ , where $B_{t}^{k}:=B^{k}\left(t\tau\right)$ and $X_{t}^{k}:=X^{k}\left(t\tau\right)$ . For the large $\varepsilon$ regime, we will consider $p\left(B_{t}^{k}|X_{t}^{k},B_{t-1}^{k}\right)$ . While conditioning to other variables (e.g., spatial neighbors of $X^{k}$ or $B^{k}$ or longer temporal history) can be considered, we do not find any meaningful improvement over the results that are presented below. These densities will be constructed using the kernel embedding formulation discussed in Section 2 for each $k$ ; connecting to the notation in the previous section, $y_{t}:=B^{k}_{t}$ and $\bm{z}_{t}$ is either $X_{t}^{k}$ or $(X_{t}^{k},B_{t-1}^{k})$ . To clarify, the full problem in (30) is $K+KJ=18+18\times 20=378$ dimension, and the closure model for the missing components, $\{Y^{j,k}\}_{k=1,\ldots,K,j=1,\ldots,J}$ , is defined through a set of either one-dimensional or two-dimensional conditional densities; i.e., for each $k$ , the density takes either $X_{t}^{k}$ or $(X_{t}^{k},B_{t-1}^{k})$ as inputs. Since these densities are low-dimensional functions with respect to the conditional variables (either $X_{t}^{k}$ or $(X_{t}^{k},B_{t-1}^{k})$ ), we will represent the kernel embedding formula in (8) using the Hermite polynomials, expansion truncated at order $L=50$ for each memory term.

To validate the proposed approach, we compare the short-time predictions and long-time statistics of the $X^{k}$ -components between the full model and the closure models. Particularly, we compare several standard long-time statistical quantities as in [7, 33]:

•

The probability density function (PDF) for $X^{k}$ .

•

The autocorrelation function (ACF) for $X^{k}$ , $\left\langle X_{t}^{k}X_{0}^{k}\right\rangle/\left\langle X_{0}^{k}X_{0}^{k}\right\rangle$ , where $\langle\cdot\rangle$ denotes the temporal average over $N=10^{5}$ data points.

•

The cross-correlation function (CCF) between $X^{k}$ and $X^{k+1}$ , $\left\langle X_{t}^{k}X_{0}^{k+1}\right\rangle/\left\langle X_{0}^{k}X_{0}^{k}\right\rangle$ .

•

The mean wave amplitude $\left\langle\left|u^{m}\right|\right\rangle$ , for $m=0,\ldots,K/2$ , where $u^{m}$ is the Fourier transform of $X^{k}$ .

•

The wave variance $\left\langle\left|u^{m}-\left\langle u^{m}\right\rangle\right|^{2}\right\rangle$ .

For the PDFs, ACFs, and CCFs, we plot the average over all $k=1,\ldots,K$ . For small $\varepsilon=1/128$ , we only show the results for the PDFs and ACFs. To assess the short-time prediction skill, we calculate the root-mean-square error (RMSE) and the anomaly correlation (ANCR), where the RMSE measures the difference between the true trajectory and the forecast trajectory whereas the ANCR measures the correlation between them [7]. The definitions of RMSE and ANCR are the same as those in [7]. We take the average using the data from $1000$ different ensembles, each starting from a different initial state over five time units.

We first report the small $\varepsilon=1/128$ regime of the L96 model. Figure 2(a) displays the scatter plot of $B_{t}^{k}$ vs. $X_{t}^{k}$ for the full L96 model, the polynomial fit (30) for the deterministic parametrization of $B_{k}$ (Wilks’s method), and the expectation $\mathbb{E}\left[B_{t}^{k}|X_{t}^{k}\right]$ using the RKHS representation (method referred to as the RKHS $p\left(B_{t}^{k}|X_{t}^{k}\right)$ ). For long-time statistics, one can see from Figs. 2(b) and 2(c) that the PDFs and ACFs for $X^{k}$ can be well reproduced by both closure models. For short-time predictions, one can see from Fig. 2(d) that the RKHS $p\left(B_{t}^{k}|X_{t}^{k}\right)$ provides a better approximation of the trajectory compare to the Wilks’s deterministic parametrization scheme. These results can be expected due to the validity of the classical averaging theory on dynamical systems with time-scale separation (small $\varepsilon$ regime) [47].

We now report the L96 model for the large $\varepsilon=0.5$ regime in which there is no significant time-scale separation between the relevant, $X^{k}$ , and irrelevant variables, $B^{k}$ . By comparing Fig. 2(a) and 3(a), one can see that the patterns of the scatter plots differ substantially between the small and large $\varepsilon$ regimes. Specifically, the scatter plot for the large $\varepsilon$ regime is much broader in $B^{k}$ direction compare to that for the small $\varepsilon$ regime. This indicates that when $\varepsilon$ is small, the irrelevant (fast) variable significantly relies on the relevant (slow) variable. When $\varepsilon$ becomes large, such dependence of irrelevant variable $B^{k}$ on the relevant variable $X^{k}$ reduces.

For large $\varepsilon=0.5$ , one can observe from Fig. 3(a) that the RKHS representation of $\mathbb{E}\left[B_{t}^{k}|X_{t}^{k},B_{t-1}^{k}\right]$ can nearly reproduce the scatter plot of the full model, whereas the Wilks’s deterministic parametrization scheme and the RKHS representation $\mathbb{E}\left[B_{t}^{k}|X_{t}^{k}\right]$ cannot. The PDFs for $X^{k}$ of the full model can be reproduced by all the closure models [Fig. 3(b)]. For the other long-time statistics, ACFs, CCFs, mean wave amplitudes, and wave variances can be well reproduced only by the closure model using the conditional density $p\left(B_{t}^{k}|X_{t}^{k},B_{t-1}^{k}\right)$ [Figs. 3(c)(d)(e)(f)]. Notice also the significant improvement in terms of short-time predictions using the RKHS $p\left(B_{t}^{k}|X_{t}^{k},B_{t-1}^{k}\right)$ (smaller RMSE and higher ANCR) over the Wilks’s method and the RKHS $p\left(B_{t}^{k}|X_{t}^{k}\right)$ as shown in Fig. 4.

To determine the reliability of the ensemble forecasts, we also calculate the rank histograms from an ensemble of integrations [14]. A rank histogram is obtained by repeatedly tallying the rank of the true observation relative to the sorted $N_{\mathrm{ens}}$ -member ensemble [14]. We use the same method as in [7]. For every initial state $X_{t_{0}}^{k}$ , we do $N_{\mathrm{ens}}$ integrations of the closure models over the lead $T$ time units starting from the $X_{t_{0}}^{k}$ plus a small random perturbation. The random perturbations are Gaussian distribution with mean zeros and standard deviation $0.15$ . We sort the $N_{\mathrm{ens}}+1$ values for $X_{t}^{k}$ for each grid point $k$ and time $t$ from the ensemble members and the full L96 model. Figure 5 displays the rank histograms for all the closure models with $N_{\mathrm{ens}}=9$ at lead time $T=2$ . An ideal rank histogram is flat. One can see that the rank histogram by the RKHS $p\left(B_{t}^{k}|X_{t}^{k},B_{t-1}^{k}\right)$ is close to be flat, whereas rank histograms by Wilks’s deterministic parametrization scheme and the RKHS $p\left(B_{t}^{k}|X_{t}^{k}\right)$ exhibit U-shape distributions. Therefore, the closure model with $p\left(B_{t}^{k}|X_{t}^{k},B_{t-1}^{k}\right)$ performs better than the other two closure models.

4.2 The truncated Burgers-Hopf (TBH) model

Consider the truncated Burgers-Hopf (TBH) model [39, 35, 38, 41], which is described by a system of quadratic nonlinear equations for the complex Fourier modes, $u^{k}$ , with $u^{-k}=(u^{k})^{\ast}$ for $1\leq\left|k\right|\leq\Lambda$ ,

[TABLE]

This model is a Galerkin truncation of the inviscid Burgers equation on Fourier modes and we should point out that the dynamics of the truncated system is totally different from the inviscid Burgers equation. Particularly, the TBH exhibits intrinsic stochastic dynamics with ergodic behavior in a large deterministic system [39, 35, 38, 41]. We are interested in estimating the TBH model’s first Fourier mode given only the dynamical component of this mode,

[TABLE]

where $u_{2}$ denotes the second Fourier mode and $F$ denotes the forcing component obtained by subtracting $-\mathrm{i}(u^{1})^{\ast}u^{2}$ from the right hand side of Eq. (31), that is,

[TABLE]

While $u^{2}$ and $F$ may be identifiable from observing $u^{1}$ alone, in our experiment below, we assume that we are given the data set of $\{u^{1}_{i},u^{2}_{i},F_{i}\}_{i=1}^{N}$ . We should point out that this model has an equipartition energy, that is, all of the Fourier modes in TBH have the same variances, and the first Fourier mode (which is of our interest) possesses the longest autocorrelation time and the largest statistical memory [40], which makes this example a tough test problem.

To compensate for the missing dynamics in (32), we substitute the irrelevant variables $u^{2}$ and $F$ with their conditional expectations. In this case, the closure model involves $p(y|\bm{x}_{t-m:t-1},\bm{y}_{t-n:t-1})$ , where the irrelevant variable $y$ is one of $\{u^{2,{\rm Re}},u^{2,{\rm Im}},F^{{\rm Re}},F^{{\rm Im}}\}$ , and the relevant variable $\bm{x}$ is one of $\{u^{1,{\rm Re}},u^{1,{\rm Im}}\}$ such that $\bm{y}$ and $\bm{x}$ are both real or both imaginary parts. In particular, we employ the RKHS formulation to construct four conditional densities: $p(u^{2,{\rm Re}}_{t}|\bm{u}^{1,{\rm Re}}_{t-m:t-1},\bm{u}^{2,{\rm Re}}_{t-n:t-1})$ , $p(u^{2,{\rm Im}}_{t}|\bm{u}^{1,{\rm Im}}_{t-m:t-1},\bm{u}^{2,{\rm Im}}_{t-n:t-1})$ , $p(F^{{\rm Re}}_{t}|\bm{u}^{1,{\rm Re}}_{t-m:t-1},\bm{F}^{{\rm Re}}_{t-n:t-1})$ , and $p(F^{{\rm Im}}_{t}|\bm{u}^{1,{\rm Im}}_{t-m:t-1},\bm{F}^{{\rm Im}}_{t-n:t-1})$ . For the forcing $F$ , an additional Gaussian noise term is added to compensate for the residual space. Since the conditional states are high-dimensional (when $m,n$ are large), the conditional expectations over these densities are represented using the POD bases as in (17).

To conduct this numerical experiment, the training dataset is generated from the full TBH model (31), where $F$ is calculated by Eq. (33). We integrate the full TBH model for $10^{4}$ time units with time step $\Delta t=10^{-3}$ . We store the data at every $0.01$ time unit and thereafter the dataset contains $10^{6}$ points for all $u^{k}$ . We compare the results generated by the full TBH model (31) and the closure models, resulted by averaging the partial dynamics in (32) over the pre-trained conditional densities. In this example, we consider the full TBH model (31) in a high-energy regime with $\beta=10$ and $\Lambda=50$ as in [41]. Here, $\Lambda$ denotes number of modes in Eq. (31) and $\beta=\Lambda/\overline{E}$ with $\overline{E}$ being the mean energy per mode. Here, the full dynamics in (31) has 50-dimensional complex variables and the four conditional densities (in previous paragraph) are proposed as the closure model for the dynamics of the missing components, $\{u^{2},\ldots,u^{\Lambda}\}$ . Each of the four conditional densities above is a real-valued function that takes $m+n$ dimensional variables. In our numerical experiments, we will take $m+n\geq 20$ .

We compare three closure models of (32) with different memory terms $m$ and $n$ and different temporal steps $\tau=0.01$ or $0.1$ . Figure 6 displays long-time statistics and short-time predictions for these closure models. One can see from Figs. 6(a)(b)(c)(d) that the long-time statistics can be well reproduced by the proposed closure models of (32) when the irrelevant variables have long memory terms, that is, $n$ is large enough. In terms of short-time predictions, all three closure models exhibit comparable results for RMSEs and ANCRs where the errors saturate at about the time when the autocovariance function diminishes [Figs. 6(e) and 6(f)]. The fact that the two choices of $m,n,\tau$ ( $m=n=20,\tau=0.1$ and $m=n=200,\tau=0.01$ ), corresponding to the two models with the same memory length $n\tau=2$ time units, produce comparable results (see the red dash-dotted and black dashed curves in Fig. 6) suggests that the temporal step $\tau$ does not affect the inference. Thus it is more economical to use the model with smaller $n$ (and possibly coarser time lag $\tau$ ) that gives the same accuracy. Finally, we should also point out that if the memory length $n\tau$ (in unit time) is small, the estimates become less accurate. Therefore, for this difficult test problem involving observations of the first Fourier mode of the TBH model, the proposed closure model can replicate the long-time statistics accurately and produce reasonable short-time prediction skills when there are long enough memory terms.

5 Summary and discussion

In this paper, we considered a data-driven nonparametric model for capturing the missing dynamics in the context of systems of ergodic SDE’s and ODE’s. The non-Markovian closure model is formulated as an averaging over an equilibrium conditional density function, $p(y_{t}|\bm{x}_{t-m:t},\bm{y}_{t-n:t-1})$ , that is approximated using the kernel embedding of conditional distribution formulation. In particular, we considered a representation of the conditional density on RKHS induced by an orthonormal basis of appropriate weighted Hilbert space. A thorough investigation of the modeling framework on a linear Gaussian problem shows the consistency with the classical averaging theory for fast-slow systems and justifies our use of long non-Markovian memory terms to obtain accurate two-point statistical predictions in the case of no temporal scale separation. Numerical simulations on nonlinear problems demonstrate the robustness of the framework in producing accurate short-term predictions as well as to recover two-point statistics even when the missing terms are high-dimensional and have no separation of scales.

Modeling of missing dynamics with parametric models (or closure) has a long history as we noted in the introduction. Practically, such modeling paradigm requires modeler to choose the parametric model (ansatz) and fit the proposed model to the data to estimate the parameters in the ansatz. The choice of the parametric model is typically problem specific. When the underlying full system is known, one can deduce the model from the first principle. For example, one can apply the Mori-Zwanzig formalism to deduce such parametric model (see e.g., [4, 5, 17, 25]) and then use various mathematical tools to estimate the memory integral terms as well as the parameters in the reduced model, which remains challenging if the resulting closure model is nonlinear or contains high-dimensional parameters. For example, when a rational approximation is used as a model for the memory kernel [30], while the parameters can be identified from derivatives of the kernel, it requires the availability of highly accurate time series (in the sense of accurate several order of derivatives) which is rare in practice.

In this paper, the proposed nonparametric formulation discovered some of the well-known parametric models, including the non-autonomous autoregressive linear models. An important feature of the proposed nonparametric framework in this paper is that it translates the problem of choosing parametric model into choosing the memory length $m,n$ and constructing orthonormal basis of a weighted Hilbert space of functions that take values on $\bm{z}\in\mathcal{Z}$ . For the memory length, our experience suggests that we can use the decaying time scale of processes $x$ and $y$ as a guideline. While the natural candidate of model is a representation on a Hilbert space spanned by the orthonormal basis of functions that respect the geometry and sampling density of the data as in [19], constructing such a basis is computationally challenging especially if $\mathcal{Z}$ is high-dimensional. In addition to the difficulty in the basis construction, the main computational cost arises as we evaluate the estimated basis functions on new points for future-time prediction. For very low-dimensional $\mathcal{Z}$ , our numerical results suggest that we can avoid all of this practical issue with classical polynomial basis functions. In this case, the form of parametric model is polynomial functions. For very high-dimensional $\mathcal{Z}$ , we showed the effectiveness of using the POD basis for representing linear problems. In nonlinear problems, we found that in some case, additional noise terms can be used to compensate for the orthogonal components that are not represented by the POD bases. In this case, the resulting parametric model is a linear non-autonomous autoregressive model. The second important feature is that the proposed nonparametric framework provides a linear technique for estimating the parameters in the resulting parametric models regardless of whether they are linear or nonlinear. This important feature is inherited from the kernel embedding formulation that allows one to “gain” linearity by representing nonlinear functions of a finite dimensional space $\mathcal{Z}$ with a basis of functions of infinite dimensional linear space. To summarize, the proposed framework and the results in this study suggest that one can understand the parametric modeling paradigm from a unified framework using appropriate Reproducing Kernel Hilbert Spaces. Such realization lies on the interpretation of the Mercer’s type kernels in (5). While the so-called kernel “trick” uses Mercer kernels to avoid the evaluation of inner product in feature space, our view point is to use the Mercer kernels to construct the parametric model of interest by an appropriate choice of finite number of basis functions. Thus, this framework turns the problem of finding the right closure model into a problem of constructing a complete basis of the Hilbert space induced by the data, which remains challenging in general.

Finally, we should also point out that the modeling framework introduced here can also be realized with any supervised learning algorithm other than the kernel embedding discussed here. In a separate report, [16], we found that the closure modeling framework introduced here is effective for high-dimensional nonlinear problems when it is realized with the Long-Short-Term-Memory (a special class of Recurrent Neural Network).

Acknowledgments

It is a great pleasure to dedicate this paper to Andrew Majda on the occasion of his 70th birthday. The research of J.H. was partially supported by the ONR Grant N00014-16-1-2888, NSF Grants DMS-1619661 and DMS-1854299. S.W.J. was supported as a postdoctoral fellow under the ONR Grant N00014-16-1-2888.

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix A Kernel mean embedding of conditional distributions

The purpose of this review is to verify Eq. (7). While the derivation here follows closely the description in [49, 48], we taylor the discussion here for Mercer-type kernels induced by orthonormal basis of $L^{2}$ -spaces. Some of the basic theory of RKHS can be found in many texts, such as [6].

First, let us repeat the discussion in Section 2.1 on $\mathcal{Z}$ . Let $\mathcal{Z}$ be a compact set and define $\hat{K}:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$ to be a kernel, which means it is symmetric positive definite and let it be bounded. By Moore-Aronszajn theorem, there exists a unique Hilbert space $\mathcal{H}_{Z}=\overline{\mbox{span}\{\hat{K}(\bm{z},\cdot),\forall\bm{z}\in\mathcal{Z}\}}$ . Let $\hat{q}:\mathcal{Z}\to\mathbb{R}$ be a positive weight function and $\{\varphi_{k}\}_{k\geq 1}$ be a set of eigenfunctions corresponding to eigenvalues $\{\xi_{k}\}$ of the following integral operator $\mathcal{\hat{K}}:L^{2}(\mathcal{Z},\hat{q})\to L^{2}(\mathcal{Z},\hat{q})$ , defined as,

[TABLE]

By Mercer’s theorem, the kernel $\hat{K}$ has the following representation,

[TABLE]

We should point out that if $\mathcal{Z}$ is not a compact domain such as $\mathbb{R}^{n}$ , with an exponentially decaying $\hat{q}$ , one can construct a bounded Mercer-type kernel as in (35) with an appropriate choice of decreasing sequence $\{\xi_{k}\}$ (see Lemma 3.2 in [52]) and it is a reproducing kernel corresponding to the RKHS $\mathcal{H}_{Z}$ (see Proposition 3.4 in [52]).

In this case, the RKHS $\mathcal{H}_{Z}$ induced by the Mercer-type kernel in (35) is a subspace of $L^{2}(\mathcal{Z},\hat{q})$ with the reproducing property corresponding to an inner product defined as $\langle f,g\rangle_{\mathcal{H}_{Z}}=\sum_{k=1}^{\infty}\frac{f_{k}g_{k}}{\xi_{k}}$ , for all $f,g\in\mathcal{H}_{Z}$ where $f_{k}=\langle f,\varphi_{k}\rangle_{L^{2}(\mathcal{Z},\hat{q})}$ and $g_{k}=\langle g,\varphi_{k}\rangle_{L^{2}(\mathcal{Z},\hat{q})}$ . Then for any $f\in\mathcal{H}_{Z}$ and $\bm{z}\in\mathcal{Z}$ , we can represent

[TABLE]

with basis of $L^{2}(\mathcal{Z},\hat{q})$ , where the convergence of the series holds uniformly (or in $C_{0}(\mathbb{R}^{n})$ for non-compact $\mathcal{Z}=\mathbb{R}^{n}$ ).

We called the Hilbert space of functions, $\mathcal{H}_{Z}$ , as an RKHS induced by the orthonormal basis of $L^{2}(\mathcal{Z},\hat{q})$ . While we have discussed $\mathcal{H}$ as an RKHS induced by the orthonormal basis of $L^{2}(\mathcal{Y},q^{-1})$ in Section 2.1, we can also repeat the argument above and construct $\mathcal{H}_{Y}$ as an RKHS induced by the orthonormal basis of $L^{2}(\mathcal{Y},q)$ . In this case, recall that while $\{\psi_{k}q\}$ are orthogonal eigenbasis of the integral operator in (4), the orthogonal basis $\psi_{k}\in L^{2}(\mathcal{Y},q)$ are eigenfunctions of an adjoint integral operator of (4). That is, one can verify that

[TABLE]

where for $f\in L^{2}(\mathcal{Y},q)$ ,

[TABLE]

and $K^{*}(x,y)=q(x)^{-1}K(x,y)q^{-1}(y)$ is also a symmetric positive definite kernel. By Mercer’s theorem, one can write

[TABLE]

Let $Y$ and $Z$ be random variables on $\mathcal{Y}$ and $\mathcal{Z}$ with distribution $P(Y,Z)$ , we define the cross-covariance operators, $\mathcal{C}_{YZ}:\mathcal{H}_{Z}\to\mathcal{H}_{Y}$ and $\mathcal{C}_{ZZ}:\mathcal{H}_{Z}\to\mathcal{H}_{Z}$ as,

[TABLE]

One can immediately see that for any $f\in\mathcal{H}_{Y}$ and $g\in\mathcal{H}_{Z}$ ,

[TABLE]

Let us define feature maps $\Psi:\mathcal{Y}\to\mathcal{F}_{Y}\subset\ell_{2}$ and $\Phi:\mathcal{Z}\to\mathcal{F}_{Z}\subset\ell_{2}$ , respectively,

[TABLE]

Then we can write

[TABLE]

where the inner products in $\mathcal{H}_{Z}$ and $\mathcal{H}_{Y}$ can be identified by $\ell_{2}$ inner products in the corresponding feature spaces. Also, for any function $f\in\mathcal{H}_{Z}$ and $\bm{z}\in\mathcal{Z}$ , we can rewrite the expansion in (36) as,

[TABLE]

where we have defined the functions $\Phi_{k}=\sqrt{\xi_{k}}\varphi_{k}\in\mathcal{H}_{Z}$ . For convenience of the discussion below, we also define the functions $\Psi_{k}:=\sqrt{\lambda_{k}}\psi_{k}\in\mathcal{H}_{Y}$ .

Using the identity in (40), we can represent the cross-operators in (39) on the basis coordinates $\Psi_{k}\in\mathcal{H}_{Y}$ and $\Phi_{\ell}\in\mathcal{H}_{Z}$ as follows:

[TABLE]

Thus, the components of the following matrix multiplication are given as,

[TABLE]

To clarify this derivation, the second equality used the definition in (43), the fourth line used the fact that $\mathcal{C}_{ZZ}^{-1}\Psi_{\ell}\in\mathcal{H}_{Z}$ can be expanded as in (42), and the rest of the lines used the standard tensor identity.

The theory of kernel mean embedding of conditional distributions (see [49, 48]) suggests that,

[TABLE]

Since $\hat{K}(\bm{z},\cdot)\in\mathcal{H}_{Z}$ , we can employ the expansion in (42) and deduce,

[TABLE]

where we have used (44) to deduce the third equality above and used the fact that $\varphi_{j}$ and $\xi_{j}$ are eigenfunction and eigenvalue of the integral operator in (34). Define,

[TABLE]

then from (43) and the definitions of the corresponding feature maps in (41),

[TABLE]

Substituting the third equation above to (46) and using the definitions of the feature maps in (41), we obtain

[TABLE]

which is exactly the claim in (7).

Appendix B ACV of the multi-scale linear Gaussian model

The full model (18)-(19) can be rewritten as

[TABLE]

where $\xi_{x}$ and $\xi_{y}$ are independent standard Gaussian noises. Similarly, the closure model (25) can be rewritten as

[TABLE]

where $\bm{x}:=\bm{x}_{t-m:t}=\left[x_{t-m},x_{t-m+1},\ldots,x_{t}\right]^{\top}$ and $\Sigma_{12}$ and $\Sigma_{22}$ are defined in Eq. (26). To simplify the notation, we drop the time indices $t-m:t$ . We also drop the “hat"-notation in $x_{t}$ and $\bm{x}_{t}$ since we will use it to denote the Fourier coefficient in this section. In this Appendix, we prove that the autocovariance (ACV) function of the closure model (48) is approximately equal to that of the full model (47) for any value of $\epsilon$ .

The Fourier transform and inverse Fourier transform is defined as

[TABLE]

The Fourier transforms of variables $x$ and $y$ of the full model (47) can be obtained as

[TABLE]

Then, for the full model (47), the resulting spectrum of $x$ is

[TABLE]

where

[TABLE]

Now we compute the Fourier transform of the closure model (48),

[TABLE]

where $\widehat{X}$ is the Fourier transform of $x_{t}$ in Eq. (48). We need to simplify the quantity

$\Sigma_{12}\Sigma_{22}^{-1}\left[\begin{array}[]{cccc}1&e^{-i\omega\tau}&\cdots&e^{-i\omega m\tau}\end{array}\right]^{\top}$ in Eq. (51). Let $S=\Sigma_{12}\Sigma_{22}^{-1}$ be the $1\times\left(m+1\right)$ vector with components denoted by $S\left[n\right]$ for $n=0,\ldots,m$ . Then, we can write

[TABLE]

which is nothing but the discrete Fourier transform of $S$ . Notice that, for any $n=0,\ldots,m$ ,

[TABLE]

where the first equality is due to the fact that the process is stationary such that $\Sigma_{22}[k,n]=\gamma_{xx,m}[n-k]$ , the second equality is due to $S\Sigma_{22}=\Sigma_{12}$ , and the last equality is by the definition of the covariance function. By the discrete convolution theorem, we have

[TABLE]

where $\widehat{\gamma}_{xx,m}$ and $\widehat{\gamma}_{xy,m}$ are the discrete Fourier transforms of $\gamma_{xx,m}$ and $\gamma_{xy,m}$ , respectively. Substituting $\widehat{S}_{m}\left(\omega\right)$ in Eq. (54) into Eq. (52), we obtain

[TABLE]

where $\widehat{\gamma}_{xx}$ and $\widehat{\gamma}_{xy}$ denote the Fourier transform of the covariance functions $\gamma_{xx}$ and $\gamma_{xy}$ .

Substituting the limiting case of Eq. (55) into Eq. (51), we can simplify the Fourier transform of the closure model as follows,

[TABLE]

Moreover, based on the Wiener-Khinchin theorem and the cross-correlation theorem, we can further simplify Eq. (56) as

[TABLE]

Substituting Eqs. (49) and (50) into above Eq. (57), we obtain the Fourier transform of the relevant variable, $\widehat{X}$ , of the closure model,

[TABLE]

which is the same as the $\widehat{x}$ of the full model in Eq. (49). Therefore, the ACV of the closure model (48) is consistent with that of the full model (47) in the limit of $m\to\infty$ . In the numerics, the error comes from the truncation of finite number of memory terms in Eq. (52).

Bibliography54

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] T. Berry and J. Harlim. Linear Theory for Filtering Nonlinear Multiscale Systems with Model Error. Proc. Roy. Soc. A 20140168 , 2014.
2[2] T. Berry and J. Harlim. Semiparametric modeling: Correcting low-dimensional model error in parametric models. Journal of Computational Physics , 308:305–321, 2016.
3[3] T. Berry and J. Harlim. Correcting biased observation model error in data assimilation. Mon. Wea. Rev. , 145(7):2833–2853, 2017.
4[4] A. Chorin, O. Hald, and R. Kupferman. Optimal prediction with memory. Physica D: Nonlinear Phenomena , 166(3):239–257, 2002.
5[5] A. Chorin and P. Stinis. Problem reduction, renormalization, and memory. Communications in Applied Mathematics and Computational Science , 1(1):1–27, 2007.
6[6] A. Christmann and I. Steinwart. Support vector machines . Springer, 2008.
7[7] D. Crommelin and E. Vanden-Eijnden. Subgrid-scale parameterization with conditional markov chains. Journal of the Atmospheric Sciences , 65(8):2661–2675, 2008.
8[8] I. Fatkullin and E. Vanden-Eijnden. A computational strategy for multiscale systems with applications to lorenz 96 model. Journal of Computational Physics , 200(2):605–638, 2004.