Comparison of Deep Neural Networks and Deep Hierarchical Models for Spatio-Temporal Data

Christopher K. Wikle

arXiv:1902.08321·stat.ML·February 25, 2019


TL;DR

This paper compares deep hierarchical models and deep learning approaches for modeling complex spatio-temporal data, highlighting their differences, similarities, and potential hybrid methods for improved scientific and computational performance.

Contribution

It introduces the deep hierarchical DSTM framework, reviews deep models in machine learning, and discusses hybrid approaches combining both paradigms.

Findings

01 Deep hierarchical models effectively handle process complexity and uncertainty.

02 Deep learning models are flexible but lack probabilistic frameworks.

03 Hybrid approaches show promise for improved spatio-temporal modeling.


Equations

$$[\mbox{observations}\;|\;\mbox{latent process}]$$

$$z(\mathbf{s}_{ij};t_{j})=Y(\mathbf{s}_{ij};t_{j})+\epsilon(\mathbf{s}_{ij};t_{j}),$$

$$Y(\mathbf{s};t)=\mu(\mathbf{s};t)+\eta(\mathbf{s};t),$$

$$\widehat{Y}(\mathbf{s}_{0};t_{0})=\mathbf{x}(\mathbf{s}_{0};t_{0})^{\prime}\widehat{\mbox{\boldmath$\beta$}}_{gls}+\mathbf{c}_{0}^{\prime}\mathbf{C}_{z}^{-1}(\mathbf{z}-\mathbf{X}\widehat{\mbox{\boldmath$\beta$}}_{gls}),$$

$${z}_{t}(\cdot)={\cal{H}}({Y}_{t}(\cdot),{\mbox{\boldmath$\theta$\unboldmath}}_{d,t},{\epsilon}_{t}(\cdot)),\;\;\;t=1,\ldots,T,$$

$${Y}_{t}(\cdot)={\cal M}({Y}_{t-1}(\cdot),{\mbox{\boldmath$\theta$\unboldmath}}_{p,t},{\eta}_{t}(\cdot)),\;\;\;t=1,2,\ldots,$$

$$Y(\mathbf{s};t)=\mathbf{x}(\mathbf{s};t)^{\prime}{\mbox{\boldmath$\beta$\unboldmath}}+\sum_{i=1}^{n_{\alpha}}\phi_{i}(\mathbf{s})\alpha_{i}(t)+\nu(\mathbf{s};t),$$

$$\mathbf{Y}|\mbox{\boldmath$\alpha$}\sim Gau(\mathbf{X}\mbox{\boldmath$\beta$}+\mbox{\boldmath$\Phi$}\mbox{\boldmath$\alpha$},\mathbf{C}_{\nu}),$$

$$\mbox{\boldmath$\alpha$}\sim Gau({\mathbf{0}},\mathbf{C}_{\alpha}).$$

$$\mathbf{Y}\sim Gau(\mathbf{X}\mbox{\boldmath$\beta$},\mbox{\boldmath$\Phi$}\mathbf{C}_{\alpha}\mbox{\boldmath$\Phi$}^{\prime}+\mathbf{C}_{\nu}).$$

$$\mbox{Response (Output)}\longleftarrow m_{1}\longleftarrow m_{2}\longleftarrow\dots\longleftarrow m_{L}\;(\longleftarrow\mbox{Input}),$$

$$\mbox{Data Models}:\;[\mbox{data}\;|\;\mbox{process},\mbox{data parameters}]$$

$$\mbox{Process Models}:\;[\mbox{process}\;|\;\mbox{process parameters}]$$

$$\mbox{Parameter Models}:\;[\mbox{data and process parameters}].$$

$$\mbox{Posterior}:\;[\mbox{process},\mbox{parameters}\;|\;\mbox{data}],$$

$$\mathbf{z}_{t}|\mathbf{Y}_{t},\mbox{\boldmath$\theta$}_{h}\sim\;{\cal D}(\mathbf{H}_{t}\mathbf{Y}_{t};\mbox{\boldmath$\theta$}_{h}),$$

$$f(\mathbf{Y}_{t})=\mbox{\boldmath$\mu$}_{t}+\mbox{\boldmath$\Phi$}\mbox{\boldmath$\alpha$}_{t}+\mbox{\boldmath$\nu$}_{t},$$

$$\mbox{\boldmath$\mu$}_{t}={\mathbf{W}}_{t}\mbox{\boldmath$\theta$}_{\mu}+{\mbox{\boldmath$\gamma$\unboldmath}}_{t},$$

$$\mbox{\boldmath$\alpha$}_{t}=g(\mbox{\boldmath$\alpha$}_{t-\tau},\mathbf{x}_{t-\tau};\mbox{\boldmath$\theta$}_{\alpha};\mbox{\boldmath$\eta$}_{t}),$$

$$[\mbox{\boldmath$\nu$}_{t}|\mbox{\boldmath$\theta$}_{\nu}],$$

$$[\mbox{\boldmath$\theta$}_{\alpha}|{\mbox{\boldmath$\zeta$}}],$$

$$[\mbox{\boldmath$\theta$}_{h},\mbox{\boldmath$\theta$}_{\nu},\mbox{\boldmath$\theta$}_{\mu},\mbox{\boldmath$\zeta$}].$$

$$\alpha_{t}(i)=\sum_{j=1}^{p}\theta^{L}_{i,j}\;\alpha_{t-\tau}(j)+\sum_{k=1}^{p}\sum_{\ell=1}^{k}\theta^{Q}_{i,k\ell}\;\alpha_{t-\tau}(k)\,g(\alpha_{t-\tau}(\ell),\mathbf{x}_{t};\mbox{\boldmath$\theta$}_{g})+\eta_{t}(i),$$

$$y_{j}=g\Big(\sum_{i=0}^{p}w_{ji}x_{i}\Big),\;\;\;j=1,\dots,J,$$

$$z_{k}=g_{o}\Big(\sum_{j=0}^{J}v_{kj}y_{j}\Big),\;\;\;k=1,\dots,m,$$

$$z_{k}(x;W,V)=g_{o}\Big(\sum_{j=0}^{J}v_{kj}\,g\Big(\sum_{i=0}^{p}w_{ji}x_{i}\Big)\Big),$$

$$z(x)=g_{o,V_{L}}(g_{W_{L}}(\dots g_{W_{1}}(x))),$$

$$z=Vy_{2},$$

$$y_{2}=g(W_{2}y_{1}+w_{0,2}),$$

$$y_{1}=g(W_{1}x+w_{0,1}),$$

$$k[x,y]*z[x,y]=\sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}k[i,j]\,z[x-i,y-j],$$

$$y_{i,j}^{\ell}=g_{p}\Big(g\Big(\sum_{a}\sum_{b}k_{a,b}^{(\ell)}y_{i+a,j+b}^{\ell-1}\Big)\Big),$$

$$z_{t}=g_{o}(Vy_{t})$$


Taxonomy

Topics: Neural Networks and Reservoir Computing · Neural Networks and Applications · Remote Sensing in Agriculture

Full text


Department of Statistics

University of Missouri, Columbia, MO

(February 15, 2019)

Abstract

Spatio-temporal data are ubiquitous in the agricultural, ecological, and environmental sciences, and their study is important for understanding and predicting a wide variety of processes. One of the difficulties with modeling spatial processes that change in time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex datasets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamic spatio-temporal models (DSTMs) that are simultaneously useful scientifically and efficient computationally. Statisticians have developed deep hierarchical models that can accommodate process complexity as well as the uncertainties in the predictions and inference. However, these models can be expensive and are typically application specific. On the other hand, the machine learning community has developed alternative “deep learning” approaches for nonlinear spatio-temporal modeling. These models are flexible yet are typically not implemented in a probabilistic framework. The two paradigms have many things in common and suggest hybrid approaches that can benefit from elements of each framework. This overview paper presents a brief introduction to the deep hierarchical DSTM (DH-DSTM) framework, and deep models in machine learning, culminating with the deep neural DSTM (DN-DSTM). Recent approaches that combine elements from DH-DSTMs and echo state network DN-DSTMs are presented as illustrations.

Keywords: Bayesian, Convolutional neural network, CNN, dynamic model, echo state network, ESN, recurrent neural network, RNN

1 Introduction

Deep learning is a type of machine learning (ML) that exploits a connected hierarchical set of models to predict or classify elements of complex data sets. The ML deep learning revolution is relatively recent and primarily associated with neural models such as feedforward neural networks (FNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), or some combination of these neural architectures. There are remarkable success stories associated with these approaches, such as models that can defeat experts in Go, Chess, or Shogi (Silver et al., 2016, 2018), and of course, there are failures as well (Shalev-Shwartz et al., 2017), albeit less publicized. Statisticians should not be surprised by the success (and failure) of these deep ML methods, as we have been using deep hierarchical models (HMs) for years. By deep hierarchical models, we mean multi-level Bayesian hierarchical models. Indeed, many of the reasons for success and failure of deep ML and deep HMs are the same. The primary purpose of this article is to discuss some of these connections in the context of an area of great interest in agricultural, environmental, and ecological statistics, namely spatio-temporal modeling, and to show some ways in which deep ML model methodologies can be utilized within a traditional statistical modeling framework.

Spatio-temporal processes are ubiquitous in the environmental sciences. They describe how spatially-dependent processes change through time, subject to various forcing mechanisms. An important modeling challenge for such processes concerns how one accounts for interactions between different scales of spatial and temporal variability, internal to the process of interest, as well as how that process interacts with other (exogenous) processes. Often spatio-temporal processes are quite nonlinear in time, at least at certain time or spatial scales. It can be difficult to model interactions for such processes in a parsimonious way, although some parametric spatio-temporal statistical models have been used in this context, often by incorporating knowledge about the underlying dynamics of the system of interest (e.g. Wikle et al., 2001; Wikle and Hooten, 2010). Such deep hierarchical dynamic spatio-temporal models (DH-DSTMs) can be quite complex. Similarly, perhaps the greatest success stories in deep ML methods have been associated with data with complex spatial and temporal dependencies. In particular, CNN models have been very successful in vision and image processing, and RNN models have exploited the complex temporal dependencies in language processing (see the overviews in Goodfellow et al., 2016; Aggarwal, 2018). Increasingly, CNN and RNN approaches are being combined to model spatio-temporal processes (e.g., Donahue et al., 2015). In this paper, we refer to such hybrid spatio-temporal models as deep neural dynamical spatio-temporal models (DN-DSTMs).

Faced with complex spatio-temporal modeling challenges, how does the environmental statistician decide which paradigm is most appropriate for their problem? DH-DSTMs and DN-DSTMs can both be challenging to implement – often requiring a great deal of training data and specialized computational algorithms. As discussed in Section 4.6, the two modeling paradigms share common (or, at least similar) solutions to these challenges. One must also consider how important uncertainty quantification (UQ) is to the problem at hand. As statisticians, we would like to think that UQ is always of fundamental importance to what we do, but the reality is that there are situations where one simply needs a prediction or classification and the UQ is secondary at best. Most DN-DSTM methods do not provide a model-based measure of uncertainty, whereas the DH-DSTM approach is built upon a framework to explicitly capture uncertainty about as many aspects of the problem as possible (data, process, and parameters). But, DN-DSTM models do have the flexibility to consider non-Markovian feedback mechanisms in time and the influence of specific events in the distant past, whereas DH-DSTMs are typically based on Markovian (i.e., autoregressive) structures. This suggests we might borrow ideas from both the DH-DSTM and DN-DSTM approaches to develop relatively parsimonious and flexible models that can accommodate real-world complexity and UQ, potentially in a computationally feasible manner. Perhaps more importantly, in some cases, these methods could be used in situations where one does not have access to tremendous sources of data (either labeled or unlabeled), especially when they are linked together with parsimonious architectures.

Section 2 provides a concise overview of spatio-temporal modeling in statistics from both the descriptive and dynamic perspective, illustrating the importance of basis-function representations. This is followed by a brief overview of deep modeling and the DH-DSTM statistical perspective in Section 3. Section 4 then gives a brief overview of deep models in machine learning and issues associated with their implementation, including deep feedforward NNs (DNNs), CNNs, RNNs, and DN-DSTMs. Section 5 then reviews some recent approaches for linking the DH-DSTM and DN-DSTM frameworks. A concluding discussion is presented in Section 6.

2 A Brief Overview of Spatio-Temporal Modeling

In statistics, we have typically been interested in spatio-temporal models that follow the general form of an observation model and a model for a spatio-temporal latent process (e.g. Cressie and Wikle, 2011; Wikle et al., 2019):

$$[\mbox{observations}\;|\;\mbox{latent process}],\tag{1}$$

$$[\mbox{latent process}],\tag{2}$$

where $[\;]$ denotes a generic distribution, $|$ denotes conditioning, and each component of the model is indexed in space and time. More formally, assume we are interested in a latent (unobserved) spatio-temporal process $\{Y(\mathbf{s};t):\mathbf{s}\in D_{s},t\in D_{t}\}$ where $\mathbf{s}$ is a spatial location in domain $D_{s}$ (a subset of $d$ -dimensional real space) and $t$ is a time index in temporal domain $D_{t}$ (along the one-dimensional real line). We then have observations $\{z(\mathbf{s}_{ij};t_{j})\}$ for spatial locations $\{\mathbf{s}_{ij}:i=1,\ldots,m_{j}\}$ and times $\{t_{j}:j=1,\ldots,T\}$ .

A common example of (1) for Gaussian spatio-temporal observations is given by

$$z(\mathbf{s}_{ij};t_{j})=Y(\mathbf{s}_{ij};t_{j})+\epsilon(\mathbf{s}_{ij};t_{j}),\tag{3}$$

where $\epsilon(\mathbf{s}_{ij};t_{j})\;\sim iid\;Gau(0,\sigma^{2}_{\epsilon})$ is the observation error process. The latent Gaussian spatio-temporal process (2) can be represented as

$$Y(\mathbf{s};t)=\mu(\mathbf{s};t)+\eta(\mathbf{s};t),\tag{4}$$

where $\mu(\mathbf{s};t)$ is a spatio-temporal mean function, and $\eta(\mathbf{s};t)$ is a mean zero Gaussian process (GP) with covariance function, say $c_{\eta}(\eta(\mathbf{s};t),\eta(\mathbf{s}^{\prime},t^{\prime}))\equiv\textrm{cov}(\eta(\mathbf{s};t),\eta(\mathbf{s}^{\prime};t^{\prime}))$ . Then, $Y(\mathbf{s};t)$ is also a GP with mean function $\mu(\mathbf{s};t)$ and covariance function $c_{\eta}(\cdot,\cdot)$ . Recall, a GP is a distribution over functions that is fully specified by a mean function and covariance function defined over the spatio-temporal domain of interest (e.g., $D_{s}\times D_{t}$ ). GPs have the very useful property that all of their finite-dimensional distributions are Gaussian (i.e., normal).

Now, say we are interested in predicting the latent process at location $(\mathbf{s}_{0};t_{0})$ given the $m=\sum_{j}m_{j}$ -dimensional observation vector $\mathbf{z}\equiv\{z(\mathbf{s}_{ij};t_{j})\}$ . The spatio-temporal (universal) kriging optimal predictor is the linear predictor $\widehat{Y}(\mathbf{s}_{0};t_{0})$ that minimizes the mean squared prediction error, $E(Y(\mathbf{s}_{0};t_{0})-\widehat{Y}(\mathbf{s}_{0};t_{0}))^{2}$ :

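$$\widehat{Y}(\mathbf{s}_{0};t_{0})=\mathbf{x}(\mathbf{s}_{0};t_{0})^{\prime}\widehat{\mbox{\boldmath$\beta$}}_{gls}+\mathbf{c}_{0}^{\prime}\mathbf{C}_{z}^{-1}(\mathbf{z}-\mathbf{X}\widehat{\mbox{\boldmath$\beta$}}_{gls}),\tag{5}$$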

where $\mathbf{x}(\mathbf{s}_{0};t_{0})$ is a $p$-vector of covariates known at all observation locations and at location $(\mathbf{s}_{0};t_{0})$, $\beta$ is the associated parameter vector, $\mathbf{X}$ is the $m\times p$ matrix of covariates at observation locations, $\mathbf{C}_{z}\equiv\textrm{var}(\mathbf{z})$ is an $m\times m$ covariance matrix, $\mathbf{c}_{0}\equiv c_{\eta}(\mathbf{z},Y(\mathbf{s}_{0};t_{0}))$ is the $m\times 1$ covariance vector between observation locations and the prediction location, and the generalized-least-squares (gls) estimator of $\beta$ in (5) is given by $\widehat{\mbox{\boldmath$ \beta $}}_{gls}\equiv(\mathbf{X}^{\prime}\mathbf{C}_{z}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{C}_{z}^{-1}\mathbf{z}$. Note that $\mathbf{C}_{z}=\mathbf{C}_{y}+\sigma^{2}_{\epsilon}\mathbf{I}=\mathbf{C}_{\eta}+\sigma^{2}_{\epsilon}\mathbf{I}$. The associated spatio-temporal kriging variance is given by $\sigma^{2}_{Y}(\mathbf{s}_{0};t_{0})=c_{0,0}-\mathbf{c}_{0}^{\prime}\mathbf{C}_{z}^{-1}\mathbf{c}_{0}+\kappa$, where $c_{0,0}\equiv\textrm{var}(Y(\mathbf{s}_{0};t_{0}))$ and $\kappa$ represents the uncertainty brought to the prediction due to the estimation of $\beta$ (e.g., Wikle et al., 2019). It is straightforward to modify these formulas to obtain predictions for many locations at once, and the approach can be extended to non-Gaussian data models as well, but without a closed-form solution (e.g., see Cressie and Wikle, 2011).
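As a rough illustration of the kriging predictor and variance just described, the following NumPy sketch computes $\widehat{\mbox{\boldmath$ \beta $}}_{gls}$, $\widehat{Y}(\mathbf{s}_{0};t_{0})$, and $\sigma^{2}_{Y}(\mathbf{s}_{0};t_{0})$ for simulated data. Several ingredients are assumptions made only for this example and are not taken from the paper: the helper names `exp_cov` and `universal_kriging`, the separable exponential covariance, the one-dimensional spatial domain, and all parameter values. The explicit expression used for $\kappa$ is the standard universal kriging form, which the text above does not spell out.

```python
import numpy as np

def exp_cov(coords_a, coords_b, sigma2=1.0, range_s=1.0, range_t=1.0):
    """Separable exponential space-time covariance (an illustrative assumption)."""
    ds = np.abs(coords_a[:, None, 0] - coords_b[None, :, 0])
    dt = np.abs(coords_a[:, None, 1] - coords_b[None, :, 1])
    return sigma2 * np.exp(-ds / range_s) * np.exp(-dt / range_t)

def universal_kriging(z, X, x0, C_z, c0, c00):
    """Kriging prediction, kriging variance, and GLS estimate at one location."""
    Cz_inv_X = np.linalg.solve(C_z, X)
    Cz_inv_z = np.linalg.solve(C_z, z)
    Cz_inv_c0 = np.linalg.solve(C_z, c0)
    A = X.T @ Cz_inv_X                               # X' C_z^{-1} X
    beta_gls = np.linalg.solve(A, X.T @ Cz_inv_z)    # GLS estimator
    y_hat = x0 @ beta_gls + Cz_inv_c0 @ (z - X @ beta_gls)
    # kappa: extra variance from estimating beta (standard universal kriging form)
    r = x0 - X.T @ Cz_inv_c0
    kappa = r @ np.linalg.solve(A, r)
    var = c00 - c0 @ Cz_inv_c0 + kappa
    return y_hat, var, beta_gls

rng = np.random.default_rng(0)
m = 30
coords = rng.uniform(0.0, 1.0, size=(m, 2))          # columns: location s, time t
X = np.column_stack([np.ones(m), coords[:, 0]])      # intercept and a spatial trend
sigma2_eps = 0.1
C_z = exp_cov(coords, coords) + sigma2_eps * np.eye(m)   # C_eta + sigma2_eps * I
beta_true = np.array([1.0, 2.0])
z = X @ beta_true + np.linalg.cholesky(C_z) @ rng.standard_normal(m)

coord0 = np.array([[0.5, 0.5]])                      # prediction location (s0; t0)
x0 = np.array([1.0, 0.5])
c0 = exp_cov(coords, coord0)[:, 0]
c00 = 1.0                                            # var(Y(s0; t0)) under exp_cov
y_hat, var, beta_gls = universal_kriging(z, X, x0, C_z, c0, c00)
print(y_hat, var)
```

Linear systems are solved with `np.linalg.solve` rather than by forming $\mathbf{C}_{z}^{-1}$ explicitly; for large $m$ this is still an $O(m^{3})$ computation.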

This approach to spatio-temporal modeling is descriptive (marginal) in that it only relies on the first and second moments of the latent process $\{Y(\mathbf{s};t)\}$ . In the spatio-temporal context this is quite useful when one does not have a great deal of knowledge about the underlying process and only needs to specify a plausible spatio-temporal covariance structure (and a spatio-temporal trend) and can rely in some sense on “Tobler’s law” that nearby things in space (and time) are more related than distant things (Tobler, 1970). However, this can be challenging for complex processes as it is difficult to specify valid covariance functions that are realistic in many situations where Tobler’s law might not hold (e.g., eddy dynamics, density-dependent growth, etc.). In addition, such second-order moment-based approaches are limiting for nonlinear and non-Gaussian processes. Practically, as shown in Figure 1, these limitations are most noticeable in situations where one is forecasting multiple time steps into the future and/or must fill in large gaps in the spatio-temporal domain of interest.

2.1 Dynamic Spatio-Temporal Models (DSTMs)

The dynamical approach to spatio-temporal process modeling in statistics is based on the idea of conditioning the spatial process at the current time on the recent past (i.e., a Markov assumption). The model is primarily concerned with specifying the evolution of the spatial field through time. This specification of the evolution of the spatial process describes the etiology of the environmental process. Such specifications have traditionally worked well when one has some underlying knowledge about the process of interest to help with estimation of the transition operator that controls the evolution (e.g., Wikle and Hooten, 2010). These models are typically most effective when forecasting multiple time steps in the future and/or predicting across large regions of space in which there are no observations.

The data model in a general DSTM can be written

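$${z}_{t}(\cdot)={\cal{H}}({Y}_{t}(\cdot),{\mbox{\boldmath$\theta$\unboldmath}}_{d,t},{\epsilon}_{t}(\cdot)),\;\;\;t=1,\ldots,T,\tag{6}$$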

where ${z}_{t}(\cdot)$ corresponds to the data at time $t$ , ${Y}_{t}(\cdot)$ the corresponding latent process of interest, with a linear or nonlinear mapping function, ${\cal{H}}(\cdot)$ , that relates the data to the latent process. The data model error is given by ${\epsilon}_{t}(\cdot)$ , and data model parameters are represented by ${\mbox{\boldmath$ \theta $\unboldmath}}_{d,t}$ . These parameters may vary spatially and/or temporally in general. An important assumption that is present here, as well as in the descriptive model presented above, is that the data ${z}_{t}(\cdot)$ are independent in time when conditioned on the true process, ${Y}_{t}(\cdot)$ , and parameters ${\mbox{\boldmath$ \theta $\unboldmath}}_{d,t}$ . (Note, as is customary in dynamic models, we represent the time index as a subscript here.)

The most important component of the DSTM is the dynamic process model. One can simplify this by making use of conditional independence through Markov assumptions (e.g., conditioned on the recent past, the process is independent of the process in the more distant past). For example, a first-order Markov process can be written

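$${Y}_{t}(\cdot)={\cal M}({Y}_{t-1}(\cdot),{\mbox{\boldmath$\theta$\unboldmath}}_{p,t},{\eta}_{t}(\cdot)),\;\;\;t=1,2,\ldots,\tag{7}$$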

where ${\cal M}(\cdot)$ is the evolution operator (linear or nonlinear), ${\eta}_{t}(\cdot)$ is the noise (error) process, and ${\mbox{\boldmath$ \theta $\unboldmath}}_{p,t}$ are process model parameters that may vary with time and/or space. Note that here we are assuming that time is discrete and equally spaced (although this can be relaxed). As an example, a linear evolution equation would be written, $\mathbf{Y}_{t}=\mathbf{M}\mathbf{Y}_{t-1}+\mbox{\boldmath$ \eta $}_{t}$ , where $\mbox{\boldmath$ \eta $}_{t}\sim Gau({\mathbf{0}},\mathbf{C}_{\eta})$ , $\mathbf{Y}_{t}$ is an $n\times 1$ vector corresponding to spatial locations, $\mathbf{M}$ is a transition matrix of dimension $n\times n$ , and $\mathbf{C}_{\eta}$ is the $n\times n$ innovation error covariance matrix (a spatial covariance matrix in this case). Typically, one would also specify a distribution for the initial state, $[{Y}_{0}(\cdot)|{\mbox{\boldmath$ \theta $\unboldmath}}_{p,0}]$ .
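To make the linear evolution equation concrete, here is a minimal simulation sketch of $\mathbf{Y}_{t}=\mathbf{M}\mathbf{Y}_{t-1}+\mbox{\boldmath$ \eta $}_{t}$ with noisy observations of the latent field. The choices below are assumptions for illustration only: a one-dimensional spatial grid, a shifted Gaussian kernel for $\mathbf{M}$ rescaled to spectral radius 0.9 so the process is stable, an exponential covariance for $\mathbf{C}_{\eta}$, a standard normal initial state, and an identity mapping with iid Gaussian error in the data model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 20, 50
s = np.linspace(0.0, 1.0, n)                     # spatial grid (assumed 1-D)

# Transition matrix M: shifted Gaussian kernel, rescaled to spectral radius 0.9
K = np.exp(-0.5 * ((s[:, None] - s[None, :] - 0.05) / 0.1) ** 2)
M = 0.9 * K / np.max(np.abs(np.linalg.eigvals(K)))

# Innovation covariance C_eta: exponential spatial covariance
C_eta = 0.1 * np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)
L_eta = np.linalg.cholesky(C_eta)

# Process model: first-order Markov evolution of the latent spatial field
Y = np.zeros((T + 1, n))
Y[0] = rng.standard_normal(n)                    # draw from an assumed initial-state distribution
for t in range(1, T + 1):
    Y[t] = M @ Y[t - 1] + L_eta @ rng.standard_normal(n)

# Data model: identity mapping plus iid Gaussian measurement error
sigma_eps = 0.2
Z = Y[1:] + sigma_eps * rng.standard_normal((T, n))
print(Y.shape, Z.shape)
```

Even in this small example $\mathbf{M}$ has $n^{2}=400$ entries, which hints at why estimating the transition operator becomes difficult as the number of spatial locations grows.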

Finally, one either estimates the parameters in (6) and (7) directly, or assigns them distributions. As discussed below, an important part of the DH-DSTM framework is modeling these parameters as processes (e.g., spatially or temporally varying, and/or allowing them to depend on auxiliary covariate information, etc.).

2.2 Basis Function Representation

Both the descriptive and dynamic approaches to spatio-temporal modeling suffer from a curse-of-dimensionality. In the descriptive case, we need to be able to efficiently calculate the inverse $\mathbf{C}_{z}^{-1}$ , and in the dynamic case, we need to be able to estimate the parameters in the transition operator (e.g., the transition matrix $\mathbf{M}$ in the linear case). This is challenging if the number of spatial locations (data and/or prediction) is large. There are a number of ways in which these issues can be mitigated (e.g., see the overview, Heaton et al., 2018, in the context of spatial models), but a common approach to both is to consider basis function representations.

Consider expanding the spatio-temporal process in a finite-dimensional basis expansion:

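$$Y(\mathbf{s};t)=\mathbf{x}(\mathbf{s};t)^{\prime}{\mbox{\boldmath$\beta$\unboldmath}}+\sum_{i=1}^{n_{\alpha}}\phi_{i}(\mathbf{s})\alpha_{i}(t)+\nu(\mathbf{s};t),$$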

where $\{\phi_{i}(\mathbf{s}):i=1,\ldots,n_{\alpha}\}$ are basis functions, $\{\alpha_{i}(t):i=1,\ldots,n_{\alpha}\}$ are the associated random expansion coefficients, and $\nu(\mathbf{s};t)$ is a relatively simple spatio-temporal process sometimes needed to represent left-over fine-scale spatio-temporal random variation. Note that we could consider basis functions that are indexed in space and time, or just time (e.g., see Wikle et al., 2019).

Of course, there is a well-known connection between covariance functions, basis functions, and kernels in the context of Mercer’s theorem and the Karhunen-Loève decomposition for GPs (e.g., see Rasmussen and Williams, 2006). But, to see the practical utility of basis function representations, one need only note that they allow us to build complexity through marginalization in a computationally efficient manner. For example, recall from linear mixed model theory that we can write (in vector/matrix form) the conditional model

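$$\mathbf{Y}|\mbox{\boldmath$\alpha$}\sim Gau(\mathbf{X}\mbox{\boldmath$\beta$}+\mbox{\boldmath$\Phi$}\mbox{\boldmath$\alpha$},\mathbf{C}_{\nu}),$$

$$\mbox{\boldmath$\alpha$}\sim Gau({\mathbf{0}},\mathbf{C}_{\alpha}).$$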

Then, integrating (marginalizing) out the random effects $\alpha$ induces dependence:

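$$\mathbf{Y}\sim Gau(\mathbf{X}\mbox{\boldmath$\beta$},\mbox{\boldmath$\Phi$}\mathbf{C}_{\alpha}\mbox{\boldmath$\Phi$}^{\prime}+\mathbf{C}_{\nu}).$$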

That is, we have constructed the marginal covariance matrix through the known basis functions and the dependence in the random effects: $\mathbf{C}_{y}=\mbox{\boldmath$ \Phi $}\mathbf{C}_{\alpha}\mbox{\boldmath$ \Phi $}^{\prime}+\mathbf{C}_{\nu}$ .

In this context, the main spatio-temporal dependence structure comes from either $\mathbf{C}_{\alpha}$ in the descriptive case, or $\mbox{\boldmath$ \alpha $}_{t}=\mathbf{M}_{\alpha}\mbox{\boldmath$ \alpha $}_{t-1}+\mbox{\boldmath$ \eta $}_{t}$ in the dynamic case. Then, the computational advantage of basis functions comes when one recognizes that $\{\mbox{\boldmath$ \alpha $}_{t}\}$ is simpler than $\{Y(\mathbf{s};t)\}$ so that $\mathbf{C}_{\alpha}^{-1}$ and/or $\mathbf{M}_{\alpha}$ are easy to obtain. This occurs when one is working with a low-rank system (i.e., $n_{\alpha}\ll n$ ) or when there are efficient algorithms for manipulating the basis functions and/or random effects (e.g., see Cressie and Wikle, 2011). Basis function approaches can be quite useful for spatio-temporal modeling, but there are still many situations that require more complicated process descriptions on the random effects. This is best considered from a hierarchical modeling perspective.
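The computational gain from a low-rank basis expansion can be sketched as follows. The example builds $\mathbf{C}_{y}=\mbox{\boldmath$ \Phi $}\mathbf{C}_{\alpha}\mbox{\boldmath$ \Phi $}^{\prime}+\mathbf{C}_{\nu}$ and solves a linear system in $\mathbf{C}_{y}$ using the Sherman-Morrison-Woodbury identity, so that only an $n_{\alpha}\times n_{\alpha}$ system is factorized. This is one standard route and is not necessarily the algorithm used in the cited references. The Gaussian bump basis functions, the diagonal form of $\mathbf{C}_{\nu}$, the helper name `lowrank_solve`, and all numerical values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_alpha = 200, 10                              # n_alpha << n
s = np.linspace(0.0, 1.0, n)
centers = np.linspace(0.0, 1.0, n_alpha)

# Basis function matrix Phi (n x n_alpha): Gaussian bumps, an assumed choice
Phi = np.exp(-0.5 * ((s[:, None] - centers[None, :]) / 0.1) ** 2)

A = rng.standard_normal((n_alpha, n_alpha))
C_alpha = A @ A.T + n_alpha * np.eye(n_alpha)     # a valid (positive-definite) covariance
c_nu = np.full(n, 0.5)                            # diagonal of C_nu (assumed diagonal)

# Marginal covariance induced by integrating out the random effects
C_y = Phi @ C_alpha @ Phi.T + np.diag(c_nu)

def lowrank_solve(Phi, C_alpha, c_nu, b):
    """Solve (Phi C_alpha Phi' + diag(c_nu)) x = b via the Woodbury identity."""
    Dinv_b = b / c_nu
    Dinv_Phi = Phi / c_nu[:, None]
    inner = np.linalg.inv(C_alpha) + Phi.T @ Dinv_Phi   # only n_alpha x n_alpha
    return Dinv_b - Dinv_Phi @ np.linalg.solve(inner, Phi.T @ Dinv_b)

b = rng.standard_normal(n)
x_fast = lowrank_solve(Phi, C_alpha, c_nu, b)
x_direct = np.linalg.solve(C_y, b)                # O(n^3) reference solution
print(np.max(np.abs(x_fast - x_direct)))
```

The low-rank solve costs roughly $O(n\,n_{\alpha}^{2})$ operations instead of $O(n^{3})$, which is the practical sense in which $n_{\alpha}\ll n$ helps.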

3 Multi-Level (Deep) Hierarchical Models

What are deep models? Although there is probably no universally agreed upon answer, one view is that a deep model is structured so that the response (output) is given by a sequence of linked (telescoping) models:

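$$\mbox{Response (Output)}\longleftarrow m_{1}\longleftarrow m_{2}\longleftarrow\dots\longleftarrow m_{L}\;(\longleftarrow\mbox{Input}),$$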

where $m_{\ell}$ corresponds to the $\ell$th model. In statistics, this is perhaps best represented by the Bayesian hierarchical modeling framework (e.g., see Gelman and Hill, 2006; Gelman et al., 2013), in which case the input is not included at the deep end of the model, but can be in any stage, or at the top. In particular, in the context of environmental statistics, the hierarchical modeling paradigm of Berliner (1996), Wikle et al. (1998), and Cressie and Wikle (2011) considers the following general distributions/models:

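$$\mbox{Data Models}:\;[\mbox{data}\;|\;\mbox{process},\mbox{data parameters}]$$

$$\mbox{Process Models}:\;[\mbox{process}\;|\;\mbox{process parameters}]$$

$$\mbox{Parameter Models}:\;[\mbox{data and process parameters}].$$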

For inference and prediction one evaluates the posterior distribution:

$$\mbox{Posterior}:\;[\mbox{process},\mbox{parameters}\;|\;\mbox{data}],$$

which is proportional to the product of the data, process, and parameter distributions given above. Typically, there are multiple sub-stages for each level, which adds to the model depth. The key to the Berliner (1996) HM paradigm (which, unfortunately, is often ignored) is that one avoids modeling second-order structure as much as possible. That is, one puts the modeling effort into the conditional mean to build dependence (complexity) through marginalization (as with the basis function illustration discussed above). So, these are linked conditional models and very much top down in the sense that inputs are usually closer to the top (data) level, although they can enter at any level in principle. The next section illustrates the general DH-DSTM deep model for complex spatio-temporal modeling.

3.1 Deep Hierarchical Dynamical Spatio-Temporal Models (DH-DSTMs)

Here we outline a prototypical DH-DSTM. For simplicity, and to compare to the deep ML models in Section 4, this model is presented in the context of discrete time and space, although time and/or space can be considered continuous more generally. For $t=1,\ldots,T$ ,

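$$\mathbf{z}_{t}|\mathbf{Y}_{t},\mbox{\boldmath$\theta$}_{h}\sim\;{\cal D}(\mathbf{H}_{t}\mathbf{Y}_{t};\mbox{\boldmath$\theta$}_{h}),\tag{9}$$

$$f(\mathbf{Y}_{t})=\mbox{\boldmath$\mu$}_{t}+\mbox{\boldmath$\Phi$}\mbox{\boldmath$\alpha$}_{t}+\mbox{\boldmath$\nu$}_{t},\tag{10}$$

$$\mbox{\boldmath$\mu$}_{t}={\mathbf{W}}_{t}\mbox{\boldmath$\theta$}_{\mu}+{\mbox{\boldmath$\gamma$\unboldmath}}_{t},\tag{11}$$

$$\mbox{\boldmath$\alpha$}_{t}=g(\mbox{\boldmath$\alpha$}_{t-\tau},\mathbf{x}_{t-\tau};\mbox{\boldmath$\theta$}_{\alpha};\mbox{\boldmath$\eta$}_{t}),\tag{12}$$

$$[\mbox{\boldmath$\nu$}_{t}|\mbox{\boldmath$\theta$}_{\nu}],\tag{13}$$

$$[\mbox{\boldmath$\theta$}_{\alpha}|{\mbox{\boldmath$\zeta$}}],$$

$$[\mbox{\boldmath$\theta$}_{h},\mbox{\boldmath$\theta$}_{\nu},\mbox{\boldmath$\theta$}_{\mu},\mbox{\boldmath$\zeta$}].$$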

The data model (9) specifies the distribution for $\mathbf{z}_{t}$ , which is a spatially-referenced data vector at time $t$ . Specifically, ${\cal D}(\cdot)$ is some generic distribution (e.g., exponential family; this is problem specific), $\mathbf{H}_{t}$ is a mapping matrix that maps the latent process locations to the data locations, $\mathbf{Y}_{t}$ is the spatially referenced latent process vector at time $t$ , and $\mbox{\boldmath$ \theta $}_{h}$ are data model parameters. The important assumptions in this data model are that the observation vectors are considered to be independent conditioned on the latent process, and the observation error structure is relatively simple (i.e., independent) since most of the dependence is attributed to the latent process. Note also that multiple data (input) sources can easily be accommodated as in the general Berliner (1996) framework.

The conditional mean (10) specifies a transformation (link function) $f(\cdot)$ , where $\mbox{\boldmath$ \mu $}_{t}$ is a time-varying spatial “trend” (note, this can depend on inputs, $\mathbf{x}_{t}$ ), $\Phi$ is a matrix of spatial basis functions (providing dimension reduction), $\mbox{\boldmath$ \alpha $}_{t}$ is a latent dynamical random process ( $n_{\alpha}\ll n_{y}$ ), and $\mbox{\boldmath$ \nu $}_{t}$ is a non-dynamic spatio-temporal random process (described below). The most important assumption of this portion of the model is that the latent dynamical process $\{\mbox{\boldmath$ \alpha $}_{t}\}$ is low dimensional.

The process mean is given in (11), where ${\mathbf{W}}_{t}$ contains covariate inputs to accommodate trends, biases, seasonality, etc., $\mbox{\boldmath$ \theta $}_{\mu}$ are the associated parameters, and ${\mbox{\boldmath$ \gamma $\unboldmath}}_{t}$ is an error process (typically, Gaussian). Note that more flexible functions of the covariates can be considered here (i.e., as in generalized additive models) if necessary, but most of the complex structure in the data is due to the $\mbox{\boldmath$ \alpha $}_{t}$ term described below. Note also that ${\mbox{\boldmath$ \gamma $\unboldmath}}_{t}$ is assumed to have mean zero and is typically assumed to be independent in time and space.

The dynamic portion of the model is given by (12), where $g(\cdot)$ is the evolution operator (potentially nonlinear in ${\mbox{\boldmath$ \alpha $}_{t-\tau}}$ and inputs ${\mathbf{x}_{t-\tau}}$ ), $\mbox{\boldmath$ \theta $}_{\alpha}$ are parameters, and $\mbox{\boldmath$ \eta $}_{t}$ is a noise process (typically assumed to be Gaussian and mean zero, with dependence structure that depends on the specific problem). This model is arguably the most important part of the DH-DSTM. It is typically highly parameterized and can, if information is available, be formulated in terms of a mechanistic model, or at least is motivated by such models. Regardless, it is crucial that this dynamical model allow for interactions in the elements of $\mbox{\boldmath$ \alpha $}_{t}$ through time (see the discussion in Wikle et al., 2019, Chapter 5). As an example, consider the general quadratic nonlinear (GQN) model of Wikle and Hooten (2010):

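$$\alpha_{t}(i)=\sum_{j=1}^{p}\theta^{L}_{i,j}\;\alpha_{t-\tau}(j)+\sum_{k=1}^{p}\sum_{\ell=1}^{k}\theta^{Q}_{i,k\ell}\;\alpha_{t-\tau}(k)\,g(\alpha_{t-\tau}(\ell),\mathbf{x}_{t};\mbox{\boldmath$\theta$}_{g})+\eta_{t}(i),\tag{14}$$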

where the evolution of an individual $\mbox{\boldmath$ \alpha $}_{t}$ component is controlled by linear interactions (the first term on the right-hand side (RHS) with parameters $\theta^{L}$) and quadratic interactions (the second term on the RHS with parameters $\theta^{Q}$), plus a noise term. The function $g(\cdot;\cdot)$ is a transformation function that is used to limit the explosive growth induced by the nonlinear interactions. This model is motivated by a wide variety of processes in the physical and biological sciences (see Wikle and Hooten, 2010) and can be quite flexible. However, this model is severely over-parameterized with $O(p^{3})$ parameters, and it requires either science-based hard thresholding or regularization/sparsity on the parameters for practical implementation.
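A single GQN update, and the parameter count behind the $O(p^{3})$ remark, can be sketched in a few lines. This is only an illustration under stated assumptions: the lag is fixed at $\tau=1$, the exogenous inputs $\mathbf{x}_{t}$ are omitted, `tanh` stands in for the transformation $g$, and the function names `gqn_step` and `gqn_param_count` and all parameter values are hypothetical.

```python
import numpy as np

def gqn_step(alpha_prev, theta_L, theta_Q, eta, g=np.tanh):
    """One GQN update with lag 1 and no exogenous inputs (both assumptions).

    theta_L has shape (p, p); theta_Q has shape (p, p, p), of which only the
    entries theta_Q[i, k, l] with l <= k enter the double sum.
    """
    p = alpha_prev.shape[0]
    linear = theta_L @ alpha_prev
    # pair[k, l] = alpha(k) * g(alpha(l)), kept only for l <= k
    pair = np.outer(alpha_prev, g(alpha_prev)) * np.tril(np.ones((p, p)))
    quadratic = np.einsum("ikl,kl->i", theta_Q, pair)
    return linear + quadratic + eta

def gqn_param_count(p):
    """Linear (p^2) plus quadratic (p * p(p+1)/2) coefficients."""
    return p * p + p * (p * (p + 1) // 2)

rng = np.random.default_rng(3)
p = 4
theta_L = 0.5 * np.eye(p)
theta_Q = 0.02 * rng.standard_normal((p, p, p))
alpha = rng.standard_normal(p)
for _ in range(50):
    alpha = gqn_step(alpha, theta_L, theta_Q, 0.1 * rng.standard_normal(p))

print(alpha)
print([gqn_param_count(q) for q in (5, 10, 20)])
```

For $p=10$ the count is already 650 coefficients, which illustrates why thresholding or regularization is needed in practice.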

The residual spatio-temporal process is given in (13), where the distribution is determined by the specific problem. For example, a useful parameterization is to assume another basis expansion such as $\mbox{\boldmath$ \nu $}_{t}=\mbox{\boldmath$ \Psi $}\mbox{\boldmath$ \omega $}_{t}+\mbox{\boldmath$ \xi $}_{t},$ where $\Psi$ is a spatial basis function matrix, $\mbox{\boldmath$ \omega $}_{t}$ are expansion coefficients, and $\mbox{\boldmath$ \xi $}_{t}$ is a simple error process (e.g., Wikle et al., 2001). The assumption here is that the complex spatio-temporal dynamics are being captured by $\mbox{\boldmath$ \alpha $}_{t}$ , so $\mbox{\boldmath$ \omega $}_{t}$ would have a simple distribution (e.g., Gaussian with perhaps simple time dependence but independent in “ $\omega$ space”), and $\mbox{\boldmath$ \xi $}_{t}$ would be independent in time and space.

As discussed above, the dynamic model for $\mbox{\boldmath$ \alpha $}_{t}$ is likely over-parameterized and often requires regularization. Any of the common approaches for regularization in the context of Bayesian models could be used here (e.g., stochastic search variable selection, spike-and-slab, or horseshoe priors; see Fan and Lv, 2010). Lastly, we require distributions or fixed values for the remaining parameters. Importantly, in the deep DH-DSTM, these parameters may themselves be “processes” (spatial or temporal) and can include dependence on various exogenous input variables. Implementation of such a deep/complex Bayesian model is typically through problem-specific MCMC algorithms, although there have been recent attempts to consider fairly complex DSTMs in a variational Bayesian context (e.g., Quiroz et al., 2018). In general, MCMC implementations can be time consuming and require significant amounts of data, prior information, and computing resources to be successful.

3.2 DH-DSTM Example: Ocean Color

Leeds et al. (2014) used a DH-DSTM to perform spatio-temporal prediction to fill gaps in SeaWiFS ocean color observations similar to the issue shown in Figure 1. They considered a multivariate model that, in addition to the SeaWiFS observations, included sea surface height (SSH) and sea surface temperature (SST) output from the Regional Ocean Modeling System (ROMS) that was coupled with a biogeochemical model for the lower trophic ecosystem. They implemented a reduced-dimension GQN process model similar to (14) as an emulator of the ROMS model (e.g., the ROMS model output was used to train prior distributions for the GQN model – analogous to ML pre-training described below). Details can be found in Leeds et al. (2014). As shown in Figure 2, the model was able to predict an eddy in the phytoplankton field despite the fact that the cloud cover in the coastal Gulf of Alaska region left persistent gaps in the SeaWiFS data. Importantly, the probabilistic nature of the model produces uncertainty measures that suggest that the biggest uncertainty is not whether there was an eddy in this area, but rather its precise location.

4 Deep Neural Models

The development and application of deep neural models has advanced rapidly over the last decade. Broad overviews can be found in textbooks such as Goodfellow et al. (2016) and Aggarwal (2018). The purpose of this section is not to give such a comprehensive treatment, but rather a brief overview to facilitate the connection to DSTMs. We describe simple feedforward neural networks (NNs), deep feedforward NNs (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). This then provides the background to discuss deep ML models for spatio-temporal data, which we call deep neural DSTMs (DN-DSTMs).

4.1 Neural Networks

We start with a very simple neural network called a single hidden layer feedforward network or single layer perceptron. Assume we have a $p$ -dimensional input vector $\mathbf{x}$ , and response (output) vector $\mathbf{z}$ , which is $m$ -dimensional (but note, $m=1$ in most nonlinear regression and binary classification problems). We now seek a nonlinear model for the responses given a transformation of the inputs through a “hidden layer” given by

[TABLE]

where $y_{j}$ is the hidden variable, $\{w_{ji}\}$ are the weights (parameters) in which $w_{j0}$ are the bias or offset (intercept) parameters (note, $x_{0}\equiv 1$ ), and $g(\cdot)$ is an activation function (e.g., a hyperbolic tangent, radial basis function, rectified linear unit, softmax, etc.). The “output layer” is then given by:

[TABLE]

where $g_{o}(\cdot)$ is an activation function (which may be the identity function), $y_{0}\equiv 1$ , and $\{v_{kj}\}$ are output weights, including an offset. One can think of the hidden layer transformation as a basis expansion of the inputs, in which case we can simply write the model as:

[TABLE]

where $\mathbf{W}=\{w_{ji}\}$ , and $\mathbf{V}=\{v_{kj}\}$ and we note that there is no explicit error term in this model.
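The single hidden layer network described above can be sketched in a few lines. This is a minimal illustration, assuming the bias terms are stored in the first column of each weight matrix (matching $x_{0}\equiv 1$ and $y_{0}\equiv 1$ ); the function name `forward`, the layer sizes, and the random weights are hypothetical.

```python
import numpy as np

def forward(x, W, V, g=np.tanh, g_o=lambda a: a):
    """Single-hidden-layer feedforward network (illustrative).

    x : (p,) input
    W : (n_h, p + 1) hidden weights, first column is the bias w_{j0}
    V : (m, n_h + 1) output weights, first column is the offset v_{k0}
    """
    x1 = np.concatenate(([1.0], x))   # prepend x_0 = 1
    y = g(W @ x1)                     # hidden layer: a basis expansion of the inputs
    y1 = np.concatenate(([1.0], y))   # prepend y_0 = 1
    return g_o(V @ y1)                # output layer

rng = np.random.default_rng(1)
p, n_h, m = 3, 5, 2
W = rng.standard_normal((n_h, p + 1))
V = rng.standard_normal((m, n_h + 1))
z = forward(rng.standard_normal(p), W, V)
```

With the identity output function this is a regression on $n_{h}$ learned basis functions, which is the "basis expansion" reading of the hidden layer.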

As with traditional nonlinear regression, to estimate the parameters in such a model (i.e., “train the network”) we select an objective function in terms of $\{{\mathbf{W}},{\mathbf{V}}\}$ (e.g., squared error, cross-entropy) and then typically use a gradient-based approach to obtain estimates of the $\mathbf{W}$ and $\mathbf{V}$ parameters. Traditionally, the NN community uses backpropagation to do this. Backpropagation is based on applying the chain rule to calculate the gradient, which is straightforward and useful due to the hierarchical/compositional nature of the model. This is implemented in a two-pass algorithm that has the important feature of locality, in that each hidden unit passes and receives information only to and from units that share a connection. This facilitates computation in a parallel computing environment, which is important for large datasets.

Because the objective function consists of a sum over the training data, which can be quite large, computation of the gradient can be expensive. In addition, there may be redundant data in the training sample. One way to mitigate these issues is to consider minimizing the expected loss, which can easily be estimated by averages of small random samples (i.e., minibatches) of the training sample. This is the essence of stochastic gradient descent (SGD), which is the dominant paradigm in modern neural computing (e.g., see Goodfellow et al., 2016; Aggarwal, 2018). Not only does it help with big data, but SGD also helps keep the optimization from getting stuck in local minima. Even these simple one-layer NNs tend to overfit and it is important that they include some form of regularization. For example, $L_{2}$ (ridge) penalties on the weights can be added to the objective function (known as “weight decay”) or $L_{1}$ (lasso) penalties can be added, which is called “weight elimination.”
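To make the minibatch idea concrete, here is a minimal sketch of SGD with an $L_{2}$ (weight decay) penalty. For brevity it fits a linear model with squared-error loss rather than a neural network; the simulated data, learning rate `lr`, penalty `lam`, and batch size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
z = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(p)
lr, lam, batch = 0.05, 1e-4, 32
for epoch in range(20):
    perm = rng.permutation(n)                 # reshuffle the training sample each epoch
    for start in range(0, n, batch):
        idx = perm[start:start + batch]       # one minibatch
        resid = X[idx] @ w - z[idx]
        # gradient of the minibatch mean squared error plus the L2 penalty lam * ||w||^2
        grad = 2.0 * X[idx].T @ resid / len(idx) + 2.0 * lam * w
        w -= lr * grad
```

Each update touches only 32 observations, so the cost per step does not grow with the size of the training sample; the same loop structure applies when the gradient comes from backpropagation.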

4.2 Deep Feedforward Networks (DNNs)

Many problems that have big data, such as acoustic processing, image processing, and natural language processing, have very complex structure and have provided the motivation for the development of a new generation of deep learning algorithms. These are typically NNs with many hidden layers, with outputs from one layer becoming the input to the next. We consider the number of units in each layer as the width and the number of layers the depth of the network. Having both width and depth provides a very flexible learning environment, but brings with it many challenges. DNNs utilize many of the technological innovations that underlie many of the current applications of deep learning in large datasets (e.g., Hinton et al., 2012). Comprehensive overviews can be found in Goodfellow et al. (2016) and Aggarwal (2018).

A basic DNN can be represented as:

[TABLE]

where $g_{o,\mathbf{V}_{L}}$ is an output function with weights $\mathbf{V}_{L}$ and $g_{\mathbf{W}_{\ell}}$ is a nonlinear activation function depending on parameters $\mathbf{W}_{\ell}$ as in (15). The hierarchical nature of a DNN is apparent in a simple example with two hidden layers and one output layer (with an identity output function):

[TABLE]

where the dimension of the hidden vectors $\mathbf{y}_{1}$ and $\mathbf{y}_{2}$ may be different. Training follows with backpropagation in an analogous way to the one-hidden layer model. A significant challenge arises because there is typically a huge number of parameters in this model as the depth increases, which makes DNNs difficult to train. In particular, in traditional applications with relatively small numbers of labeled responses there are several issues: e.g., (1) sensitivity to the number of hidden layers and number of hidden units; (2) sensitivity to other tuning parameters (one can use cross-validation if feasible); (3) extreme sensitivity to the initial values of the weights; (4) optimization is very slow on standard computation platforms; and (5) the fitted models have a propensity to overfit.

Modifications to the basic gradient-based optimization have allowed these models to be fit to large datasets. One of the first “breakthroughs” was generative pre-training. In essence, this is an attempt to get the parameters “in the ball park” before performing the backpropagation optimization. The key idea behind generative pre-training is that one learns one layer at a time with the hidden units predicted at one level then serving as the input for training the next level. This is generative in the sense that it starts at the bottom and builds one layer at a time – ultimately generating a response. The important thing here is that the associated estimates of the weights (which are approximations) just serve as starting values for the backpropagation algorithm. The backpropagation algorithm then uses all of the information and fine tunes the estimates. It is important to note that the generative pre-training does not use labeled responses, so it is unsupervised. This gives the parameters more freedom and prevents overfitting, but the backpropagation algorithm uses the labeled responses to get the final estimates. The primary generative models are restricted Boltzmann machines (RBMs) and autoencoders (e.g., Goodfellow et al., 2016). Both of these approaches have the advantage that they have undirected connections (which guide the weights towards minima that improve generalization, e.g., see Erhan et al. (2010)), are easily stacked (so that the output of one can form the input for another), and are unsupervised.

In addition to the generative pre-training, other factors have proven important for the implementation of feedforward DNNs such as: (1) use of unlabeled data to train the model (this allows more flexibility); (2) use of node dropout for regularization (shrinkage), which helps dramatically with overfitting (essentially, each node has a probability of being in the model when being trained); (3) efficient computation (i.e., these models require a lot of computational power to fit – distributed and parallel computing is essential, which has been made possible by graphics processing unit (GPU)-based parallel computing in recent years); and (4) rectified linear unit (ReLU) activation functions, $\mathrm{ReLU}(x)=\max(0,x)$ , which can lead to faster training. These models have perhaps shown the greatest success when they can also exploit the inherent multiscale nature of time and space, as with CNNs and RNNs.
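Two of the ingredients listed above, ReLU activations and node dropout, are simple enough to sketch directly. The version of dropout shown rescales the retained units by the keep probability (often called "inverted" dropout); that rescaling convention, the function names, and the keep probability of 0.8 are assumptions for illustration.

```python
import numpy as np

def relu(a):
    """Rectified linear unit: max(0, a) element-wise."""
    return np.maximum(0.0, a)

def dropout(y, keep_prob, rng, training=True):
    """Randomly zero hidden units during training; pass through unchanged otherwise."""
    if not training:
        return y
    mask = rng.random(y.shape) < keep_prob   # each unit is kept with probability keep_prob
    return y * mask / keep_prob              # rescale so the expected activation is unchanged

rng = np.random.default_rng(3)
y = np.ones(100_000)
y_train = dropout(y, keep_prob=0.8, rng=rng)
y_eval = dropout(y, keep_prob=0.8, rng=rng, training=False)
```

Because a different random subset of units is active at each training step, no single unit can be relied on too heavily, which is the sense in which dropout acts as shrinkage.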

4.3 Convolutional Neural Networks (CNNs)

One of the biggest success stories in deep learning has been CNNs, especially in the context of image processing. Recall the definition of a discrete convolution in two dimensions:

[TABLE]

where in practice, because there are a finite number of pixels in an image, the sums are finite. We can think of $k[\;]$ as a kernel weight function that is applied to elements of the spatial image $z[\;]$ . Depending on the kernel weights, one can get different properties associated with the image after doing the convolution (see Figure 3). Note, in practice, color images have pixels represented by a combination of red, green, and blue (RGB) pixels, so images are best thought of as tensors. One can easily modify the convolution function to operate on tensor-valued pixels.

The CNN considers a convolution of the image with unknown weights that are learned; this is done multiple times for each level to get different “feature maps.” That is, rather than specify kernel functions, CNNs learn them in a way that there is one set of kernel weights for each convolution, so the weights are shared across the image (this leads to a significant dimension reduction in the number of parameters that must be learned). This convolution step is then followed by a pooling layer (or, subsampling or down sampling). The pooling layer considers a small rectangular block from the convolutional step and subsamples or aggregates it in some way to produce a single output. Perhaps the most common pooling simply takes the block maximum (known as “max pooling”). Pooling is beneficial because it helps make the CNN less sensitive to translations of the input. Importantly, it also reduces the size of the next level image. The right panels in Figure 3 illustrate pooling.
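The convolution, activation, and pooling steps can be sketched as follows. This is an illustrative single-channel version with a fixed (not learned) kernel; the function names, the 6 by 6 test image, and the horizontal-gradient kernel are assumptions, and a real CNN would learn the kernel weights by backpropagation.

```python
import numpy as np

def conv2d_valid(z, k):
    """Discrete 2-D convolution, restricted to positions where the kernel fits in the image."""
    kh, kw = k.shape
    kf = k[::-1, ::-1]   # flip the kernel: true convolution rather than cross-correlation
    out = np.empty((z.shape[0] - kh + 1, z.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kf * z[i:i + kh, j:j + kw])
    return out

def max_pool(y, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a block are dropped."""
    h, w = (y.shape[0] // size) * size, (y.shape[1] // size) * size
    blocks = y[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)       # pixel value increases left to right
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)         # simple horizontal-gradient kernel
feature_map = np.maximum(conv2d_valid(image, edge_kernel), 0.0)   # convolution, then ReLU
pooled = max_pool(feature_map)                         # 4 x 4 feature map -> 2 x 2
```

The nine kernel weights are reused at every pixel location, which is the weight sharing that keeps the number of learned parameters small, and the pooling step halves each image dimension for the next level.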

The general structure of a CNN has alternating convolution layers followed by pooling layers, with the last layer being fully connected (as in the DNN). Typically, there are (1) multiple feature maps at the convolution stage created via multiple kernel weight matrices; (2) the convolved images go into a nonlinear activation function – usually, a ReLU function; and (3) pooling can occur across multiple feature maps. The critical stage of the CNN that requires estimation is the convolution step. Let $y_{i,j}^{\ell-1}$ correspond to the input to a convolutional step. The convolution is then given by:

[TABLE]

where $g_{p}(\cdot)$ is the pooling function, $g(\cdot)$ is a nonlinear activation (e.g., ReLU), and $k_{ab}$ are the kernel weights that must be learned (estimated). Note, the pooling layers are simple and are not learned. As with DNNs, training the other components of the model is accomplished through a gradient descent backpropagation algorithm, with the same enhancements described in Section 4.2.

4.4 Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) were originally developed in the 1980s to process sequence data. In recent years, they have been enhanced to be one of the most used and successful deep learning methods, particularly for language processing applications (e.g., speech recognition, text generation, machine translation, etc.). These models are analogous to multivariate state-space models for dynamical systems, as one might see in time series, econometrics, or spatio-temporal statistics. Consider a classical dynamical system: $\mathbf{y}_{t}={\cal M}(\mathbf{y}_{t-1};{\mbox{\boldmath$ \theta $\unboldmath}})$ , where $\mathbf{y}_{t}$ represents the state of the system at time $t$ . This is considered “recurrent” because the state at time $t$ refers back to the state at time $t-1$ , etc. We can rewrite this in a so-called “unfolded” form, $\mathbf{y}_{t}={\cal M}({\cal M}({\cal M}(\mathbf{y}_{t-3};\mbox{\boldmath$ \theta $});\mbox{\boldmath$ \theta $});\mbox{\boldmath$ \theta $}\cdots)$ . Note, the parameters $\theta$ are shared across all states. The hidden states are then related to an output $\mathbf{z}_{t}$ in an observation equation. As with state-space models, the states in the RNN setting are likely to depend on external inputs, $\mathbf{x}_{t}$ .

Thus, the most basic (“vanilla”) RNN is given by

[TABLE]

where $g_{o}(\cdot)$ is an output function, $g(\cdot)$ is typically a hyperbolic tangent activation function, and $\mathbf{U}$ , $\mathbf{V}$ , and $\mathbf{W}$ are weight matrices (which typically contain bias/offset terms as well). As with other NNs, to estimate the parameters, one defines a loss function and would like to optimize by backpropagation with SGD. However, there is a complication in the case of RNNs because the parameters are common across time, and so one must implement a backpropagation through time (BPTT) algorithm (e.g., see the overview in Aggarwal, 2018). A serious challenge to implementing the BPTT optimization for this vanilla RNN is the so-called vanishing gradient/exploding gradient problem. That is, the gradient can become increasingly smaller (typically) or larger as one moves through each time step (there are typically many time steps in an RNN implementation).
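The following sketch runs a vanilla RNN forward in time and then illustrates the vanishing gradient numerically. It assumes the common form $\mathbf{y}_{t}=g(\mathbf{W}\mathbf{y}_{t-1}+\mathbf{U}\mathbf{x}_{t})$ , $\mathbf{z}_{t}=g_{o}(\mathbf{V}\mathbf{y}_{t})$ with bias terms omitted; the dimensions, the random weights, and the choice of a recurrent matrix with spectral norm 0.9 are assumptions for illustration.

```python
import numpy as np

def rnn_forward(X, W, U, V, g=np.tanh, g_o=lambda a: a):
    """X : (T, p) inputs. Returns outputs Z (T, m) and hidden states Y (T, n_h)."""
    T, n_h = X.shape[0], W.shape[0]
    y = np.zeros(n_h)
    Y = np.empty((T, n_h))
    for t in range(T):
        y = g(W @ y + U @ X[t])    # the same W and U are used at every time step
        Y[t] = y
    return g_o(Y @ V.T), Y

rng = np.random.default_rng(5)
T, p, n_h, m = 50, 3, 8, 2
Q, _ = np.linalg.qr(rng.standard_normal((n_h, n_h)))
W = 0.9 * Q                                   # recurrent weights with spectral norm 0.9
U = 0.5 * rng.standard_normal((n_h, p))
V = rng.standard_normal((m, n_h))
Z, Y = rnn_forward(rng.standard_normal((T, p)), W, U, V)

# Jacobian of the final state with respect to the initial state: a product of T
# factors diag(1 - y_t^2) W, one per time step (the derivative of tanh is 1 - tanh^2).
J = np.eye(n_h)
for t in range(T):
    J = (np.diag(1.0 - Y[t] ** 2) @ W) @ J
grad_norm = np.linalg.norm(J, 2)
```

Each factor in the product has norm at most 0.9 here, so after 50 steps the gradient signal reaching the earliest state is bounded by $0.9^{50}\approx 0.005$ ; a recurrent matrix with large singular values can produce the opposite (exploding) behavior.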

There are a number of modifications to RNNs that have been specified to mitigate the vanishing/exploding gradient problem. Perhaps the most common approach includes gates that break up the temporal structure, allowing some hidden states in the past to be considered at certain time steps and others to be forgotten. For example, the long short-term memory (LSTM) RNN uses gates to create time paths that have gradients that do not vanish or explode (Hochreiter and Schmidhuber, 1997). The basic LSTM structure is given as (note, $\circ$ is the Hadamard (element-wise) product):

[TABLE]

where typically $g(\cdot)$ is a sigmoid function. The input gate selects hidden units that get input to time $t$ , the forget gate selects the hidden states at the previous time to reset to 0 at time $t$ , and the output gate selects the states that will be related to the response. The memory units are crucial as they indicate when to remember or forget previous hidden states – this memory feature is not only helpful in mitigating the vanishing/exploding gradient problem, but realistic for many processes in which events in the (distant) past can influence the present irrespective of the intervening states. A slightly simpler gated RNN that has gained recent popularity is the gated recurrent unit (GRU) RNN (Cho et al., 2014).
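A single LSTM update can be sketched as below. This follows the widely used textbook formulation of the cell (sigmoid gates, a tanh candidate, and a memory cell updated with Hadamard products), which may differ in notation from the equations above; the names `lstm_step`, `params`, and the gate labels are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell; params maps a gate name to (W, U, b)."""
    def affine(name):
        W, U, b = params[name]
        return W @ h_prev + U @ x + b
    i = sigmoid(affine("input"))          # which units receive new input
    f = sigmoid(affine("forget"))         # which memory units are retained
    o = sigmoid(affine("output"))         # which units are exposed to the output
    c_tilde = np.tanh(affine("candidate"))
    c = f * c_prev + i * c_tilde          # element-wise (Hadamard) products
    h = o * np.tanh(c)
    return h, c

n_h, n_x = 4, 2
zeros = (np.zeros((n_h, n_h)), np.zeros((n_h, n_x)))
# Biases chosen so the forget gate is fully open and the input gate is fully closed.
params = {
    "input": (*zeros, np.full(n_h, -50.0)),
    "forget": (*zeros, np.full(n_h, 50.0)),
    "output": (*zeros, np.zeros(n_h)),
    "candidate": (*zeros, np.zeros(n_h)),
}
c_prev = np.array([0.5, -0.2, 0.1, 0.9])
h, c = lstm_step(np.ones(n_x), np.zeros(n_h), c_prev, params)
```

With these gate settings the memory cell is carried forward essentially unchanged, which is the additive path that lets information (and gradients) persist across many time steps.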

In general practice, gated RNNs can be computationally intensive and often require parallelized implementations, and like the standard DNN and CNNs, require a lot of training data. In the literature, the gated algorithms are considered more as “black boxes” given their complexity, which has the benefit of making them modular and connectable (see Section 4.5).

4.4.1 Echo State Networks (ESNs)

An alternative RNN that is easy to estimate and typically requires fewer computational resources and less training data is the echo state network (ESN) (Lukoševičius and Jaeger, 2009):

[TABLE]

This looks like the basic RNN given above, but remarkably, the weight matrices $\mathbf{W}$ and $\mathbf{U}$ are sparse and chosen randomly in the ESN, so only the output matrix $\mathbf{V}$ is learned (with regularization). This use of random parameters in the nonlinear transformation is referred to generally as “reservoir computing.” One complication is that this approach requires a modification of the weights $\mathbf{W}$ (given here by $\mathbf{W}^{*}$ – see below) to ensure the “echo state property,” which essentially states that the effects of the initial conditions diminish asymptotically with time. Overall, the ESN provides an enormous reduction in parameters to be estimated and greatly simplifies the model so that $\mathbf{y}_{t}$ is simply a series of stochastic transformations of the inputs $\mathbf{x}_{t}$ based on random weights, and the $\mathbf{V}$ parameters in the output function $g_{o}(\cdot)$ can be trained as in basic statistical models (e.g., regression, logistic, softmax). The ESN usually requires more hidden units than a traditional RNN (i.e., is wider) to compensate for not learning the weights, and so one has to apply regularization when estimating $\mathbf{V}$ . We discuss ESN models in greater detail in the context of DSTMs in Section 5 below.
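A basic ESN can be sketched in NumPy as follows. The reservoir matrices are drawn once and never updated, the recurrent matrix is rescaled so that its spectral radius equals a chosen value, and only the output weights are fit by ridge regression. The uniform weight distribution, the sparsity level, the sine-wave data, and all function names are assumptions for illustration.

```python
import numpy as np

def make_reservoir(n_h, n_x, rng, nu=0.9, sparsity=0.9, a=0.5):
    """Draw sparse random W and U; rescale W so its spectral radius equals nu."""
    W = rng.uniform(-a, a, (n_h, n_h)) * (rng.random((n_h, n_h)) > sparsity)
    U = rng.uniform(-a, a, (n_h, n_x)) * (rng.random((n_h, n_x)) > sparsity)
    lam = np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius of W
    return nu * W / lam, U

def run_reservoir(X, W_star, U):
    """Hidden states y_t = tanh(W* y_{t-1} + U x_t); no weights are learned here."""
    y = np.zeros(W_star.shape[0])
    Y = np.empty((X.shape[0], y.size))
    for t in range(X.shape[0]):
        y = np.tanh(W_star @ y + U @ X[t])
        Y[t] = y
    return Y

def ridge_fit(Y, Z, r_v):
    """Estimate the output weights V (m x n_h) with a ridge penalty r_v."""
    return np.linalg.solve(Y.T @ Y + r_v * np.eye(Y.shape[1]), Y.T @ Z).T

rng = np.random.default_rng(4)
series = np.sin(0.2 * np.arange(500))
X, Z = series[:-1, None], series[1:, None]       # predict the next value of the series
W_star, U = make_reservoir(n_h=50, n_x=1, rng=rng)
Y = run_reservoir(X, W_star, U)
washout = 50                                     # discard states affected by the initial condition
V = ridge_fit(Y[washout:], Z[washout:], r_v=1e-3)
fitted = Y[washout:] @ V.T
train_sse = float(np.sum((Z[washout:] - fitted) ** 2))
```

The only estimation step is a single linear solve, which is why the ESN is so much cheaper to fit than an RNN trained by backpropagation through time.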

4.5 Deep Neural DSTMs (DN-DSTMs)

Although DNNs can be used with spatio-temporal data (Polson and Sokolov, 2017), they are not always appropriate because they do not naturally accommodate dependence structures that occur in time and space. However, given the modularity of CNNs and RNNs (i.e., they are easily “stacked” to make deeper models) it is no surprise that they can easily be combined in different ways to produce deep hybrid models for spatio-temporal data, such as video image processing and image captioning (e.g., Keren and Schuller, 2016; Tong and Tanaka, 2018). For example, images in a video can be reduced by a CNN to find spatial features and the time evolution of these features can then be modeled with an RNN (usually an LSTM). In some cases, this framework can also be used to relate images to captions or descriptions (Donahue et al., 2015). That is, the CNN is used to encode the image and the RNN is used to decode relative to a sequence of words that describes the image. The first case is clearly a spatio-temporal problem and the last is “temporal” in the context that the output (a sequence of words) has a sequential structure. In general, the ability of software packages to modularize the various machine learning components (such as CNNs and RNNs) allows developers to combine these layers in different ways. Here, our interest is with spatial processes evolving through time (i.e., analogous to the first scenario). Such approaches have been used in environmental science to produce nowcasts of precipitation (Xingjian et al., 2015).

A general approach to the hybrid DN-DSTM considers a stacked RNN but with intermediate layers that reduce dimension. This is shown schematically in Figure 4 and can be written generally as:

[TABLE]

where $g_{o}(\cdot)$ is an output function (e.g., identity for regression, softmax for classification, etc.), $g_{I}(\cdot)$ is an input function that potentially augments and/or transforms the input vector $\mathbf{x}_{t}$ , $g(\cdot)$ is some type of RNN structure (e.g., LSTM, GRU, ESN), and $\mathcal{Q}(\cdot)$ is a dimension reduction function such as a CNN or something simpler such as a principal component decomposition or some other stochastic dimension reduction approach (e.g., random projection, Bingham and Mannila (2001)). The potential parameters (weights) in each function are given by the $\theta$ s. In this framework, the components at each reduction stage, $\tilde{\mathbf{y}}_{t,\ell}$ , can influence the output, in addition to the non-reduced hidden units from Stage 1. One could also have the non-reduced hidden units from the deeper hidden stages influence the output directly as well, but this increases the number of parameters that must be learned and is typically not necessary. Finally, note that this model might be written more concisely as a telescoping functional transformation of the input:

[TABLE]

where $\Theta$ represents all of the various parameters (weights) in the functions.

The advantage of such an approach is that it naturally accommodates multiple spatial and temporal scales of variability. Note, $g_{I}(\cdot)$ acts as an encoder that transforms the inputs. For example, $g_{I}(\cdot)$ might be a CNN or it might be some other type of dimension reduction procedure (e.g., principal components, Laplacian eigenmaps, kernel convolutions, etc.). Then the $\mathcal{Q}$ functions extract important dependent features in the hidden units (that may be spatially referenced depending on the choices of $g_{I}(\cdot)$ , $g(\cdot)$ and $\mathcal{Q}$ ). The various RNN levels then act to find temporal dependencies, typically at different scales in time (e.g., Graves et al., 2013; Hermans and Schrauwen, 2013). Note that one can leave out various levels; e.g., we might leave out a $\mathcal{Q}$ stage and form a stacked RNN without the intervening reduction stage (and, vice versa). Typically, such a model would be implemented via backpropagation and SGD, depending on the choice for different model stages.

4.6 Connections between DH-DSTMs and DN-DSTMs

A natural question is how the DH-DSTMs presented in Section 3.1 compare to the DN-DSTMs presented in Section 4.5. The two paradigms have much in common in that both are trying to model complex spatio-temporal dependence. That is, both are dealing with the fact that there are multiple scales of spatio-temporal variability that interact to describe process evolution and are building that complex dependence in some sense by “marginalizing” common components. Specifically, both model frameworks: (a) consist of multiple connected telescoping levels; (b) include dimension reduction stages; (c) typically do not model second order dependence (note, GP networks and restricted Boltzmann machines are an exception); (d) can handle multiple inputs (predictors) and different output types; (e) have a very large number of parameters to estimate; (f) require a lot of training data; (g) require prior information (or, pre-training, heuristics, etc.); (h) require regularization; (i) are expensive to compute and require efficient algorithmic implementations.

The aforementioned points suggest that one of the main challenges for both the DH-DSTM and DN-DSTM frameworks is related to implementation and computation. That is, in the DH-DSTM framework, one must make many decisions concerning the types of dependence structure, whether to put structure in the covariance or the mean, the amount of mechanistic information to include, and the prior distributions, just to name a few. In addition, in these complex modeling situations, one typically must program the DH-DSTM from scratch in some relatively efficient language as the automated packages that perform Bayesian computation are often not flexible enough to accommodate DH-DSTMs, or are too inefficient (i.e., their strength in providing general solutions can be a limitation for certain specific dependence structures). Similarly, the DN-DSTM models can also have a very large number of tuning parameters and model choices (e.g., choice of $g(\cdot)$ , $\mathcal{Q}$ , the number of layers, the number of hidden units per layer, the type of regularization, pretraining, etc.). Although the aforementioned references contain suggestions for some cases, there is no universal advice for these decisions – it is very much an experience and trial-and-error endeavor. However, unlike with DH-DSTMs, there are standard software environments such as TensorFlow, Theano, Caffe, PyTorch (and many more!) that are quite flexible and, in some sense, modular, which has increased their utility in production environments.

There are a number of other structural differences between the modeling paradigms. First, the DH-DSTM framework is based on stochastic models that include distributional error terms within a valid probability construct (i.e., the joint distribution of all random components can be written as a series of conditional models). In contrast, the DN-DSTM framework is deterministic with no error terms (note the caveat that when one uses reservoir methods (e.g., ESNs for $g(\cdot)$ ), then (17) is a stochastic transformation but not a formal stochastic model). One consequence of the lack of a probabilistic structure for the DN-DSTM is that there is no clear mechanism to produce model-based estimates of uncertainty in the prediction or classification that results from the DN-DSTM. Second, one is limited in performing inference on the parameters – although, it should be noted that this would seldom be of interest in this type of model as the parameters are typically not identifiable, highly dependent, and non-interpretable.

In addition, it is still an open problem on how to generally include known relationships (e.g., such as suggested by a mechanistic model) in the deep NN framework (although, see Karpatne et al., 2017, for recent work in this area). That said, the DN-DSTM framework does have some important advantages in that it is easy to manipulate and implement different model structures (e.g., stacking different model components) in the backpropagation estimation paradigm implemented in many of the existing software packages. Finally, in the context of spatio-temporal dynamics, it should be noted that the RNN structure can naturally accommodate non-Markovian dynamics (e.g., memory of distant past events). This last point is potentially important to environmental, ecological, and agricultural applications and has not been a concentrated focus in statistical implementations of spatio-temporal models.

5 Combining the DH-DSTM and DN-DSTM Frameworks

A natural approach to combine the DH-DSTM and DN-DSTM frameworks would be to allow the parameters in the DN-DSTM to be random, perhaps add some error terms, and then implement via a Bayesian paradigm. Although Bayesian implementations of neural nets have been considered at least since the 1990s (MacKay, 1992; Neal, 1996), it is exceedingly challenging to implement deep neural models from a fully Bayesian perspective due to the extremely large number of dependent and non-identifiable parameters (see the overview in Polson et al., 2017). Such models can be implemented in some contexts (e.g. Chatzis, 2015; Chien and Ku, 2016; Gan et al., 2016; McDermott and Wikle, 2017a) but are quite sensitive to particular data sets and are typically computationally prohibitive. More recently, approximate Bayesian methods such as variational Bayes (Tran et al., 2018), and scalable Bayesian methods (Snoek et al., 2015) have been used successfully in deep models. In the context of DN-DSTMs this is still an active area of research.

Alternatively, two relatively simple approaches have recently been used to blend the DN-DSTM and DH-DSTM paradigms. These do so in a way that also mitigates the challenges associated with implementing DH-DSTMs. That is, DH-DSTMs typically suffer from a curse of dimensionality in parameter space and require a large amount of data and fairly specialized computational algorithms and, thus, are fairly inefficient to develop and implement. The hybrid approaches mitigate these issues but still provide a flexible and effective approach to model complex spatio-temporal processes in a manner that accounts for uncertainty quantification.

5.1 An Ensemble Approach

McDermott and Wikle (2017b) made several modifications to the standard ESN model to account for a simple approach to uncertainty quantification in a spatio-temporal nonlinear forecasting setting. They considered a quadratic ESN model. That is, for $t=1,\ldots,T$ , let:

[TABLE]

where $g(\cdot)$ is an activation function (usually a hyperbolic tangent function); $\lambda_{w}$ is the “spectral radius” (the largest eigenvalue of $\mathbf{W}$ in modulus); $\nu$ is a scaling parameter taking values in $[0,1]$ that helps control the amount of memory in the system; $\mathbf{W}$ , $\mathbf{U}$ , $\mathbf{V}_{1}$ , and $\mathbf{V}_{2}$ are weight matrices; $\delta_{o}$ is a Dirac function; $\gamma^{w}_{i,\ell}$ , $\gamma^{u}_{i,\ell}$ denote indicator variables; and $\pi_{w}$ , $\pi_{u}$ represent the probability of a parameter in the weight matrices being 0. Note, dividing by the spectral radius in (19) ensures the echo state property mentioned previously, and $\nu$ controls the memory. The only parameters that are estimated in this model are those in $\mathbf{V}_{1}$ and $\mathbf{V}_{2}$ , and $\sigma^{2}_{\epsilon}$ from Equation (18), for which we use a ridge penalty hyperparameter, $r_{v}$ . Again, it is important to note that $\mathbf{W}$ and $\mathbf{U}$ are not estimated, but simply drawn from (20) and (21), respectively. The hyperparameters $\pi_{w}$ , $\pi_{u}$ , $a_{w}$ , $a_{u}$ , $\nu$ , and $r_{v}$ are specified as discussed below.

The modifications of the ESN that make it useful as a DSTM are the inclusion of the explicit error term, ${{\mbox{\boldmath$ \epsilon $\unboldmath}}_{t}}$ , the quadratic term $\mathbf{V}_{2}\mathbf{y}^{2}_{t}$ and, most importantly, vector embeddings of the inputs:

[TABLE]

An embedding includes lagged values of the input predictor and is important due to Takens’ theorem (Takens, 1981) in dynamical systems, which states that one can represent a state space of high dimension by a sufficiently large number of lagged values of a portion of the state space. Note that the results are not very sensitive to $\{\pi_{w},\pi_{u},a_{w},a_{u}\}$ and they are usually fixed at small values, but the results can be sensitive to $\{n_{h},\nu,r_{v}\}$ , so they are chosen by cross-validation.
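A lagged embedding of the inputs can be built with a few array slices. The sketch below assumes the embedded vector stacks $\mathbf{x}_{t},\mathbf{x}_{t-\tau},\ldots,\mathbf{x}_{t-m\tau}$ for an embedding length $m$ and lag $\tau$ ; the function name `embed` and the toy input are hypothetical.

```python
import numpy as np

def embed(X, m, tau):
    """Stack x_t, x_{t-tau}, ..., x_{t-m*tau} into one vector per time step.

    X : (T, n_x) input series.
    Returns an array of shape (T - m*tau, (m + 1) * n_x); the first m*tau
    time points are dropped because their lagged values are not available.
    """
    T = X.shape[0]
    start = m * tau
    cols = [X[start - j * tau : T - j * tau] for j in range(m + 1)]
    return np.hstack(cols)

X = np.arange(10, dtype=float).reshape(10, 1)
X_tilde = embed(X, m=2, tau=3)    # first row is [x_6, x_3, x_0]
```

The embedded input grows by a factor of $m+1$ in dimension, so $m$ and $\tau$ are natural candidates for tuning alongside the other hyperparameters.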

McDermott and Wikle (2017b) consider a simple ensemble forecast approach (analogous to a parametric bootstrap; Sheng et al., 2013), in which multiple samples from the reservoir matrices $\mathbf{W}$ and $\mathbf{U}$ are drawn and the model is refit for each parameter set. This gives a distribution of the output predictions and allows the quantification of uncertainty in the predictions. They present an example in which this quadratic ensemble ESN (Q-EESN) model is used to generate long-lead (6-month) forecasts of tropical Pacific SST (i.e., El Niño and La Niña events). The model performed very well. For example, Figure 6 shows the prediction and prediction uncertainty for a forecast of SST in December 2017 given data through June 2017 (which exhibited a La Niña event). Note, however, that the dynamical and statistical forecasts presented for this same period by the US National Oceanic and Atmospheric Administration’s Climate Prediction Center (CPC) and the International Research Institute (IRI) for Climate and Society at Columbia University (see https://iri.columbia.edu/our-expertise/climate/forecasts/enso/2017-July-quick-look/?enso_tab=enso-sst_table) did not suggest a La Niña would develop (their probability forecast was around 15% for a La Niña for this period). The reasons for the success of the Q-EESN approach here are likely related to the fact that the ESN is a dynamic model that incorporates nonlinear interactions, but also that it augments the input space to perform a regression (Gallicchio and Micheli, 2011). That is, the dimension of $\mathbf{y}_{t}$ is typically larger than that of $\tilde{\mathbf{x}}_{t}$ (i.e., a dimension expansion of the potential predictors). In addition, the small, sparse, random weights limit overfitting and regularize the regression. Finally, the embedded inputs in the Q-EESN implementation allow for additional nonlinearity, and the ensemble bootstrap approach with relatively few hidden units provides a “committee of weak learners.” It is important to note that this approach takes just seconds to implement on a laptop computer, compared to hours for traditional DH-DSTM approaches.
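The ensemble idea can be illustrated as follows: draw a fresh reservoir, refit the ridge readout, forecast, and repeat, then summarize the spread of the forecasts. This is a rough sketch on a synthetic sinusoid, not the Q-EESN code; `fit_predict_once`, the toy series, and every setting are hypothetical. The interval below reflects only the variability across reservoir draws and ignores the additive error term, so it is presumably narrower than a full predictive interval.

```python
import numpy as np

def fit_predict_once(rng, X_train, Z_train, X_new, n_h, nu, r_v, a=0.1, p_nonzero=0.3):
    """Draw one reservoir, fit the ridge readout, and forecast for X_new."""
    n_x = X_train.shape[1]
    W = (rng.random((n_h, n_h)) < p_nonzero) * rng.uniform(-a, a, (n_h, n_h))
    U = (rng.random((n_h, n_x)) < p_nonzero) * rng.uniform(-a, a, (n_h, n_x))
    W *= nu / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius nu

    def states(X, h0):
        H, h = [], h0
        for x in X:
            h = np.tanh(W @ h + U @ x)
            H.append(h)
        return np.array(H)

    H_train = states(X_train, np.zeros(n_h))
    D = np.hstack([H_train, H_train**2])             # linear and quadratic terms
    V = np.linalg.solve(D.T @ D + r_v * np.eye(2 * n_h), D.T @ Z_train)
    H_new = states(X_new, H_train[-1])               # continue from last training state
    return np.hstack([H_new, H_new**2]) @ V

# Synthetic example: predict a sinusoid 6 steps ahead.
rng = np.random.default_rng(1)
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12)
X = series[:-6, None]   # predictor at time t
Z = series[6:, None]    # response 6 steps later
X_train, Z_train, X_new = X[:100], Z[:100], X[100:]

draws = np.stack([
    fit_predict_once(rng, X_train, Z_train, X_new, n_h=20, nu=0.5, r_v=0.1)
    for _ in range(50)   # 50 ensemble members
])
mean = draws.mean(axis=0)
lower, upper = np.quantile(draws, [0.025, 0.975], axis=0)
```

Each ensemble member costs one reservoir pass and one small linear solve, which is consistent with the short run times reported above.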

5.2 A Deep Basis Function Approach

The Q-EESN model has no mechanism to link hidden layers, which are important for processes that occur on multiple time scales. There have been deep ESN models implemented in the ML literature (e.g., Jaeger, 2007; Triefenbach et al., 2013; Antonelo et al., 2017; Ma et al., 2017; Gallicchio et al., 2018), but these approaches generally do not accommodate uncertainty quantification and are not designed for spatio-temporal systems. However, one could extend these deep ESN models to accommodate spatio-temporal processes as in (4.5). For example, McDermott and Wikle (2018) did this within an ensemble parametric bootstrap context to account for multiple time scales and uncertainty in predictions. They also consider an implementation where (4.5) is used to generate basis functions that are a stochastic transformation of the inputs. This is especially useful in a spatio-temporal regression context, i.e., when one seeks to predict one spatio-temporal process based on another. Specifically, consider the model:

[TABLE]

where $\mathbf{y}_{t,1}^{(j)}$ and $\widetilde{\mathbf{y}}_{t,\ell}^{(j)}$ are a function of $\tilde{\mathbf{x}}_{t-\tau}$ as given in (4.5), and $\boldsymbol{\beta}_{\ell}^{(j)}$ are the associated regression coefficients for the $j$th ensemble and $\ell$th level. Importantly, the $\mathbf{y}$s are generated “offline” from an ensemble deep ESN with principal component reduction stages for $\mathcal{Q}$. In addition,

[TABLE]

are fixed at small values, and the number of hidden units for all layers except the first is fixed since all of these layers go through the dimension reduction function $\mathcal{Q}$. Finally,

[TABLE]

are selected by a genetic algorithm. The parametric bootstrap approach generates $n_{res}$ ensemble members of these deep ESNs, indexed by $j=1,\ldots,n_{res}$, by sampling different weight matrices, as with the Q-EESN model in (20) and (21) above.
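One way to picture how such basis functions might be generated is the sketch below, which chains reservoirs and inserts a principal component reduction between them. It is a loose illustration only: the function names, the use of an SVD for the reduction, the layer sizes, and the layer ordering (indexed here from the input side, which may be the reverse of the paper's indexing) are all assumptions rather than details from McDermott and Wikle (2018).

```python
import numpy as np

def sparse_uniform(rng, shape, a=0.1, p_nonzero=0.3):
    """Entries are Unif(-a, a) with prob. p_nonzero, else exactly 0."""
    return (rng.random(shape) < p_nonzero) * rng.uniform(-a, a, shape)

def reservoir_states(rng, X, n_h, nu):
    """Run one randomly drawn reservoir over the rows of X."""
    W = sparse_uniform(rng, (n_h, n_h))
    U = sparse_uniform(rng, (n_h, X.shape[1]))
    W *= nu / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius nu
    H, h = np.zeros((X.shape[0], n_h)), np.zeros(n_h)
    for t, x in enumerate(X):
        h = np.tanh(W @ h + U @ x)
        H[t] = h
    return H

def pca_reduce(H, n_r):
    """Project centred states onto their first n_r principal components."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:n_r].T

def deep_esn_basis(rng, X, n_layers, n_h, n_r, nu):
    """Columns: reduced states of every layer but the last, then the last layer in full."""
    blocks, inp = [], X
    for layer in range(n_layers):
        H = reservoir_states(rng, inp, n_h, nu)
        if layer < n_layers - 1:
            inp = pca_reduce(H, n_r)   # reduced states feed the next layer
            blocks.append(inp)
        else:
            blocks.append(H)           # final layer kept at full dimension
    return np.hstack(blocks)

rng = np.random.default_rng(2)
X = rng.standard_normal((150, 4))      # toy (already embedded) inputs
B = deep_esn_basis(rng, X, n_layers=3, n_h=30, n_r=5, nu=0.6)
# B has 5 + 5 + 30 = 40 columns, usable as predictors in a regularized regression.
```

Repeating `deep_esn_basis` with fresh random draws would give one design matrix per ensemble member.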

As an example, McDermott and Wikle (2018) consider 6-month long-lead forecasts of soil moisture over the US corn belt given Pacific SST. Figure 6 shows the out-of-sample forecast for May 2014 given SSTs from November 2013 based on a 3-level deep ensemble ESN model. They show that this model performed best among a variety of models in terms of the continuous ranked probability score and second best in terms of mean squared prediction error (the 2-level version of this model performed slightly better on this metric).

This approach is essentially a high-dimensional regression problem in which one generates a collection of basis functions by stochastic transformation of the inputs through the deep ESN model. Multiple such transformations are considered as potential predictors to give the approach flexibility and reproducibility. The large number of predictors is controlled by SSVS regularization. Note that the inputs (predictors) in this model are stochastically and dynamically transformed. Thus, the spatio-temporal regression model is not itself dynamic but, importantly, the transformations are dynamic through the ESN structure. These multiple levels of transformation allow for different time and spatial scales in the predictor variables to affect the response. Importantly, by including the dynamics in the transformation (offline), this framework is very easy to implement through regularized regression methods, and it is relatively efficient (compared to deep parametric statistical models and deep ML models) due to the reservoir approach in the ESN and simple regularization. The data model here can easily accommodate other data types, such as with deep Bayesian implementations of generalized linear mixed models (e.g., Tran et al., 2018).

6 Discussion

One of the fundamental principles of DH-DSTMs is that to model complex processes across multiple time and spatial scales, one benefits from considering a sequence of linked probability models. In particular, because it is very difficult to specify the dependence structure for complex (e.g., nonlinear) spatio-temporal processes, one places modeling effort into the conditional mean and takes advantage of building dependence through marginalization. Similarly, the deep neural models in ML that have become so popular in the last decade for image and language processing (e.g., DNNs, CNNs, RNNs) are also based on a sequence of linked models (typically, not stochastic models), with the outputs from one level becoming the inputs for the next. The spatio-temporal versions of these models, DN-DSTMs, typically combine CNNs and RNNs and also seek to build complexity by learning which scales of spatial and/or temporal variability are important for predicting responses. These modeling frameworks have many practical issues in common, including the need for large training data sets, dimension reduction, regularization, and efficient computation. Recent approaches to mitigate some of these issues, e.g., to apply the models when one does not have a huge amount of training data, have benefited from considering reservoir computing in the context of ESNs. In spatio-temporal problems, these models have been placed in a statistical context through the use of parametric bootstrap and basis function transformation approaches. These can be implemented at a fraction of the cost of traditional DH-DSTMs but still retain a probability formulation that allows uncertainty quantification, while benefiting from the ability of DN-DSTMs to flexibly model multiple time and spatial scales.

We have only scratched the surface in terms of blending the DH-DSTMs and DN-DSTMs for agricultural, ecological, and environmental statistics. One important challenge is to be able to include mechanistic information efficiently in this blended framework. Traditionally, it has been challenging to include such information in DN-DSTMs due to the conflict between mechanistic formulations and flexible learning formulations, and because of the challenge in training such models via gradient-based optimization. In addition, there are potential advancements that can be obtained by including ideas from deep reinforcement learning (e.g., see the overview in Aggarwal, 2018). Such methods train models so that they are rewarded for good decisions and penalized for poor decisions. This is the technology that was used for AlphaGo (Silver et al., 2016) and later game-playing algorithms (Silver et al., 2018). Useful connections to DH-DSTMs in environmental statistics are likely, given the long history of using reinforcement learning in control engineering. In addition, it is likely that the hybridization of DH-DSTMs and DN-DSTMs can benefit from the recent advances in generative adversarial networks (Goodfellow et al., 2014). This approach trains models in a way that benefits from two NNs competing against each other. In particular, one network generates potential solutions and the other network evaluates or discriminates these solutions. Indeed, the literature in deep neural modeling is advancing very rapidly, and it will be exciting to see which of these methods and approaches can be included in more traditional probabilistic DSTM frameworks.

Acknowledgments

This work was partially supported by the US National Science Foundation (NSF) and the US Census Bureau under NSF grant SES-1132031, funded through the NSF-Census Research Network (NCRN) program, and NSF award DMS-1811745. The author would like to thank Brian Reich for encouraging the writing of this paper, Patrick McDermott for helpful discussions, and Nathan Wikle for providing helpful comments on an early draft.

Bibliography

Aggarwal, C. C. (2018), Neural networks and deep learning, Springer.
Antonelo, E. A., Camponogara, E., and Foss, B. (2017), “Echo State Networks for data-driven downhole pressure estimation in gas-lift oil wells,” Neural Networks, 85, 106–117.
Berliner, L. M. (1996), “Hierarchical Bayesian time series models,” in Maximum Entropy and Bayesian Methods, eds. Hanson, K. M. and Silver, R. N., Dordrecht: Kluwer, Fundamental Theories of Physics, 79, pp. 15–22.
Bingham, E. and Mannila, H. (2001), “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 245–250.
Chatzis, S. P. (2015), “Sparse Bayesian Recurrent Neural Networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp. 359–372.
Chien, J.-T. and Ku, Y.-C. (2016), “Bayesian recurrent neural network for language modeling,” IEEE transactions on neural networks and learning systems, 27, 361–374.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014), “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078.
Cressie, N. and Wikle, C. K. (2011), Statistics for Spatio-Temporal Data, Hoboken, NJ: John Wiley & Sons.