High-Dimensional Bayesian Geostatistics

Sudipto Banerjee

arXiv:1705.07265·stat.ME·May 23, 2017


TL;DR

This paper reviews two scalable Bayesian spatiotemporal modeling approaches—low-rank processes and Nearest-Neighbor Gaussian Processes—that enable efficient analysis of large geostatistical datasets by reducing computational complexity.

Contribution

It reviews and compares two methods for constructing scalable Bayesian spatiotemporal priors suitable for large datasets, addressing computational challenges in hierarchical models.

Findings

1. Both methods achieve computational complexity that is linear in the number of locations (per iteration).

2. The approaches facilitate full Bayesian inference for large-scale spatiotemporal data.

3. The comparison provides insight into their methodological differences and applications.


Tables

Table 1: Parameter estimates for the predictive process (PP) and modified predictive process (MPP) models in the univariate simulation.

| Model | $\mu$ | $\sigma^{2}$ | $\tau^{2}$ | RMSPE |
|---|---|---|---|---|
| True | 1 | 1 | 1 | – |
| PP, $m=49$ | 1.37 (0.29, 2.61) | 1.37 (0.65, 2.37) | 1.18 (1.07, 1.23) | 1.21 |
| MPP, $m=49$ | 1.36 (0.51, 2.39) | 1.04 (0.52, 1.92) | 0.94 (0.68, 1.14) | 1.20 |
| PP, $m=144$ | 1.36 (0.52, 2.32) | 1.39 (0.76, 2.44) | 1.09 (0.96, 1.24) | 1.17 |
| MPP, $m=144$ | 1.33 (0.50, 2.24) | 1.14 (0.64, 1.78) | 0.93 (0.76, 1.22) | 1.17 |
| PP, $m=900$ | 1.31 (0.23, 2.55) | 1.12 (0.85, 1.58) | 0.99 (0.85, 1.16) | 1.17 |
| MPP, $m=900$ | 1.31 (0.23, 2.63) | 1.04 (0.76, 1.49) | 0.98 (0.87, 1.21) | 1.17 |

Table 2: Posterior parameter estimates, the Kullback-Leibler divergence (KL-D) and root mean square predictive errors (RMSPE) for four NNGP models constructed from different topological orderings. The four orderings, from left to right, are sorted on the sum of vertical and horizontal coordinates, maximum-minimum distance (MMD; Guinness, 2016), sorted on horizontal coordinate, and sorted on vertical coordinate.

| | True | Sorted coord (x+y) | MMD | Sorted x | Sorted y |
|---|---|---|---|---|---|
| $\sigma$ | 1 | 0.79 (0.69, 1.04) | 0.80 (0.69, 1.02) | 0.80 (0.70, 1.05) | 0.83 (0.69, 1.08) |
| $\tau$ | 0.45 | 0.45 (0.44, 0.46) | 0.45 (0.44, 0.47) | 0.45 (0.44, 0.46) | 0.45 (0.44, 0.47) |
| $\phi$ | 5 | 8.11 (4.42, 11.10) | 7.63 (4.58, 10.97) | 8.01 (4.26, 11.18) | 7.12 (4.06, 11.03) |
| KL-D | – | 24.04022 | 13.88847 | 22.30667 | 21.59174 |
| RMSPE | – | 0.5278996 | 0.5278198 | 0.527912 | 0.527807 |

Equations

y(\ell) = x^{\top}(\ell)\beta + w(\ell) + \epsilon(\ell),

p(\theta, \beta, \tau) \times N(w \mid 0, K_{\theta}) \times N(y \mid X\beta + w, D_{\tau}),

w(\ell) \approx \tilde{w}(\ell) = \sum_{j=1}^{r} b_{\theta}(\ell, \ell_{j}^{*}) z(\ell_{j}^{*}) = b_{\theta}^{\top}(\ell) z,

\mbox{cov}(\tilde{w}(\ell), \tilde{w}(\ell^{\prime})) = b_{\theta}^{\top}(\ell) V_{z} b_{\theta}(\ell^{\prime}),

N(z \mid 0, V_{z}) \times N(y \mid B_{\theta} z, D_{\tau}),

(D_{\tau} + B_{\theta} V_{z} B_{\theta}^{\top})^{-1} = D_{\tau}^{-1} - D_{\tau}^{-1} B_{\theta} (V_{z}^{-1} + B_{\theta}^{\top} D_{\tau}^{-1} B_{\theta})^{-1} B_{\theta}^{\top} D_{\tau}^{-1}.

\det(D_{\tau} + B_{\theta} V_{z} B_{\theta}^{\top}) = \det(V_{z}) \det(D_{\tau}) \det(V_{z}^{-1} + B_{\theta}^{\top} D_{\tau}^{-1} B_{\theta}),

\underbrace{\begin{bmatrix} D_{\tau}^{-1/2} y \\ 0 \end{bmatrix}}_{y_{\ast}} = \underbrace{\begin{bmatrix} D_{\tau}^{-1/2} B_{\theta} \\ V_{z}^{-1/2} \end{bmatrix}}_{B_{\ast}} z + \underbrace{\begin{bmatrix} e_{1} \\ e_{2} \end{bmatrix}}_{e_{\ast}}\;, \quad \mbox{where } e_{\ast} \sim N(0, I_{n+r})\;,

\tilde{w}(\ell) = \mbox{E}[w(\ell) \mid w^{*}] = K_{\theta}(\ell, U^{*}) K_{\theta}^{-1}(U^{*}, U^{*}) w^{*}.

\mbox{var}\{w(\ell)\} = \mbox{var}\{\mbox{E}[w(\ell) \mid w^{*}]\} + \mbox{E}\{\mbox{var}[w(\ell) \mid w^{*}]\} \geq \mbox{var}\{\mbox{E}[w(\ell) \mid w^{*}]\},

K_{\eta, \theta}(\ell, \ell^{\prime}) = K_{\theta}(\ell, \ell^{\prime}) - K_{\theta}(\ell, U^{*}) K_{\theta}^{-1}(U^{*}, U^{*}) K_{\theta}(U^{*}, \ell^{\prime}).

\tilde{w}_{\epsilon}(\ell) = \tilde{w}(\ell) + \tilde{\epsilon}(\ell),

p(\theta) \times p(\tau) \times N(\beta \mid \mu_{\beta}, V_{\beta}) \times N(z \mid 0, V_{z, \theta}) \times N(y \mid X\beta + B_{\theta} z, D_{\tau}),

(B_{\theta} V_{z, \theta} B_{\theta}^{\top} + D_{\tau})^{-1}

\log p(\theta) + \log p(\tau) - \frac{1}{2} \sum_{i=1}^{n} d_{ii} + \sum_{i=1}^{r} \log t_{ii} - \frac{1}{2} (m_{1}^{\top} m - m_{2}^{\top} m_{2}),

p(w_{1}) \prod_{i=2}^{n} p(w_{i} \mid w_{1}, \dots, w_{i-1}) = \prod_{i=1}^{n} p(w_{i} \mid w_{\mbox{Pa}[i]}),

p(w_{1}) \times p(w_{2} \mid w_{1}) \times p(w_{3} \mid w_{1}, w_{2}) \times p(w_{4} \mid w_{1}, w_{2}, w_{3}) \times p(w_{5} \mid w_{1}, w_{2}, w_{3}, w_{4})

\times p(w_{6} \mid w_{1}, w_{2}, w_{3}, w_{4}, w_{5}) \times p(w_{7} \mid w_{1}, w_{2}, w_{3}, w_{4}, w_{5}, w_{6}).

w_{1}

\begin{array}{l}\texttt{for(i in 1:(n-1)) \{}\\ \qquad\texttt{a[i+1,1:i] = solve(K[1:i,1:i], K[1:i,i+1])}\\ \qquad\texttt{d[i+1,i+1] = K[i+1,i+1] - dot(K[i+1,1:i], a[i+1,1:i])}\\ \texttt{\}}\end{array}

\begin{array}{l}\texttt{for(i in 1:(n-1)) \{}\\ \qquad\texttt{Pa = N[i+1] \# neighbors of i+1}\\ \qquad\texttt{a[i+1,Pa] = solve(K[Pa,Pa], K[(i+1),Pa])}\\ \qquad\texttt{d[i+1,i+1] = K[i+1,i+1] - dot(K[(i+1),Pa], a[i+1,Pa])}\\ \texttt{\}}\end{array}

N(w_{R} \mid 0, K_{\theta})

N(\ell_{i}) = \left\{\begin{array}{l}\mbox{empty set for } i=1\\ H(\ell_{i}^{*}) = \{\ell^{*}_{1}, \ell^{*}_{2}, \ldots, \ell^{*}_{i-1}\}\;\mbox{ for }\; i=2,3,\ldots,m\\ m \mbox{ nearest neighbors of } \ell^{*}_{i} \mbox{ among } H(\ell_{i}^{*})\;\mbox{ for }\; i=m+1,\ldots,n\end{array}\right.

w(\ell) = \sum_{i=1}^{r} a_{i}(\ell) w(\ell_{i}^{*}) + \eta(\ell) \quad \mbox{for any } \ell \notin R,

\delta^{2}(\ell) = K_{\theta}(\ell, \ell) - K_{\theta}(\ell, N(\ell)) K_{\theta}^{-1}(N(\ell), N(\ell)) K_{\theta}(N(\ell), \ell).

\begin{array}{rl} Y(\ell) \,|\, g(\cdot), \beta, w(\ell) & \stackrel{ind}{\sim} P_{\tau} \quad \mbox{exponential family}\;,\\ g(\mbox{E}[Y(\ell)]) & = x^{\top}(\ell)\beta + w(\ell)\;, \quad w(\ell) \sim NNGP(0, \tilde{K}_{\theta}(\cdot,\cdot))\;,\\ \{\theta, \beta, \tau\} & \sim p(\theta, \beta, \tau)\;,\end{array}

Y(\ell)


Full text

High-dimensional Bayesian Geostatistics

Sudipto Banerjee, UCLA Department of Biostatistics

650 Charles E. Young Drive South

Los Angeles, CA 90095-1772.

Abstract

With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatiotemporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models infeasible for large data sets. This article offers a focused review of two methods for constructing well-defined, highly scalable spatiotemporal stochastic processes. Both these processes can be used as “priors” for spatiotemporal random fields. The first approach constructs a low-rank process operating on a lower-dimensional subspace. The second approach constructs a Nearest-Neighbor Gaussian Process (NNGP) that ensures sparse precision matrices for its finite realizations. Both processes can be exploited as a scalable prior embedded within a rich hierarchical modeling framework to deliver full Bayesian inference. These approaches can be described as model-based solutions for big spatiotemporal datasets. The models ensure that the algorithmic complexity is $\sim n$ floating point operations (flops) per iteration, where $n$ is the number of spatial locations. We compare these methods and provide some insight into their methodological underpinnings.

Keywords: Bayesian statistics, Gaussian process, Low rank Gaussian process, Nearest Neighbor Gaussian process (NNGP), Predictive process, Sparse Gaussian process, Spatiotemporal statistics.

1 Introduction

The increased availability of inexpensive, high speed computing has enabled the collection of massive spatial and spatiotemporal datasets across many fields. This has resulted in widespread deployment of sophisticated Geographic Information Systems (GIS) and related software, and the ability to investigate challenging inferential questions related to geographically-referenced data. See, for example, the books by Cressie (1993), Stein (1999), Moller and Waagepetersen (2003), Schabenberger and Gotway (2004), Gelfand et al. (2010), Cressie and Wikle (2011) and Banerjee et al. (2014) for a variety of statistical methods and applications.

This article will focus only on point-referenced data, which refers to data referenced by points with coordinates (latitude-longitude, Easting-Northing etc.). Modeling typically proceeds from a spatial or spatiotemporal process that introduces dependence among any finite collection of random variables from an underlying random field. For our purposes, we will consider the stochastic process as an uncountable set of random variables, say $\{w(\ell):\ell\in{\cal L}\}$ , over a domain of interest ${\cal L}$ , which is endowed with a probability law specifying the joint distribution for any finite sample from that set. For example, in spatial modeling ${\cal L}$ is often assumed to be a subset of points in the Euclidean space $\Re^{d}$ (usually $d=2$ or $3$ ) or, perhaps, a set of geographic coordinates over a sphere or ellipsoid. In spatiotemporal settings ${\cal L}={\cal S}\times{\cal T}$ , where ${\cal S}\subset\Re^{d}$ is the spatial region, ${\cal T}\subset[0,\infty)$ is the time domain and $\ell=(s,t)$ is a space-time coordinate with spatial location $s\in{\cal S}$ and time point $t\in{\cal T}$ (see, e.g., Gneiting and Guttorp, 2010, for details).

Such processes are specified with a covariance function $K_{\theta}(\ell,\ell^{\prime})$ that gives the covariance between $w(\ell)$ and $w(\ell^{\prime})$ for any two points $\ell$ and $\ell^{\prime}$ in ${\cal L}$. For any finite collection ${\cal U}=\{\ell_{1},\ell_{2},\ldots,\ell_{n}\}$ in ${\cal L}$, let $w_{\cal U}=(w(\ell_{1}),w(\ell_{2}),\ldots,w(\ell_{n}))^{\top}$ be the realizations of the process over ${\cal U}$. Also, for two finite sets ${\cal U}$ and ${\cal V}$ containing $n$ and $m$ points in ${\cal L}$, respectively, we define the $n\times m$ matrix $K_{\theta}({\cal U},{\cal V})=\mbox{Cov}(w_{\cal U},w_{\cal V}\,|\,\theta)$, where the covariances are evaluated using $K_{\theta}(\cdot,\cdot)$. When ${\cal U}$ or ${\cal V}$ contains a single point, $K_{\theta}({\cal U},{\cal V})$ is a row or column vector, respectively. A valid spatiotemporal covariance function ensures that $K_{\theta}({\cal U},{\cal U})$ is positive definite for any finite set ${\cal U}$. In geostatistics, we usually deal with a fixed set of points ${\cal U}$ and, if the context is clear, we write $K_{\theta}({\cal U},{\cal U})$ simply as $K_{\theta}$. A popular specification assumes $\{w(\ell):\ell\in{\cal L}\}$ is a zero-centered Gaussian process written as $w(\ell)\sim GP(0,K_{\theta}(\cdot,\cdot))$, which implies that the $n\times 1$ vector $w=(w(\ell_{1}),w(\ell_{2}),\ldots,w(\ell_{n}))^{\top}$ is distributed as $N(0,K_{\theta})$, where $K_{\theta}$ is the $n\times n$ covariance matrix with $(i,j)$-th element $K_{\theta}(\ell_{i},\ell_{j})$. Various characterizations and classes of valid spatial (and spatiotemporal) covariance functions can be found in Gneiting and Guttorp (2010), Cressie (1993), Stein (1999), Gelfand et al. (2010), Cressie and Wikle (2011) and Banerjee et al. (2014) and numerous references therein. The more common assumptions are of stationarity and isotropy. The former assumes that $K_{\theta}(\ell,\ell^{\prime})=K_{\theta}(\ell-\ell^{\prime})$ depends upon the coordinates only through their separation vector, while isotropy goes a step further and assumes the covariance is a function of the distance between them.
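As a concrete illustration of these definitions, the following minimal sketch builds $K_{\theta}({\cal U},{\cal U})$ for an isotropic exponential covariance and draws one realization of $w\sim N(0,K_{\theta})$. The exponential form, the parameter names `sigma2` and `phi`, and the simulated locations are assumptions made for illustration only; they are not prescribed by the text.

```python
import numpy as np

def exp_cov(U, V, sigma2=1.0, phi=5.0):
    """Isotropic exponential covariance sigma2 * exp(-phi * distance).

    U is (n, d) and V is (m, d); the result is the n x m matrix K(U, V).
    The exponential family is an illustrative assumption.
    """
    dists = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * dists)

rng = np.random.default_rng(0)
locs = rng.uniform(size=(200, 2))   # n = 200 simulated points in the unit square
K = exp_cov(locs, locs)             # K_theta(U, U), an n x n matrix

# A valid covariance function gives a symmetric positive definite matrix,
# so the Cholesky factorization K = L L^T exists.
L = np.linalg.cholesky(K)

# One draw of w ~ N(0, K_theta) at the chosen locations.
w = L @ rng.standard_normal(len(locs))
```

Because the covariance here depends on the two points only through their distance, this example is both stationary and isotropic.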

Spatial and spatiotemporal processes are conveniently embedded within Bayesian hierarchical models. The most common geostatistical setting assumes a response or dependent variable $y(\ell)$ observed at a generic point $\ell$ along with a $p\times 1$ ( $p<n$ ) vector of spatially referenced predictors $x(\ell)$ . Model-based geostatistical data analysis customarily envisions a spatial regression model,

$$y(\ell)=x^{\top}(\ell)\beta+w(\ell)+\epsilon(\ell)\;,\tag{1}$$

where $\beta$ is the $p\times 1$ vector of slopes, and the residual from the regression is the sum of a spatial or spatiotemporal process, $w(\ell)\sim GP(0,K_{\theta}(\cdot,\cdot))$ capturing spatial and/or temporal association, and an independent process, $\epsilon(\ell)$ modeling measurement error or fine scale variation attributed to disturbances at distances smaller than the minimum observed separations in space and time. A Bayesian spatial model can now be constructed from (1) as

$$p(\theta,\beta,\tau)\times N(w\,|\,0,K_{\theta})\times N(y\,|\,X\beta+w,D_{\tau})\;,\tag{2}$$

where $y=(y(\ell_{1}),y(\ell_{2}),\ldots,y(\ell_{n}))^{\top}$ is the $n\times 1$ vector of observed outcomes, $X$ is the $n\times p$ matrix of regressors with $i$-th row $x^{\top}(\ell_{i})$, and the noise covariance matrix $D_{\tau}$ represents measurement error or micro-scale variation and depends upon a set of variance parameters $\tau$. A common specification is $D_{\tau}=\tau^{2}I_{n}$, where $\tau^{2}$ is called the “nugget.” The hierarchy is completed by assigning prior distributions to $\beta$, $\theta$ and $\tau$.

Bayesian inference can proceed by sampling from the joint posterior density in (2) using, for example, Markov chain Monte Carlo (MCMC) methods (see, e.g., Robert and Casella, 2004). A major computational bottleneck emerges from the size of $K_{\theta}$ in computing (2). Since $\theta$ is unknown, each iteration of the model fitting algorithm will involve decomposing or factorizing $K_{\theta}$ , which typically requires $\sim n^{3}$ floating point operations (flops). Memory requirements are of the order $\sim n^{2}$ . These become prohibitive for large values of $n$ when $K_{\theta}$ has no exploitable structure. Evidently, multivariate process settings, where $y(\ell)$ is a $q\times 1$ vector of outcomes, exacerbate the computational burden by a factor of $q$ . For Gaussian likelihoods, one can integrate out the random effects $w$ from (2). This reduces the parameter space to $\{\tau^{2},\theta,\beta\}$ , but one still needs to work with $K_{\theta}+\tau^{2}I_{n}$ , which is again $n\times n$ . These settings are referred to as “big-n” or “high-dimensional” problems in geostatistics and are widely encountered in environmental sciences today.
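To make the bottleneck concrete, here is a minimal sketch of evaluating the Gaussian marginal likelihood $N(y\,|\,X\beta,K_{\theta}+\tau^{2}I_{n})$ obtained after integrating out $w$. The function name `gp_marginal_loglik`, the exponential covariance and the simulated data are illustrative assumptions. The single Cholesky factorization is the $\sim n^{3}$ step that must be repeated whenever $\theta$ or $\tau^{2}$ changes.

```python
import numpy as np

def gp_marginal_loglik(y, X, beta, K, tau2):
    """Evaluate log N(y | X beta, K + tau2 * I) with one Cholesky factorization."""
    n = len(y)
    L = np.linalg.cholesky(K + tau2 * np.eye(n))  # ~n^3 flops, ~n^2 storage
    # Solve L u = y - X beta. np.linalg.solve is a general solver; a dedicated
    # triangular solver would exploit the structure of L.
    u = np.linalg.solve(L, y - X @ beta)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))    # log det(K + tau2 * I)
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + u @ u)

# Simulated example (all values below are assumptions for illustration).
rng = np.random.default_rng(0)
n = 150
locs = rng.uniform(size=(n, 2))
dists = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
K = np.exp(-5.0 * dists)                          # assumed exponential covariance
X = np.column_stack([np.ones(n), locs[:, 0]])
beta = np.array([1.0, 0.5])
tau2 = 0.25
y = (X @ beta + np.linalg.cholesky(K) @ rng.standard_normal(n)
     + np.sqrt(tau2) * rng.standard_normal(n))

loglik = gp_marginal_loglik(y, X, beta, K, tau2)
```

Inside an MCMC sampler this evaluation would be called once per iteration, which is what makes large $n$ prohibitive without further structure in $K_{\theta}$.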

As modern data technologies are acquiring and exploiting massive amounts of spatiotemporal data, modeling and inference for large spatiotemporal datasets are receiving increased attention. In fact, it is impossible to provide a comprehensive review of all existing methods for geostatistical models for massive spatial data sets; Sun et al. (2011) offer an excellent review of a number of methods for high-dimensional geostatistics. The ideas at the core of fitting models for large spatial and spatiotemporal data concern effectively solving positive definite linear systems such as $Ax=b$, where $A$ is a covariance matrix. Thus one can use probability models to build computationally efficient covariance matrices. One approach is to approximate or model $A$ with a covariance structure that can significantly reduce the computational burden. An alternative is to model $A^{-1}$ itself with an exploitable structure so that the solution $A^{-1}b$ is available without computing the inverse. For full Bayesian inference, one also needs to ensure that the determinant of $A$ is available easily.

We remark that when inferring about stochastic processes, it is also possible to work in the spectral domain. This rich, and theoretically attractive, option has been advocated by Stein (1999) and Fuentes (2007) and completely avoids expensive matrix computations. The underlying idea is to transform to the space of frequencies, construct a periodogram (an estimate of the spectral density), and exploit the Whittle likelihood (see, e.g., Whittle, 1954; Guyon, 1995) in the spectral domain as an approximation to the data likelihood in the original domain. The Whittle likelihood requires no matrix inversion so, as a result, computation is very rapid. In principle, inversion back to the original space is straightforward. However, there are practical impediments. First, there is discretization to implement a fast Fourier transform whose performance can be tricky over large irregular domains. Predictive inference at arbitrary locations also will not be straightforward. Other issues include arbitrariness to the development of a periodogram. Empirical experience is employed to suggest how many low frequencies should be discarded. Also, there is concern regarding the performance of the Whittle likelihood as an approximation to the exact likelihood. While this approximation is reasonably well centered, it does an unsatisfactory job in the tails (thus leading to poor estimation of model variances). Lastly, modeling non-Gaussian first stages will entail unobservable random spatial effects, making the implementation impossible. In summary, use of the spectral domain with regard to handling large $n$ , while theoretically attractive, has limited applicability.

Broadly speaking, model-based approaches for large spatial datasets proceed from either exploiting “low-rank” models or exploiting “sparsity”. The former attempts to construct Gaussian processes on a lower-dimensional subspace (see, e.g., Wikle and Cressie, 1999; Higdon, 2002a; Kammann and Wand, 2003; Quiñonero-Candela and Rasmussen, 2005; Stein, 2007; Gramacy and Lee, 2008; Stein, 2008; Cressie and Johannesson, 2008; Banerjee et al., 2008; Crainiceanu et al., 2008; Sansó et al., 2008; Finley et al., 2009a; Lemos and Sansó, 2009; Cressie et al., 2010) in spatial, spatiotemporal and more general Gaussian process regression settings. Sparse approaches include covariance tapering (see, e.g., Furrer et al., 2006; Kaufman et al., 2008; Du et al., 2009; Shaby and Ruppert, 2012) using compactly supported covariance functions. This is effective for parameter estimation and interpolation of the response (“kriging”), but it has not been fully evaluated for fully Bayesian inference on residual or latent processes. Introducing sparsity in $K_{\theta}^{-1}$ is prevalent in approximating Gaussian process likelihoods using Markov random fields (e.g., Rue and Held, 2005), products of lower dimensional conditional distributions (Vecchia, 1988, 1992; Stein et al., 2004), or composite likelihoods (e.g., Bevilacqua and Gaetan, 2014; Eidsvik et al., 2014).

This article aims to provide a focused review of some massively scalable Bayesian hierarchical models for spatiotemporal data. The aim is not to provide a comprehensive review of all existing methods. Instead, we focus upon two fully model-based approaches that can be easily embedded within hierarchical models and deliver full Bayesian inference. These are low-rank processes and sparsity-inducing processes. Both these processes can be used as “priors” for spatiotemporal random fields. Here is a brief outline of the paper. Section 2 discusses a Bayesian hierarchical framework for low-rank models and their implementation. Section 3 discusses some recent developments in sparsity-inducing Gaussian processes, especially nearest-neighbor Gaussian processes, and their implementation. Finally, Section 4 provides a brief account of outstanding issues for future research.

2 Hierarchical low-rank models

A popular way of dealing with large spatial datasets is to devise models that bring about dimension reduction (Wikle and Cressie, 1999). A low rank or reduced rank specification is typically based upon a representation or approximation in terms of the realizations of some latent process over a smaller set of points, often referred to as knots. To be precise,

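$$w(\ell)\approx\tilde{w}(\ell)=\sum_{j=1}^{r}b_{\theta}(\ell,\ell_{j}^{*})z(\ell_{j}^{*})=b_{\theta}^{\top}(\ell)z\;,\tag{3}$$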

where $z(\ell)$ is a well-defined process and $b_{\theta}(\ell,\ell^{\prime})$ is a family of basis functions possibly depending upon some parameters $\theta$. The collection of $r$ locations $\{\ell_{1}^{*},\ell^{*}_{2},\ldots,\ell^{*}_{r}\}$ are the knots, and $b_{\theta}(\ell)$ and $z$ are $r\times 1$ vectors with components $b_{\theta}(\ell,\ell_{j}^{*})$ and $z(\ell_{j}^{*})$, respectively. For any collection of $n$ points, the $n\times 1$ vector $\tilde{w}=(\tilde{w}(\ell_{1}),\tilde{w}(\ell_{2}),\ldots,\tilde{w}(\ell_{n}))^{\top}$ is represented as $\tilde{w}=B_{\theta}z$, where $B_{\theta}$ is $n\times r$ with $(i,j)$-th element $b_{\theta}(\ell_{i},\ell_{j}^{*})$. Irrespective of how big $n$ is, we now have to work with the $r$ (instead of $n$) $z(\ell_{j}^{*})$’s and the $n\times r$ matrix $B_{\theta}$. Since we anticipate $r\ll n$, the consequential dimension reduction is evident and, since we will write the model in terms of the $z$’s (with the $\tilde{w}$’s being deterministic from the $z$’s, given $b_{\theta}(\cdot,\cdot)$), the associated matrices we work with will be $r\times r$. Evidently, $\tilde{w}(\ell)$ as defined in (3) spans only an $r$-dimensional space. When $n>r$, the joint distribution of ${\tilde{w}}$ is singular. However, we do create a valid stochastic process with covariance function

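$$\mbox{cov}(\tilde{w}(\ell),\tilde{w}(\ell^{\prime}))=b_{\theta}^{\top}(\ell)V_{z}b_{\theta}(\ell^{\prime})\;,\tag{4}$$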

where $V_{z}$ is the variance-covariance matrix (also depends upon parameter $\theta$ ) for $z$ . From (4), we see that, even if $b_{\theta}(\cdot,\cdot)$ is stationary, the induced covariance function is not. If the $z$ ’s are Gaussian, then $\tilde{w}(\ell)$ is a Gaussian process. Every choice of basis functions yields a process and there are too many choices to enumerate here. Wikle (2010) offers an excellent overview of low rank models.
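The two properties noted above (singularity of the joint distribution when $n>r$, and nonstationarity of the induced covariance) can be checked numerically. The sketch below assumes an exponential kernel basis $b_{\theta}(\ell,\ell_{j}^{*})=\exp(-\phi\|\ell-\ell_{j}^{*}\|)$, a regular grid of knots and $V_{z}=I_{r}$; these choices, and all variable names, are illustrative assumptions rather than the specification used in the article.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, phi = 300, 25, 3.0
locs = rng.uniform(size=(n, 2))                    # n observed locations
g = np.linspace(0.1, 0.9, 5)
knots = np.array([(a, b) for a in g for b in g])   # r = 25 knots on a 5 x 5 grid

# n x r matrix B_theta with (i, j)-th element b_theta(l_i, l_j^*).
dist = np.linalg.norm(locs[:, None, :] - knots[None, :, :], axis=-1)
B = np.exp(-phi * dist)

V_z = np.eye(r)                   # assumed white-noise z with unit variance
K_tilde = B @ V_z @ B.T           # n x n covariance matrix of w_tilde = B z

# Numerical rank: count eigenvalues that are not negligible.
eigs = np.linalg.eigvalsh(K_tilde)
rank = int(np.sum(eigs > 1e-8 * eigs.max()))

# Marginal variances var{w_tilde(l_i)} differ across locations,
# so the induced covariance function is not stationary.
variances = np.diag(K_tilde)
```

Here `rank` equals $r$ even though `K_tilde` is $n\times n$, and `variances` is larger for locations surrounded by knots than for locations near the boundary of the domain.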

Different families of spatial models emerge from different specifications for the process $z(\ell)$ and the basis functions $b_{\theta}(\ell,\ell^{\prime})$. In fact, (3) can be used to construct classes of rich and flexible processes. Furthermore, such constructions need not be restricted to low rank models. If dimension reduction is not a concern, then full rank models can be constructed by taking $r=n$ basis functions in (3). A very popular specification for $z(\ell)$ is a white noise process so that $z\sim N(0,\sigma^{2}I_{r})$, whereupon (4) simplifies to $\sigma^{2}b_{\theta}(\ell)^{\top}b_{\theta}(\ell^{\prime})$. A natural choice for the basis functions is a kernel function, say $b_{\theta}(\ell,\ell^{\prime})=K_{\theta}(\ell-\ell^{\prime})$, which puts more weight on $\ell^{\prime}$ near $\ell$. Variants of this form have been called “moving average” models and explored by Barry and Ver Hoef (1996), while the term “kernel convolution” has been used in a series of papers by Higdon and collaborators (Higdon, 1998; Higdon et al., 1999; Higdon, 2002b) to not only achieve dimension reduction, but also model nonstationary and multivariate spatial processes. The kernel (which induces a parametric covariance function) can depend upon parameters and might even be spatially varying (Higdon, 2002b; Paciorek and Schervish, 2006). Sansó et al. (2008) use discrete kernel convolutions of independent processes to construct two different classes of computationally efficient spatiotemporal processes.

Some choices of basis functions can be more computationally efficient than others depending upon the specific application. For example, Cressie and Johannesson (2008) (also see Shi and Cressie (2007)) discuss “Fixed Rank Kriging” (FRK) by constructing $B_{\theta}$ using very flexible families of non-stationary covariance functions to carry out high-dimensional kriging, Cressie et al. (2010) extend FRK to spatiotemporal settings calling the procedure “Fixed Rank Filtering” (FRF), Katzfuss and Cressie (2012) provide efficient constructions for $B_{\theta}$ for massive spatiotemporal datasets, and Katzfuss (2013) uses spatial basis functions to capture medium to long range dependence and tapers the residual $w(\ell)-\tilde{w}(\ell)$ to capture fine scale dependence. Multiresolution basis functions (see, e.g., Nychka et al., 2002, 2015) have been shown to be effective in building computationally efficient nonstationary models. These papers amply demonstrate the versatility of low-rank approaches using different basis functions.

A different approach is to specify the $z(\ell)$ as a spatial process model having a selected covariance function. This process is called the parent process and one can derive a low-rank process $\tilde{w}(\ell)$ from the parent process. For example, one could use the Karhunen-Loeve (infinite) basis expansion for a Gaussian process (see, e.g., Rasmussen and Williams, 2005; Banerjee et al., 2014) and truncate it to a finite number of terms to obtain a low-rank process. Another example is to project the realizations of the parent process onto a lower-dimensional subspace, which yields the predictive process and its variants; see Section 2.2 for details.

The idea underlying low-rank dimension reduction is not dissimilar to Bayesian linear regression. For example, consider a simplified version of the hierarchical model in (2), where $\beta=0$ and the process parameters $\{\theta,\tau\}$ are fixed. A low rank version of (2) is obtained by replacing $w$ with $B_{\theta}z$ , so the joint distribution is

$$N(z\,|\,0,V_{z})\times N(y\,|\,B_{\theta}z,D_{\tau})\;,\tag{5}$$

where $y$ is $n\times 1$ , $z$ is $r\times 1$ , $D_{\tau}$ and $V_{z}$ are positive definite matrices of sizes $n\times n$ and $r\times r$ , respectively, and $B_{\theta}$ is $n\times r$ . The low rank specification is accommodated using $B_{\theta}z$ and the prior on $z$ , while $D_{\tau}$ (usually diagonal) has the residual variance components. By computing the marginal covariance matrix $\mbox{var}\{y\}$ in two ways (Lindley and Smith, 1972), one arrives at the well-known Sherman-Woodbury-Morrison formula

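$$(D_{\tau}+B_{\theta}V_{z}B_{\theta}^{\top})^{-1}=D_{\tau}^{-1}-D_{\tau}^{-1}B_{\theta}(V_{z}^{-1}+B_{\theta}^{\top}D_{\tau}^{-1}B_{\theta})^{-1}B_{\theta}^{\top}D_{\tau}^{-1}\;.\tag{6}$$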

The above formula reveals dimension reduction in terms of the marginal covariance matrix for $y$ . If $D_{\tau}$ is easily invertible (e.g., diagonal), then the inverse of an $n\times n$ covariance matrix of the form $D_{\tau}+B_{\theta}V_{z}B_{\theta}^{\top}$ can be computed efficiently using the right-hand-side which only involves inverses of $r\times r$ matrices and $D_{\tau}^{-1}$ . A companion formula for (6) is that for the determinant,

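$$\det(D_{\tau}+B_{\theta}V_{z}B_{\theta}^{\top})=\det(V_{z})\det(D_{\tau})\det(V_{z}^{-1}+B_{\theta}^{\top}D_{\tau}^{-1}B_{\theta})\;,\tag{7}$$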

which shows that the determinant of the $n\times n$ matrix can be computed as a product of the determinants of two $r\times r$ matrices and that of $D_{\tau}$ .
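The identities (6) and (7) can be verified numerically on a small example. The sketch below uses randomly generated $B_{\theta}$, $V_{z}$ and a diagonal $D_{\tau}$ (all simulated, with illustrative variable names) and compares the low-rank computations, which only touch $r\times r$ matrices and the diagonal of $D_{\tau}$, against brute-force $n\times n$ computations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 400, 10
B = rng.standard_normal((n, r))            # stands in for B_theta
A = rng.standard_normal((r, r))
V_z = A @ A.T + r * np.eye(r)              # r x r positive definite matrix
d = rng.uniform(0.5, 2.0, size=n)          # diagonal elements of D_tau

# Right-hand side of (6): only D_tau^{-1} (diagonal) and r x r systems.
Dinv_B = B / d[:, None]                    # D_tau^{-1} B_theta
M = np.linalg.inv(V_z) + B.T @ Dinv_B      # V_z^{-1} + B^T D_tau^{-1} B, r x r
woodbury_inv = np.diag(1.0 / d) - Dinv_B @ np.linalg.solve(M, Dinv_B.T)

# Log-determinant via (7): two r x r determinants and that of D_tau.
logdet_lowrank = (np.linalg.slogdet(V_z)[1] + np.sum(np.log(d))
                  + np.linalg.slogdet(M)[1])

# Brute-force n x n computations, for comparison only.
Sigma = np.diag(d) + B @ V_z @ B.T
direct_inv = np.linalg.inv(Sigma)
logdet_direct = np.linalg.slogdet(Sigma)[1]
```

In practice one would not form the full $n\times n$ inverse at all; it is built here only so that the two sides of (6) can be compared entry by entry.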

In practical Bayesian computations, however, it is less efficient to directly use the formulas in (6) and (7). Since both the inverse and the determinant are needed, it is more useful to compute the Cholesky decomposition of the covariance matrix. In fact, one can avoid (6) completely and resort to a common trick in hierarchical models (see, e.g., Gelman et al., 2013) and smoothed ANOVA (Hodges, 2013) that expresses (5) as the linear model

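$$\underbrace{\begin{bmatrix}D_{\tau}^{-1/2}y\\ 0\end{bmatrix}}_{y_{\ast}}=\underbrace{\begin{bmatrix}D_{\tau}^{-1/2}B_{\theta}\\ V_{z}^{-1/2}\end{bmatrix}}_{B_{\ast}}z+\underbrace{\begin{bmatrix}e_{1}\\ e_{2}\end{bmatrix}}_{e_{\ast}}\;,\quad\mbox{where }e_{\ast}\sim N(0,I_{n+r})\;,$$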

and $V_{z}^{1/2}$ and $D_{\tau}^{1/2}$ are matrix square roots of $V_{z}$ and $D_{\tau}$, respectively. For example, in practice $D_{\tau}$ is diagonal so $D_{\tau}^{1/2}$ is simply the square root of the diagonal elements of $D_{\tau}$, while $V_{z}^{1/2}$ is the triangular (upper or lower) Cholesky factor of the $r\times r$ matrix $V_{z}$. The marginal density $p(y_{\ast}\,|\,\theta,\tau)$ after integrating out $z$ now corresponds to the linear model $y_{\ast}=B_{\ast}\hat{z}+e_{\ast}$, where $\hat{z}$ is the ordinary least-squares estimate of $z$. Such computations are easily conducted in statistical programming environments such as R by applying the chol function to obtain the Cholesky factor $V_{z}^{1/2}$, a backsolve function to efficiently obtain $V_{z}^{-1/2}z$ in constructing (10), and an lm function to compute the least squares estimate of $z$ using the QR decomposition of the design matrix $B_{\ast}$. We discuss implementation of low rank hierarchical models in more general contexts in Section 2.3.
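The augmented linear model can be sketched as follows. The text describes the computation with R's chol, backsolve and lm; the version below is a NumPy analogue on simulated inputs, not the article's implementation, and all variable names are assumptions. It takes $V_{z}^{-1/2}$ to be the inverse of the lower Cholesky factor of $V_{z}$ and checks that least squares on $(y_{\ast},B_{\ast})$ reproduces the solution of the $r\times r$ system $(V_{z}^{-1}+B_{\theta}^{\top}D_{\tau}^{-1}B_{\theta})z=B_{\theta}^{\top}D_{\tau}^{-1}y$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 200, 8
B = rng.standard_normal((n, r))                # stands in for B_theta
A = rng.standard_normal((r, r))
V_z = A @ A.T + r * np.eye(r)                  # r x r positive definite matrix
d = rng.uniform(0.5, 2.0, size=n)              # diagonal elements of D_tau
y = rng.standard_normal(n)

# V_z = L_z L_z^T, so L_z^{-1} satisfies (L_z^{-1})^T L_z^{-1} = V_z^{-1}.
L_z = np.linalg.cholesky(V_z)
V_inv_half = np.linalg.solve(L_z, np.eye(r))

# Stack the data rows and the r "prior" rows of the augmented model.
y_star = np.concatenate([y / np.sqrt(d), np.zeros(r)])
B_star = np.vstack([B / np.sqrt(d)[:, None], V_inv_half])

# Ordinary least squares on the augmented system.
z_hat, *_ = np.linalg.lstsq(B_star, y_star, rcond=None)

# The same estimate from the r x r normal equations.
M = np.linalg.inv(V_z) + B.T @ (B / d[:, None])
z_direct = np.linalg.solve(M, B.T @ (y / d))
```

The least-squares route avoids forming $V_{z}^{-1}$ explicitly, which is one reason it tends to be preferred numerically over working with the normal equations directly.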

2.1 Biases in low-rank models

Irrespective of the precise specifications, low-rank models tend to underestimate uncertainty (since they are driven by a finite number of random variables), hence, overestimate the residual variance (i.e., the nugget). Put differently, this arises from systemic over-smoothing or model under-specification by the low-rank model when compared to the parent model. For example, if $w(\ell)=\tilde{w}(\ell)+\eta(\ell)$ , where $w(\ell)$ is the parent process and $\tilde{w}(\ell)$ is a low-rank approximation, then ignoring the residual $\eta(\ell)=w(\ell)-\tilde{w}(\ell)$ can result in loss of uncertainty and oversmoothing. In settings where the spatial signal is weak compared to the noise, such biases will be less pronounced. Also, it is conceivable that in certain specific case studies proper choices of basis functions (e.g., multiresolution basis functions) will be able to capture much of the spatial behavior and the effect of the bias will be mitigated. However, in general it will be preferable to develop models that will be able to compensate for the overestimation of the nugget.

This phenomenon, in fact, is not dissimilar to what is seen in linear regression models and is especially transparent from writing the parent likelihood and low-rank likelihood as mixed linear models. To elucidate, suppose, without much loss of generality, that ${\cal U}$ is a set with $n$ points of which the first $r$ act as the knots. Let us write the Gaussian likelihood with the parent process as $N(y\,|\,Bu,\tau^{2}I)$ , where $B$ is the $n\times n$ lower-triangular Cholesky factor of $K_{\theta}$ ( $B=B_{\theta}$ depends on $\theta$ , but we suppress this here) and $u=(u_{1},u_{2},\ldots,u_{n})^{\top}$ is now an $n\times 1$ vector such that $u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1)$ . Writing $B=[B_{1}:B_{2}]$ , where $B_{1}$ has $r<n$ columns, suppose we derive a low-rank model by truncating to only the first $r$ basis functions. The corresponding likelihood is $N(y\,|\,B_{1}\tilde{u}_{1},\tau^{2}I)$ , where $\tilde{u}_{1}$ is an $r\times 1$ vector whose components are independently and identically distributed $N(0,1)$ variables. Customary linear model calculations reveal that the magnitude of the residual vector from the parent model is given by $y^{\top}(I-P_{B})y$ , while that from the low-rank model is given by $y^{\top}(I-P_{B_{1}})y$ , where $P_{A}$ denotes the orthogonal projector matrix onto the column space of any matrix $A$ . Using the fact that $P_{B}=P_{B_{1}}+P_{[(I-P_{B_{1}})B_{2}]}$ , which is a standard result in linear model theory, we find the excess residual variability in the low-rank likelihood is summarized by $y^{\top}P_{[(I-P_{B_{1}})B_{2}]}y$ which can be substantial when $r$ is much smaller than $n$ .
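The projector decomposition can be checked numerically with a short NumPy sketch. As an assumption made purely for illustration, $B$ is taken here to be a tall random matrix with $k<n$ columns (rather than the $n\times n$ Cholesky factor), so that the parent residual is not identically zero. The helper `projector` is a hypothetical name.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto the column space of A (full column rank assumed)."""
    Q, _ = np.linalg.qr(A)          # reduced QR: columns of Q span col(A)
    return Q @ Q.T

rng = np.random.default_rng(2)
n, k, r = 30, 12, 4                 # B has k columns; the first r are retained
B = rng.normal(size=(n, k))
B1, B2 = B[:, :r], B[:, r:]
y = rng.normal(size=n)

P_B = projector(B)
P_B1 = projector(B1)
P_extra = projector((np.eye(n) - P_B1) @ B2)   # projector onto (I - P_B1) B2

rss_parent = y @ (np.eye(n) - P_B) @ y         # residual magnitude, full basis
rss_lowrank = y @ (np.eye(n) - P_B1) @ y       # residual magnitude, truncated basis
excess = y @ P_extra @ y                       # excess residual variability
```

The difference `rss_lowrank - rss_parent` equals `excess`, which is nonnegative and grows as more columns are moved from $B_{1}$ to $B_{2}$.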

In practical data analysis, the above phenomenon is usually manifested by an overestimation of the nugget variance as it absorbs the residual variation from the low-rank approximation. Consider the following simple experiment. We simulated a spatial dataset using the spatial regression model in (1) with $n=200$ fixed spatial locations, say $\{\ell_{1},\ell_{2},\ldots,\ell_{n}\}$, within the unit square, and setting $\beta=0$, $\tau^{2}=5$, $w(\ell)\sim GP(0,K_{\theta})$, where $K_{\theta}(\ell_{i},\ell_{j})=\sigma^{2}\exp(-\phi\|\ell_{i}-\ell_{j}\|)$ with $\sigma^{2}=5$ and $\phi=9$. We then fit the low-rank model (5) with $D=\tau^{2}I_{n\times n}$, $V=I_{r\times r}$, and $B$ as the $n\times r$ matrix with $i$-th row $b^{\top}(\ell_{i})=K_{\theta}(\ell_{i},{\cal U}^{*})K^{-1/2}_{\theta}({\cal U}^{*},{\cal U}^{*})$, where ${\cal U}^{*}=\{\ell^{*}_{1},\ell_{2}^{*},\ldots,\ell_{r}^{*}\}$ is a set of $r$ knots, $K_{\theta}(\ell_{i},{\cal U}^{*})$ is the $1\times r$ vector with $j$-th element $K_{\theta}(\ell_{i},\ell_{j}^{*})$ and $K^{-1/2}_{\theta}({\cal U}^{*},{\cal U}^{*})$ is the inverse of the lower-triangular Cholesky factor of the $r\times r$ matrix with elements $K_{\theta}(\ell_{i}^{*},\ell_{j}^{*})$. This emerges from using low-rank radial basis functions in (3) (see, e.g., Ruppert and Carroll, 2003). We fit $40$ such models, increasing $r$ from $5$ to $200$ in steps of $5$. Figure 1 presents the 95% posterior credible intervals for $\tau^{2}$. Even with $r=175$ knots for a dataset with just $200$ spatial locations, the estimate of the nugget was significantly different from the true value of the parameter. This indicates that low-rank processes may be unable to accurately estimate the nugget from the true process. Also, they will likely produce oversmoothed interpolated maps of the underlying spatial process and impair predictive performance. As one specific example, Table 4 in Banerjee et al. (2008) reports less than optimal posterior predictive coverage from a predictive process model (see Section 2.2) with over 500 knots for a dataset comprising 15,000 locations.

Although this excess residual variability can be quantified as above (for any given value of the covariance parameters $\theta$ ), it is less clear how the low-rank likelihood could be modified to compensate for this oversmoothing without adding significantly to the computational burden. Matters are complicated by the fact that expressions for the excess variability will involve the unknown process parameters $\theta$ , which must be estimated. In fact, not all low-rank models deliver a straightforward quantification for this bias. For instance, low-rank models based upon kernel convolutions approximate $w(\ell)$ with $w_{KC}(\ell)=\sum_{j=1}^{n^{\ast}}K_{\theta}(\ell-\ell_{j}^{\ast},\theta)u_{j}$ , where $K_{\theta}(\cdot)$ is some kernel function and $u_{j}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1)$ , assumed to arise from a Brownian motion $U(\omega)$ on $\Re^{2}$ . The difference $w(\ell)-w_{KC}(\ell)$ does not, in general, render a closed form and may be difficult to approximate efficiently.

2.2 Predictive process models and variants

One particular class of low-rank processes has been especially useful in providing easy tractability to the residual process. Let $w(\ell)\sim GP(0,K_{\theta}(\cdot,\cdot))$ and let $w^{*}$ be the $r\times 1$ vector of $w(\ell_{j}^{*})$’s over a set ${\cal U}^{*}$ of $r$ knots. The usual spatial interpolant (that leads to “kriging”) at an arbitrary site $\ell$ is

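$$\tilde{w}(\ell)=\mbox{E}[w(\ell)\,|\,w^{*}]=K_{\theta}(\ell,{\cal U}^{*})K_{\theta}^{-1}({\cal U}^{*},{\cal U}^{*})w^{*}.\tag{11}$$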

This single site interpolator, in fact, is a well-defined process $\tilde{w}(\ell)\sim GP(0,\tilde{K}_{\theta}(\cdot,\cdot))$ with covariance function $\tilde{K}_{\theta}(\ell,\ell^{\prime})=K_{\theta}(\ell,{\cal U}^{*})K^{-1}_{\theta}({\cal U}^{*},{\cal U}^{*})K_{\theta}({\cal U}^{*},\ell^{\prime})$. We refer to $\tilde{w}(\ell)$ as the predictive process derived from the parent process $w(\ell)$. The realizations of $\tilde{w}(\ell)$ are precisely the kriged predictions conditional upon a realization of $w(\ell)$ over ${\cal U}^{\ast}$. The process is completely specified given the covariance function of the parent process and the set of knots, ${\cal U}^{\ast}$. The corresponding basis functions in (3) are given by $b^{\top}_{\theta}(\ell)=K_{\theta}(\ell,{\cal U}^{*})K_{\theta}^{-1}({\cal U}^{*},{\cal U}^{*})$. These methods are referred to as subset of regressors in Gaussian process regressions for large data sets in machine learning (Quiñonero-Candela and Rasmussen, 2005; Rasmussen and Williams, 2005). Banerjee et al. (2008) coined the term predictive process (as the process could be derived from kriging equations) and developed classes of scalable Bayesian hierarchical spatial process models by replacing the parent process with its predictive process counterpart. An alternate derivation is available by truncating the Karhunen-Loève (infinite) basis expansion for a Gaussian process to a finite number of terms and solving (approximately) the integral eigen-system equation for $K_{\theta}(\ell,\ell^{\prime})$ by an approximate linear system over the set of knots (see, e.g., Rasmussen and Williams, 2005; Sang and Huang, 2012; Banerjee et al., 2014).

Exploiting elementary properties of conditional expectations, we obtain

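$$\mbox{var}\{w(\ell)\}=\mbox{var}\{\mbox{E}[w(\ell)\,|\,w^{*}]\}+\mbox{E}\{\mbox{var}[w(\ell)\,|\,w^{*}]\}\geq\mbox{var}\{\mbox{E}[w(\ell)\,|\,w^{*}]\},\tag{12}$$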

which implies that $\mbox{var}\{w(\ell)\}\geq\mbox{var}\{\tilde{w}(\ell)\}$ and the variance of $\eta(\ell)=w(\ell)-\tilde{w}(\ell)$ is simply the difference of the variances. For Gaussian processes, we get the following closed form for $\mbox{Cov}\{\eta(\ell),\eta(\ell^{\prime})\}$ ,

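$$K_{\eta,\theta}(\ell,\ell^{\prime})=K_{\theta}(\ell,\ell^{\prime})-K_{\theta}(\ell,{\cal U}^{*})K_{\theta}^{-1}({\cal U}^{*},{\cal U}^{*})K_{\theta}({\cal U}^{*},\ell^{\prime}).\tag{13}$$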

Therefore, $\mbox{var}\{\eta(\ell)\}=K_{\eta,\theta}(\ell,\ell)$ , which we denote as $\delta^{2}(\ell)$ .

Perhaps the simplest way to remedy the bias in the predictive process is to approximate the residual process $\eta(\ell)$ with a heteroskedastic process $\tilde{\epsilon}(\ell)\stackrel{{\scriptstyle ind}}{{\sim}}N(0,\delta^{2}(\ell))$ . We construct a modified or bias-adjusted predictive process as

$$\tilde{w}_{\epsilon}(\ell)=\tilde{w}(\ell)+\tilde{\epsilon}(\ell),\tag{14}$$

where $\tilde{\epsilon}(\ell)$ is independent of $\tilde{w}(\ell)$. It is easy to see that $\mbox{var}\{\tilde{w}_{\epsilon}(\ell)\}=\mbox{var}\{w(\ell)\}$, so the variances of the two processes are the same. Also, the remedy is computationally efficient: adding an independent space-varying nugget does not incur substantial computational expense. Finley et al. (2009b) offer computational details for the modified predictive process, while Banerjee et al. (2010) show the effectiveness of the bias adjustment in mitigating the effect exhibited in Figure 1 and in estimating multiple variance components in the presence of different structured random effects.
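To make the residual variance and the bias adjustment concrete, the following NumPy sketch computes the predictive process covariance, the residual covariance and $\delta^{2}(\ell)$ for a simulated configuration. The exponential covariance, its parameter values, the knot grid and the function name `exp_cov` are assumptions chosen only for illustration.

```python
import numpy as np

def exp_cov(s, t, sigma2=1.0, phi=3.0):
    """Exponential covariance sigma2 * exp(-phi * distance) between two point sets."""
    dist = np.linalg.norm(s[:, None, :] - t[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * dist)

rng = np.random.default_rng(3)
locs = rng.uniform(size=(100, 2))                  # locations in the unit square
g = np.linspace(0.1, 0.9, 5)
knots = np.array([(a, b) for a in g for b in g])   # 25 knots on a regular grid

K = exp_cov(locs, locs)                            # parent covariance
K_lk = exp_cov(locs, knots)                        # locations-by-knots block
K_kk = exp_cov(knots, knots)                       # knots-by-knots block

K_tilde = K_lk @ np.linalg.solve(K_kk, K_lk.T)     # predictive process covariance
K_eta = K - K_tilde                                # residual covariance
delta2 = np.diag(K_eta)                            # delta^2 at each location

# Modified predictive process: add the independent heteroskedastic term.
K_mpp = K_tilde + np.diag(delta2)
```

The diagonal of `K_mpp` matches that of `K`, which is the variance-matching property of the modified predictive process; the off-diagonal entries of `K_eta` are discarded, so covariance information in the residual is still lost.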

We present a brief simulation example revealing the benefits of the modified predictive process. We generate 2000 locations within a $[0,100]\times[0,100]$ square and then generate the outcomes from (1) using only an intercept as the regressor, an exponential covariance function with range parameter $\phi=0.06$ (i.e., such that the spatial correlation is $\sim 0.05$ at 50 distance units), scale $\sigma^{2}=1$ for the spatial process, and with nugget variance $\tau^{2}=1$ . We then fit the predictive process and modified predictive process models derived from (1) using a hold out set of randomly selected sites, along with a separate set of regular lattices for the knots ( $m=49$ , $144$ and $900$ ). Table 1 shows the posterior estimates and the square roots of MSPE based on the predictions for the hold-out data. The overestimation of $\tau^{2}$ by the predictive process is apparent and we also see how the modified predictive process is able to adjust for the $\tau^{2}$ . Not surprisingly, the RMSPE is essentially the same under either process model.

Further enhancements to the modified predictive process are possible. Since the modified predictive process adjusts only the variance, information in the covariance induced by the residual process $\eta(\ell)$ is lost. One alternative is to use the so-called “full scale approximation” proposed by Sang et al. (2011) and Sang and Huang (2012), where $\eta(\ell)$ is approximated by a tapered process, say $\eta_{tap}(\ell)$. The covariance function for $\eta_{tap}(\ell)$ is of the form $K_{\eta,\theta}(\ell,\ell^{\prime})K_{tap,\nu}(\|\ell-\ell^{\prime}\|)$, where $K_{\eta,\theta}(\ell,\ell^{\prime})$ is as in (13) and $K_{tap,\nu}(\|\ell-\ell^{\prime}\|)$ is a compactly supported covariance function that equals $0$ beyond a distance $\nu$ (see, e.g., Furrer et al., 2006, for some practical choices). This full scale approximation is also able to more effectively capture small scale dependence. Katzfuss (2013) extended some of these ideas by modeling the spatial error as a combination of a low-rank component designed to capture medium to long-range dependence and a tapered component to capture local dependence.

Perhaps the most promising use of the predictive process, at least in terms of scalability to massive spatial datasets, is the recent multiresolution approximation proposed by Katzfuss (2017). Instead of approximating the residual process $\eta(\ell)$ in one step, the idea here is to partition the spatial domain recursively and construct a sequence of approximations. We start by partitioning the domain of interest ${\cal L}$ into $J$ non-intersecting subregions, say ${\cal L}_{1},{\cal L}_{2},\ldots,{\cal L}_{J}$ , such that ${\cal L}=\cup_{j=1}^{J}{\cal L}_{j}$ . We call the ${\cal L}_{j}$ ’s level-1 subregions. We fix a set of knots in ${\cal L}$ and write the parent process as $w(\ell)=\tilde{w}(\ell)+\eta(\ell)$ , where $\tilde{w}(\ell)$ is the predictive process as in (11) and $\eta(\ell)$ is the residual Gaussian process with covariance function given by (13). At resolution 1, we replace $\eta(\ell)$ with a block-independent process $\eta_{1}(\ell)$ such that $\mbox{Cov}\{\eta_{1}(\ell),\eta_{1}(\ell^{\prime})\}=0$ if $\ell$ and $\ell^{\prime}$ are not in the same subregion and is equal to (13) if $\ell$ and $\ell^{\prime}$ are in the same subregion.

At the second resolution, each ${\cal L}_{j}$ is partitioned into a set of disjoint subregions ${\cal L}_{j1},{\cal L}_{j2},\ldots,{\cal L}_{jm}$. We call these the level-2 subregions and choose a set of knots within each. We approximate $\eta_{1}(\ell)\approx\tilde{\eta_{1}}(\ell)+\eta_{2}(\ell)$, where $\tilde{\eta_{1}}(\ell)$ is the predictive process derived from $\eta_{1}(\ell)$ using the knots in ${\cal L}_{j}$ if $\ell\in{\cal L}_{j}$ and $\eta_{2}(\ell)$ is the analogous block-independent approximation across the subregions within each ${\cal L}_{j}$. Thus, $\mbox{Cov}\{\eta_{2}(\ell),\eta_{2}(\ell^{\prime})\}=0$ if $\ell$ and $\ell^{\prime}$ are not in the same level-2 subregion and will equal $\mbox{Cov}\{\eta_{1}(\ell),\eta_{1}(\ell^{\prime})\}$ when $\ell$ and $\ell^{\prime}$ are in the same level-2 subregion. At resolution 3 we partition each of the level-2 subregions into level-3 subregions and continue the approximation of the residual process from the predictive process. At the end of $M$ resolutions, we arrive at the multi-resolution predictive process $\tilde{w}_{M}(\ell)=\tilde{w}(\ell)+\sum_{i=1}^{M-1}\tilde{\eta}_{i}(\ell)+\eta_{M}(\ell)$, which, by construction, is a valid Gaussian process. The computational complexity with the multi-resolution predictive process is $\sim O(nM^{2}r^{2})$, where $M$ is the number of resolutions and $r$ is the number of knots chosen within each subregion.

To summarize, we do not recommend the use of just a reduced/low rank model. To improve performance, it is necessary to approximate the residual process and, in this regard, the predictive process is especially attractive since the residual process is available explicitly.

2.3 Bayesian implementation for low-rank models

A very rich and flexible class of spatial and spatiotemporal models emerges from the hierarchical linear mixed model

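$$p(\theta)\times p(\tau)\times N(\beta\,|\,\mu_{\beta},V_{\beta})\times N(z\,|\,0,V_{z,\theta})\times N(y\,|\,X\beta+B_{\theta}z,D_{\tau}),\tag{15}$$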

where $y$ is an $n\times 1$ vector of possibly irregularly located observations, $X$ is a known $n\times p$ matrix of regressors ($p<n$), $V_{z,\theta}$ and $D_{\tau}$ are families of $r\times r$ and $n\times n$ covariance matrices depending on unknown process parameters $\theta$ and $\tau$, respectively, and $B_{\theta}$ is $n\times r$ with $r\leq n$. The low-rank models in (3) emerge when $r\ll n$ and $B_{\theta}$ is the matrix obtained by evaluating the basis functions. Proper prior distributions $p(\theta)$ and $p(\tau)$ for $\theta$ and $\tau$, respectively, complete the hierarchical specification.

Bayesian inference proceeds, customarily, by sampling $\{\beta,z,\theta,\tau\}$ from (15) using Markov chain Monte Carlo (MCMC) methods. For faster convergence, we integrate out $z$ from the model and first sample from $\displaystyle p(\theta,\tau,\beta\,|\,y)\propto p(\theta)\times p(\tau)\times N(\beta\,|\,\mu_{\beta},V_{\beta})\times N(y\,|\,X\beta,\Sigma_{y\,|\,\theta,\tau})$, where $\Sigma_{y\,|\,\theta,\tau}=B_{\theta}V_{z,\theta}B_{\theta}^{\top}+D_{\tau}$. Working directly with $\Sigma_{y\,|\,\theta,\tau}$ will be expensive. Usually $D_{\tau}$ is diagonal or sparse, so the expense is incurred from the matrix $B_{\theta}V_{z,\theta}B_{\theta}^{\top}$. Assuming that $B_{\theta}$ and $V_{z,\theta}$ are computationally inexpensive to construct for each $\theta$ and $\tau$, forming $B_{\theta}V_{z,\theta}B_{\theta}^{\top}$ requires $\sim O(rn^{2})$ flops. Using the Sherman-Woodbury-Morrison formula in (6) will avoid constructing $B_{\theta}V_{z,\theta}B_{\theta}^{\top}$ or inverting any $n\times n$ matrix. However, in practice it is better not to compute the right-hand side of (6) directly as it involves some redundant matrix multiplications. Furthermore, we wish to obtain the determinant of $\Sigma_{y\,|\,\theta,\tau}$ cheaply. These are efficiently accomplished as outlined below.

The primary computational bottleneck lies in evaluating the multivariate Gaussian likelihood $N(y\,|\,X\beta,\Sigma_{y\,|\,\theta,\tau})$ which is required for updating the parameters $\{\theta,\tau\}$ (e.g., using random-walk Metropolis or Hamiltonian Monte Carlo steps). We can accomplish this effectively using two functions: $L=\texttt{chol}(V)$ which computes the Cholesky factorization for any positive definite matrix $V=LL^{\top}$ , where $L$ is lower-triangular, and $W=\texttt{trsolve}(T,B)$ which solves the triangular system $TW=B$ for a triangular (lower or upper) matrix $T$ . We first compute

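$$\Sigma_{y\,|\,\theta,\tau}^{-1}=\left(B_{\theta}V_{z,\theta}B_{\theta}^{\top}+D_{\tau}\right)^{-1}=D_{\tau}^{-1/2}\left(I-H^{\top}H\right)D_{\tau}^{-1/2},\tag{16}$$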

where $H$ is obtained by first computing $W=D^{-1/2}B_{\theta}$ , then the Cholesky factorization $L=\texttt{chol}(V_{z,\theta}^{-1}+W^{\top}W)$ , and finally solve the triangular system $H=\texttt{trsolve}(L,W^{\top})$ . Having obtained $H$ , we compute $e=y-X\beta$ , $m_{1}=D^{-1/2}e$ , $m_{2}=Hm_{1}$ , and obtain $T=\texttt{chol}(I_{r}-HH^{\top})$ . The log-target density for $\{\theta,\tau\}$ is then computed as

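$$\log p(\theta)+\log p(\tau)-\frac{1}{2}\sum_{i=1}^{n}\log d_{ii}+\sum_{i=1}^{r}\log t_{ii}-\frac{1}{2}\left(m_{1}^{\top}m_{1}-m_{2}^{\top}m_{2}\right),\tag{17}$$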

where $d_{ii}$ ’s and $t_{ii}$ ’s are the diagonal elements of $D_{\tau}$ and $T$ , respectively. The total number of flops required for evaluating the target is $O(nr^{2}+r^{3})\approx O(nr^{2})$ (since $r<<n$ ) which is considerably cheaper than the $O(n^{3})$ flops that would have been required for the analogous computations in a full Gaussian process model. In practice, Gaussian proposal distributions are employed for the Metropolis algorithm and all parameters with positive support are transformed to their logarithmic scale. Therefore, the necessary Jacobian adjustments are made to (17) by adding some scalar quantities with negligible computational costs.
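The likelihood evaluation described above can be sketched in NumPy as follows. This is an illustration rather than a reference implementation: `np.linalg.solve` stands in for a dedicated triangular solver such as `trsolve`, the inputs are simulated, the function name `lowrank_gaussian_loglik` is hypothetical, and the Gaussian normalizing constant is included so the value can be compared with a dense evaluation. It relies on $\log\det\Sigma_{y}=\sum_{i}\log d_{ii}-2\sum_{i}\log t_{ii}$ and $e^{\top}\Sigma_{y}^{-1}e=m_{1}^{\top}m_{1}-m_{2}^{\top}m_{2}$.

```python
import numpy as np

def lowrank_gaussian_loglik(e, B, V, d):
    """log N(e | 0, B V B' + diag(d)) using only r x r factorizations."""
    n, r = B.shape
    W = B / np.sqrt(d)[:, None]                          # W = D^{-1/2} B
    L = np.linalg.cholesky(np.linalg.inv(V) + W.T @ W)   # r x r Cholesky factor
    H = np.linalg.solve(L, W.T)                          # solves L H = W'
    m1 = e / np.sqrt(d)                                  # m1 = D^{-1/2} e
    m2 = H @ m1
    T = np.linalg.cholesky(np.eye(r) - H @ H.T)
    logdet = np.sum(np.log(d)) - 2.0 * np.sum(np.log(np.diag(T)))
    quad = m1 @ m1 - m2 @ m2                             # e' Sigma^{-1} e
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

rng = np.random.default_rng(4)
n, r = 80, 5
B = rng.normal(size=(n, r))                  # stand-in for B_theta
d = rng.uniform(0.5, 2.0, size=n)            # diagonal of D_tau
G = rng.normal(size=(r, r))
V = G @ G.T + r * np.eye(r)                  # positive definite V_{z,theta}
e = rng.normal(size=n)                       # stands in for y - X beta

ll = lowrank_gaussian_loglik(e, B, V, d)
```

Adding $\log p(\theta)+\log p(\tau)$ and any Jacobian terms to `ll` would give the log-target used in a Metropolis step.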

Starting with initial values for all parameters, each iteration of the MCMC executes the above calculations to provide a sample for $\{\theta,\tau\}$. The regression parameter $\beta$ is then sampled from its full conditional distribution. Writing $\Sigma_{y}=B_{\theta}V_{z,\theta}B_{\theta}^{\top}+D_{\tau}$ as in (16), the full conditional distribution for $\beta$ is $N(Aa,A)$, where $A^{-1}=V_{\beta}^{-1}+X^{\top}\Sigma_{y}^{-1}X$ and $a=V_{\beta}^{-1}\mu_{\beta}+X^{\top}\Sigma_{y}^{-1}y$. These are efficiently computed as $[f:F]=D_{\tau}^{-1/2}[y:X]$, $\tilde{F}=HF$ and setting $a=V_{\beta}^{-1}\mu_{\beta}+F^{\top}f-\tilde{F}^{\top}Hf$ and $L=\texttt{chol}(V_{\beta}^{-1}+F^{\top}F-\tilde{F}^{\top}\tilde{F})$, so that $A^{-1}=LL^{\top}$. We then compute $\beta=\texttt{trsolve}(L^{\top},\texttt{trsolve}(L,a))+\texttt{trsolve}(L^{\top},\tilde{Z})$, where $\tilde{Z}$ is a conformable vector of independent $N(0,1)$ variables. Solving against $L^{\top}$ gives the noise term the covariance $L^{-\top}L^{-1}=A$.
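A minimal NumPy sketch of this update for $\beta$ follows, under illustrative assumptions: simulated inputs, a vague prior with precision $I/100$ and mean zero, and hypothetical variable names. Generic `np.linalg.solve` calls stand in for a triangular solver. The noise term is solved against $L^{\top}$ so that its covariance is $(LL^{\top})^{-1}=A$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, p = 60, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
B = rng.normal(size=(n, r))                              # stand-in for B_theta
d = rng.uniform(0.5, 2.0, size=n)                        # diagonal of D_tau
G = rng.normal(size=(r, r))
V = G @ G.T + r * np.eye(r)                              # V_{z,theta}
y = rng.normal(size=n)
prior_mean = np.zeros(p)                                 # assumed prior mean for beta
prior_prec = np.eye(p) / 100.0                           # assumed prior precision

# H as in the likelihood evaluation: H = L_z^{-1} (D^{-1/2} B)'.
W = B / np.sqrt(d)[:, None]
Lz = np.linalg.cholesky(np.linalg.inv(V) + W.T @ W)
H = np.linalg.solve(Lz, W.T)

f = y / np.sqrt(d)                                       # D^{-1/2} y
F = X / np.sqrt(d)[:, None]                              # D^{-1/2} X
F_tilde = H @ F
a = prior_prec @ prior_mean + F.T @ f - F_tilde.T @ (H @ f)
L = np.linalg.cholesky(prior_prec + F.T @ F - F_tilde.T @ F_tilde)  # A^{-1} = L L'

cond_mean = np.linalg.solve(L.T, np.linalg.solve(L, a))  # A a
beta = cond_mean + np.linalg.solve(L.T, rng.normal(size=p))
```

Only $p\times p$ and $r\times r$ factorizations appear, so the cost per draw is dominated by forming $HF$.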

We repeat the above computations for each iteration of the MCMC algorithm using the current values of the process parameters in $\Sigma_{y}$. The algorithm described above will produce, after convergence, posterior samples for $\Omega=\{\theta,\tau,\beta\}$. We then sample from the posterior distribution $\displaystyle p(z\,|\,y)=\int p(z\,|\,\Omega,y)p(\Omega\,|\,y)d\Omega$, where $p(z\,|\,\Omega,y)=N(z\,|\,Aa,A)$ with $A=(V_{z,\theta}^{-1}+B_{\theta}^{\top}D_{\tau}^{-1}B_{\theta})^{-1}$ and $a=B_{\theta}^{\top}D_{\tau}^{-1}(y-X\beta)$. For each $\Omega$ drawn from $p(\Omega\,|\,y)$ we will need to draw a corresponding $z$ from $N(z\,|\,Aa,A)$. This will involve $\texttt{chol}(A)$. Since the number of knots $r$ is usually fixed at a value much smaller than $n$, obtaining $\texttt{chol}(A)$ is $\sim O(r^{3})$ and not as expensive. However, it will involve the inverse of $V_{z,\theta}$, which is computed using $\texttt{chol}(V_{z,\theta})$ and can be numerically unstable for certain smoother covariance functions such as the Gaussian or the Matérn with large $\nu$. A numerically more stable algorithm exploits the relation $A=Q-Q(V_{z,\theta}+Q)^{-1}Q$, where $Q^{-1}=B_{\theta}^{\top}D_{\tau}^{-1}B_{\theta}$. For each $\Omega$ sampled from $p(\Omega\,|\,y)$, we compute $L=\texttt{chol}(V_{z,\theta}+Q)$, $W=\texttt{trsolve}(L,Q)$ and then reset $L=\texttt{chol}(Q-W^{\top}W)$, which is the Cholesky factor of $A$. We generate an $r\times 1$ vector $Z^{*}\sim N(0,I_{r})$ and set $z=L(Z^{*}+L^{\top}a)$. Repeating this for each $\Omega$ drawn from $p(\Omega\,|\,y)$ produces a sample of $z$’s from $p(z\,|\,y)$.
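The more stable route to a draw of $z$ can be sketched as follows. The inputs are simulated and the names are hypothetical; the factor used for sampling is taken to be the Cholesky factor of $Q-W^{\top}W$, which equals $A$, and `np.linalg.solve` again stands in for a triangular solver.

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 60, 4
B = rng.normal(size=(n, r))                  # stand-in for B_theta
d = rng.uniform(0.5, 2.0, size=n)            # diagonal of D_tau
G = rng.normal(size=(r, r))
V = G @ G.T + r * np.eye(r)                  # V_{z,theta}
resid = rng.normal(size=n)                   # stands in for y - X beta

Q = np.linalg.inv(B.T @ (B / d[:, None]))    # Q = (B' D^{-1} B)^{-1}
a = B.T @ (resid / d)                        # a = B' D^{-1} (y - X beta)

Lq = np.linalg.cholesky(V + Q)               # no inverse of V is required
W = np.linalg.solve(Lq, Q)                   # solves Lq W = Q
A = Q - W.T @ W                              # A = Q - Q (V + Q)^{-1} Q
A = 0.5 * (A + A.T)                          # symmetrize against round-off
L = np.linalg.cholesky(A)

z = L @ (rng.normal(size=r) + L.T @ a)       # z ~ N(A a, A)
```

The draw has mean $LL^{\top}a=Aa$ and covariance $LL^{\top}=A$, and $V_{z,\theta}$ enters only through the sum $V_{z,\theta}+Q$.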

Finally, we seek predictive inference for $y(\ell_{0})$ at any arbitrary space-time coordinate $\ell_{0}$. Given $x^{\top}(\ell_{0})$, we draw $y(\ell_{0})\sim N\left(x^{\top}(\ell_{0})\beta+b_{\theta}^{\top}(\ell_{0})z,\tau^{2}\right)$ for every posterior sample of $\Omega$ and $z$. This yields the corresponding posterior predictive samples for $z(\ell_{0})$ and $y(\ell_{0})$. Posterior predictive samples of the latent process can also be easily computed as $z(\ell_{0})=b_{\theta}^{\top}(\ell_{0})z$ for each posterior sample of $z$ and $\theta$. Posterior predictive distributions at any of the observed $\ell_{i}$’s yield replicated data (see, e.g., Gelman et al., 2013) that can be used for model assessment and comparisons. Finley et al. (2015) provide more extensive implementation details for models such as (15) in the context of the spBayes package in R.

3 Sparsity-inducing nearest-neighbor Gaussian processes

Low-rank models have been, and continue to be, widely employed for analyzing spatial and spatiotemporal data. The algorithmic cost for fitting low-rank models typically decreases from $O(n^{3})$ to $O(nr^{2}+r^{3})\approx O(nr^{2})$ flops since $n\gg r$. However, when $n$ is large, empirical investigations suggest that $r$ must be fairly large to adequately approximate the parent process and the $nr^{2}$ flops become exorbitant. Furthermore, low-rank models can perform poorly depending upon the smoothness of the underlying process or when neighboring observations are strongly correlated and the spatial signal dominates the noise (Stein, 2014).

As an example, consider part of the simulation experiment presented in Datta et al. (2016a), where a spatial random field was generated over a unit square using a Gaussian process with fixed spatial process parameters over a set of $2500$ locations. We then fit a full Gaussian process model and a predictive process model with $64$ knots. Figure 2 presents the results (see Datta et al., 2016a, for details). While the estimated random field from the full Gaussian process is almost indistinguishable from the true random field, the surface obtained from the predictive process with $64$ knots substantially oversmooths. This oversmoothing can be ameliorated by using a larger number of knots, but that adds to the computational burden.

Figure 2 serves to reinforce findings that low-rank models may be limited in their ability to produce an accurate representation of the underlying process at massive scales. They will need a considerably larger number of basis functions to capture the features of the process and will require substantial computational resources for emulating results from a full GP. As the demands for analyzing large spatial datasets increase from the order of $\sim 10^{4}$ to $\sim 10^{6}$ locations, low-rank models may struggle to deliver acceptable inference. In this regard, enhancements such as the multi-resolution predictive process approximations referred to in Section 2.2 are highly promising.

An alternative is to develop full rank models that can exploit sparsity. Instead of deriving basis approximations for $w$, one could achieve computational gains by modeling either its covariance function or its inverse as sparse. Covariance tapering does the former by modeling $\mbox{var}\{w\}=K_{\theta}\odot K_{\mbox{tap},\nu}$, where $K_{\mbox{tap},\nu}$ is a sparse covariance matrix formed from a compactly supported, or tapered, covariance function with tapering parameter $\nu$ and $\odot$ denotes the elementwise (or Hadamard) product of two matrices. The Hadamard product of two positive definite matrices is again a positive definite matrix, so $K_{\theta}\odot K_{\mbox{tap},\nu}$ is positive definite. Furthermore, $K_{\mbox{tap},\nu}$ is sparse because a tapered covariance function is equal to $0$ for all pairs of locations separated by a distance beyond a threshold $\nu$. We refer the reader to Furrer et al. (2006), Kaufman et al. (2008) and Du et al. (2009) for further computational and theoretical details on covariance tapering. Covariance tapering is undoubtedly an attractive approach for constructing sparse covariance matrices, but its practical implementation for full Bayesian inference will generally require efficient sparse Cholesky decompositions, numerically stable determinant computations and, perhaps most importantly, effective memory management. These issues are yet to be tested for truly massive spatiotemporal datasets with $n\sim 10^{5}$ or more.
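A small NumPy sketch illustrates the tapering construction. The taper used here, $(1-h/\nu)_{+}^{4}(1+4h/\nu)$, is one Wendland-type choice that is a valid correlation function in up to three dimensions; it, the exponential parent covariance, the value $\nu=0.15$ and the function name `wendland_taper` are assumptions made for illustration only.

```python
import numpy as np

def wendland_taper(h, nu):
    """Compactly supported correlation; exactly 0 whenever h >= nu."""
    u = np.clip(h / nu, 0.0, 1.0)
    return (1.0 - u) ** 4 * (1.0 + 4.0 * u)

rng = np.random.default_rng(7)
locs = rng.uniform(size=(200, 2))                        # unit-square locations
dist = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)

K = np.exp(-3.0 * dist)                                  # dense exponential covariance
K_tap = wendland_taper(dist, nu=0.15)                    # sparse taper matrix
K_sparse = K * K_tap                                     # Hadamard product

frac_nonzero = np.count_nonzero(K_sparse) / K_sparse.size
min_eig = np.linalg.eigvalsh(K_sparse).min()             # positive if K_sparse is p.d.
```

The matrices are stored densely here for simplicity; any computational gain in practice comes from holding `K_sparse` in a sparse format and using a sparse Cholesky routine.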

Another way to exploit sparsity is to model the inverse of $\mbox{var}\{w\}$ as a sparse matrix. For finite-dimensional distributions, conditional and simultaneous autoregressive (CAR and SAR) models (see, e.g., Cressie, 1993; Banerjee et al., 2014, and references therein) adopt this approach for areally referenced datasets. More generally, Gaussian Markov random fields or GMRFs (see, e.g., Rue and Held, 2005) are widely used tools for constructing sparse precision matrices and have led to computational algorithms such as the Integrated Nested Laplace Approximation (INLA) developed by Rue et al. (2009). A subsequent article by Lindgren et al. (2011) shows how Gaussian processes can be approximated by GMRFs using computationally efficient sparse representations. Thus, a Gaussian process model with a dense covariance function is approximated by a GMRF with a sparse precision matrix. The approach is very computationally efficient for certain classes of covariance functions generated by a certain class of stochastic partial differential equations (including the versatile Matérn class), but its inferential performance on unobservable spatial, spatiotemporal or multivariate Gaussian processes (perhaps specified through more general covariance or cross-covariance functions) embedded within Bayesian hierarchical models is yet to be assessed.

Rather than working with approximations to the process, one could also construct massively scalable sparsity-inducing Gaussian processes that can be conveniently embedded within Bayesian hierarchical models and deliver full Bayesian inference for random fields at arbitrary resolutions. Section 3.1 describes how sparsity is introduced in the precision matrices for graphical Gaussian models by exploiting the relationship between the Cholesky decomposition of a positive definite matrix and conditional independence. These sparse Gaussian models (i.e., normal distributions with sparse precision matrices) can be used as prior models for a finite number of spatial random effects. Section 3.2 shows the construction of a process from these graphical Gaussian models. This process will be a Gaussian process whose finite-dimensional realizations will have sparse precision matrices. We call them Nearest Neighbor Gaussian Processes (NNGP). Finally, Section 3.3 outlines how the process can be embedded within hierarchical models and presents some brief simulation examples demonstrating certain aspects of inference from NNGP models.

3.1 Sparse Gaussian graphical models

Consider the hierarchical model (2) and, in particular, the expensive prior density $N(w\,|\,0,K_{\theta})$ . From the dense covariance matrix $K_{\theta}$ , we wish to obtain a covariance matrix $\tilde{K}_{\theta}$ such that $\tilde{K}_{\theta}^{-1}$ is sparse and, importantly, its determinant is available cheaply. What would be an effective way of achieving this? One approach would be to consider modeling the Cholesky decomposition of the precision matrix so that it is sparse. For example, forcing some elements in the dense half of the triangular Cholesky factor to be zero will introduce sparsity in the precision matrix. To precisely set out which elements should be made zero in the Cholesky factor, we borrow some fundamental notions of sparsity from graphical (Gaussian) models.

The underlying idea is, in fact, ubiquitous in graphical models or Bayesian networks (see, e.g., Lauritzen, 1996; Bishop, 2006; Murphy, 2012). The joint distribution for a random vector $w$ can be looked upon as a directed acyclic graph (DAG) where each node is a random variable $w_{i}$ . We write the joint distribution as

$$p(w)=\prod_{i=1}^{n}p(w_{i}\,|\,w_{\mbox{Pa}[i]}),$$

where $\mbox{Pa}[1]$ is the empty set and $\mbox{Pa}[i]=\{1,2,\ldots,i-1\}$ for $i=2,3,\ldots,n$ is the set of parent nodes with directed edges to $i$. This model is specific to the ordering (sometimes called “topological ordering”) of the nodes. The DAG corresponding to this factorization is shown in Figure 3(a) for $n=7$ nodes. One can refer to this as the full graphical model since $\mbox{Pa}[i]$ comprises all nodes preceding $i$ in the topological order. Shrinking $\mbox{Pa}[i]$ from the set of all nodes preceding $i$ to a smaller subset of parent nodes yields a different, but still valid, joint distribution. In spatial settings, each of the nodes in the DAG has associated spatial coordinates. Thus, the parents for any node $i$ can be chosen to include a certain fixed number of “nearest neighbors”, say based upon their distance from node $i$. For example, Figure 3(b) shows the DAG when some of the edges are deleted so as to retain at most $3$ nearest neighbors in the conditional probabilities. The resulting joint density is

[TABLE]

The above model posits that any node $i$, given its parents, is conditionally independent of any other node that is neither its parent nor one of its descendants.

Applying the above notion to multivariate Gaussian densities evinces the connection between conditional independence in DAGs and sparsity. Consider an $n\times 1$ random vector $w$ distributed as $N(0,K_{\theta})$ . Writing $N(w\,|\,0,K_{\theta})$ as $p(w_{1})\prod_{i=2}^{n}p(w_{i}\,|\,w_{1},w_{2},\ldots,w_{i-1})$ is equivalent to the following set of linear models,

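$$\begin{aligned}w_{1}&=0+\eta_{1};\\ w_{2}&=a_{21}w_{1}+\eta_{2};\\ &\;\;\vdots\\ w_{n}&=a_{n1}w_{1}+a_{n2}w_{2}+\cdots+a_{n,n-1}w_{n-1}+\eta_{n},\end{aligned}$$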

or, more compactly, simply $w=Aw+\eta$ , where $A$ is $n\times n$ strictly lower-triangular with elements $a_{ij}=0$ whenever $j\geq i$ and $\eta\sim N(0,D)$ and $D$ is diagonal with diagonal entries $d_{11}=\mbox{var}\{w_{1}\}$ and $d_{ii}=\mbox{var}\{w_{i}\,|\,w_{j}:j<i\}$ for $i=2,\ldots,n$ .

From the structure of $A$ it is evident that $I-A$ is nonsingular and $K_{\theta}=(I-A)^{-1}D(I-A)^{-\top}$. The possibly nonzero elements of $A$ and $D$ are completely determined by the matrix $K_{\theta}$. Let a[i,j], d[i,j] and K[i,j] denote the $(i,j)$-th entries of $A$, $D$ and $K_{\theta}$, respectively. Note that d[1,1] = K[1,1] and the first row of $A$ consists entirely of zeros. Pseudocode to compute the remaining elements of $A$ and $D$ is:

[TABLE]

Here a[i+1,1:i] is the $1\times\texttt{i}$ row vector comprising the possibly nonzero elements of the i+1-th row of $A$, K[1:i,1:i] is the $\texttt{i}\times\texttt{i}$ leading principal submatrix of $K_{\theta}$, K[1:i, i] is the $\texttt{i}\times 1$ column vector formed by the first i elements in the i-th column of $K_{\theta}$, K[i, 1:i] is the $1\times\texttt{i}$ row vector formed by the first i elements in the i-th row of $K_{\theta}$, solve(B,b) computes the solution for the linear system Bx = b, and dot(u,v) provides the inner product between vectors u and v. The determinant of $K_{\theta}$ is obtained with almost no additional cost: it is simply $\prod_{\texttt{i=1}}^{\texttt{n}}\texttt{d[i,i]}$.
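The dense construction of $A$ and $D$ can be sketched in NumPy with zero-based indexing. This is an illustrative rendering rather than the original pseudocode; the function name `dense_A_D` and the simulated exponential covariance are assumptions. Row $i$ of $A$ holds the coefficients from regressing $w_{i}$ on all preceding components, and `d[i]` is the corresponding conditional variance.

```python
import numpy as np

def dense_A_D(K):
    """Return strictly lower-triangular A and diag(D) with K = (I-A)^{-1} D (I-A)^{-T}."""
    n = K.shape[0]
    A = np.zeros((n, n))
    d = np.zeros(n)
    d[0] = K[0, 0]                                    # var{w_1}
    for i in range(1, n):
        # Regress w_i on w_0, ..., w_{i-1}: an i x i linear system.
        A[i, :i] = np.linalg.solve(K[:i, :i], K[:i, i])
        d[i] = K[i, i] - K[i, :i] @ A[i, :i]          # conditional variance
    return A, d

rng = np.random.default_rng(8)
locs = rng.uniform(size=(30, 2))
dist = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
K = np.exp(-3.0 * dist)                               # example dense covariance

A, d = dense_A_D(K)
logdet_K = np.sum(np.log(d))                          # log det K from D alone
```

Because the systems solved inside the loop grow to size $(n-1)\times(n-1)$, this version offers no savings; it serves only to make the factorization explicit.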

The above pseudocode provides a way to obtain the Cholesky decomposition of $K_{\theta}$ . If $K_{\theta}=LDL^{\top}$ is the Cholesky decomposition, then $L=(I-A)^{-1}$ . There is, however, no apparent gain to be had from the preceding computations since one will need to solve increasingly larger linear systems as the loop runs into higher values of i. Nevertheless, it immediately shows how to exploit sparsity if we set some of the elements in the lower triangular part of $A$ to be zero. For example, suppose we set at most m elements in each row of $A$ to be nonzero. Let N[i] be the set of indices $\texttt{j}<\texttt{i}$ such that $\texttt{a[i,j]}\neq 0$ . We can compute the nonzero elements of $A$ and the diagonal elements of $D$ much more efficiently as:

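    for(i in 1:(n-1)) {
      Pa = N[i+1]    # neighbors of i+1
      a[i+1,Pa] = solve(K[Pa,Pa], K[Pa,i+1])
      d[i+1,i+1] = K[i+1,i+1] - dot(K[i+1,Pa], a[i+1,Pa])
    }
    (28)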

In (28) we solve n-1 linear systems of size at most $\texttt{m}\times\texttt{m}$ . This can be performed in $\sim\texttt{nm}^{3}$ flops, whereas the earlier pseudocode in (22) for the dense model required $\sim\texttt{n}^{3}$ flops. These computations can be performed in parallel as each iteration of the loop is independent of the others.
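A minimal Python sketch of the sparse recursion follows. The function name `sparse_A_D`, the six locations, the exponential covariance and the choice of the `m` most recently ordered points as neighbors are all assumptions made for illustration; the article does not prescribe them. Rows of $A$ are stored as dictionaries so that only the nonzero entries are kept.

```python
import math

def solve(B, b):
    """Solve B x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(B[r]) + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sparse_A_D(K, N):
    """N[i] lists the indices j < i that may be nonzero in row i of A."""
    n = len(K)
    A = [dict() for _ in range(n)]  # row i stored as {j: a_ij}
    d = [K[0][0]] + [0.0] * (n - 1)
    for i in range(1, n):  # iterations are independent and could run in parallel
        Pa = N[i]
        a = solve([[K[p][q] for q in Pa] for p in Pa], [K[p][i] for p in Pa])
        A[i] = dict(zip(Pa, a))
        d[i] = K[i][i] - sum(K[i][p] * ap for p, ap in zip(Pa, a))
    return A, d

# Hypothetical example with m = 2 neighbors per row.
m = 2
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (2.0, 1.0)]
K = [[math.exp(-math.dist(s, t)) for t in locs] for s in locs]
N = [list(range(max(0, i - m), i)) for i in range(len(locs))]
A, d = sparse_A_D(K, N)
```

Each linear system here has size at most `m`, regardless of `n`, which is the source of the $\sim\texttt{nm}^{3}$ flop count.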

The above discussion provides a very useful strategy for introducing sparsity in a precision matrix. Let $K_{\theta}$ and $K^{-1}_{\theta}$ both be dense $n\times n$ positive definite matrices. Suppose we use the pseudocode in (28) with $\texttt{K}=K_{\theta}$ to construct a sparse strictly lower-triangular matrix $A$ with no more than $m$ non-zero entries in each row, where $m$ is considerably smaller than $n$ , and the diagonal matrix $D$ . The resulting matrix $\tilde{K}_{\theta}=(I-A)^{-1}D(I-A)^{-\top}$ is a covariance matrix whose inverse $\tilde{K}^{-1}_{\theta}=(I-A^{\top})D^{-1}(I-A)$ is sparse. Figure 4 presents a visual representation of the sparsity. While $\tilde{K}_{\theta}$ need not be sparse, the density $N(w\,|\,0,\tilde{K}_{\theta})$ is cheap to compute since $\tilde{K}^{-1}_{\theta}$ is sparse and $\det(\tilde{K}_{\theta})=\det(D)=\prod_{i=1}^{n}\texttt{d[i,i]}$ is calculated from (28). Therefore, one way to achieve massive scalability for models such as (2) is to assume that $w$ has prior $N(w\,|\,0,\tilde{K}_{\theta})$ instead of $N(w\,|\,0,K_{\theta})$ .
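To see why the density is cheap, note that $w^{\top}\tilde{K}^{-1}_{\theta}w=\sum_{i}(w_{i}-\sum_{j}a_{ij}w_{j})^{2}/d_{ii}$, so the log-density needs only the nonzero entries of $A$ and the diagonal of $D$. The sketch below assumes a hypothetical function name `nngp_logdensity` and uses an AR(1)-type covariance $K_{ij}=\rho^{|i-j|}$, for which a single neighbor ($m=1$) reproduces the full covariance exactly; these choices are illustrative and not from the article.

```python
import math

def nngp_logdensity(w, A, d):
    """log N(w | 0, K~), where K~^{-1} = (I - A)^T D^{-1} (I - A).

    A[i] is a dict {j: a_ij} of the nonzero entries in row i; d holds diag(D).
    """
    n = len(w)
    quad = 0.0
    for i in range(n):
        resid = w[i] - sum(a * w[j] for j, a in A[i].items())
        quad += resid * resid / d[i]
    logdet = sum(math.log(di) for di in d)  # log det(K~) = sum of log d[i]
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + quad)

# Hypothetical AR(1) example: each point has its predecessor as the only neighbor.
rho = 0.6
n = 4
A = [{}] + [{i - 1: rho} for i in range(1, n)]
d = [1.0] + [1.0 - rho ** 2] * (n - 1)
w = [0.2, -0.1, 0.4, 0.3]
value = nngp_logdensity(w, A, d)
```

The cost is proportional to the number of nonzero entries in $A$, that is, at most $nm$ multiplications once $A$ and $D$ are available.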

3.2 From distributions to processes

If we are interested in estimating the spatial or spatiotemporal process parameters from a finite collection of random variables, then we can use the approach in Section 3.1 with $w_{i}:=w(\ell_{i})$. In spatial settings, matters are especially convenient as we can delete the edges in the DAG based upon the distances among the $\ell_{i}$’s. In fact, one can decide to retain at most $m$ of the nearest neighbors for each location and delete all remaining edges. This implies that the $(i,j)$-th element of $A$ in Section 3.1 will be nonzero only if $\ell_{j}$ is one of the $m$ nearest neighbors of $\ell_{i}$. This idea has been effectively used to construct composite likelihoods for Gaussian process models by Vecchia (1988) and Stein et al. (2004), while Stroud et al. (2017) exploit it to propose preconditioned conjugate gradient algorithms for Bayesian and maximum likelihood estimates on large incomplete lattices.

Localized Gaussian process regression based on a few nearest neighbors has also been used to obtain fast kriging estimates. Emery (2009) provides fast updates for kriging equations after adding a new location to the input set. Iterative application of this algorithm yields a localized kriging estimate based on a small set of locations (including a few nearest neighbors). The local estimate often provides an excellent approximation to the global kriging estimate, which uses data observed at all the locations to predict at a new location. However, this assumes that the parameters associated with the mean and covariance of the GP are known or already estimated. Local Approximation GP, or LAGP (Gramacy and Apley, 2015; Gramacy and Haaland, 2016; Gramacy, 2016), extends this further to estimate the parameters at each new location. It essentially provides a non-stationary local approximation to a Gaussian Process at every predictive location and can be used to interpolate or smooth the observed data.

If, however, posterior predictive inference is sought at arbitrary spatiotemporal resolutions, i.e., for the entire process $\{w(\ell):\ell\in{\cal L}\}$, then the ideas in Section 3.1 need to be extended to process-based models. Recently, Datta et al. (2016a) proposed a Nearest Neighbor Gaussian Process (NNGP) for modeling large spatial data. The NNGP is a well-defined Gaussian Process over a domain ${\cal L}$ and yields finite dimensional Gaussian densities with sparse precision matrices. It has also been extended to a dynamic NNGP with dynamic neighbor selection for massive spatiotemporal data (Datta et al., 2016b). The NNGP delivers massive scalability both in terms of parameter estimation and kriging. Unlike low rank processes, it does not oversmooth and accurately emulates the inference from full rank GPs.

We will construct the NNGP in two steps. First, we specify a multivariate Gaussian distribution over a fixed finite set of $r$ points in ${\cal L}$, say ${\cal R}=\{\ell_{1}^{*},\ell_{2}^{*},\ldots,\ell_{r}^{*}\}$, which we call the reference set. The reference set can be very large. It can be a fine grid of points over ${\cal L}$ or one can simply take $r=n$ and let ${\cal R}$ be the set of observed points in ${\cal L}$. We require that the inverse of the covariance matrix be sparse and computationally efficient. Therefore, we specify that $w_{{\cal R}}\sim N(0,\tilde{K}_{\theta})$, where $w_{{\cal R}}$ is the $r\times 1$ vector with elements $w(\ell_{i}^{*})$ and $\tilde{K}_{\theta}$ is a covariance matrix such that $\tilde{K}^{-1}_{\theta}$ is sparse. The matrix $\tilde{K}_{\theta}$ is constructed from a dense covariance matrix $K_{\theta}$ as described in Section 3.1. This provides a highly effective approximation (Vecchia, 1988; Stein et al., 2004) as below:

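$$N(w_{{\cal R}}\,|\,0,K_{\theta})=\prod_{i=1}^{r}p(w(\ell_{i}^{*})\,|\,w_{H(\ell_{i}^{*})})\approx\prod_{i=1}^{r}p(w(\ell_{i}^{*})\,|\,w_{N(\ell_{i}^{*})})=N(w_{{\cal R}}\,|\,0,\tilde{K}_{\theta})\;,\qquad(29)$$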

where we define history sets $H(\ell_{i}^{*})$ so that $H(\ell^{*}_{1})$ is the empty set and $H(\ell^{*}_{i})=\{\ell^{*}_{1},\ell^{*}_{2},\ldots,\ell^{*}_{i-1}\}$ for $i=2,3,\ldots,r$, and we have much smaller neighbor sets $N(\ell_{i}^{*})\subseteq H(\ell_{i}^{*})$ for each $\ell_{i}^{*}$ in ${\cal R}$. We have legitimate probability models for any choice of the $N(\ell_{i}^{*})$’s as long as $N(\ell_{i}^{*})\subseteq H(\ell_{i}^{*})$. One easy specification is to define $N(\ell^{*}_{i})$ as the set of $m$ nearest neighbors of $\ell_{i}^{*}$ among the points in its history set $H(\ell_{i}^{*})$. Therefore,

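$$N(\ell_{i}^{*})=\begin{cases}\text{empty set} & \text{for } i=1\;,\\ \{\ell_{1}^{*},\ell_{2}^{*},\ldots,\ell_{i-1}^{*}\} & \text{for } i=2,3,\ldots,m\;,\\ m\text{ nearest neighbors of }\ell_{i}^{*}\text{ among }\{\ell_{1}^{*},\ell_{2}^{*},\ldots,\ell_{i-1}^{*}\} & \text{for } i=m+1,\ldots,r\;.\end{cases}$$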

If $m\,(\ll r)$ denotes the limiting size of the neighbor sets $N(\ell)$, then $\tilde{K}^{-1}_{\theta}$ has at most $O(rm^{2})$ nonzero elements. Hence, the approximation in (29) produces a sparsity-inducing proper prior distribution for random effects over ${\cal R}$ that closely approximates the realizations from a $GP(0,K_{\theta})$.
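The nearest-neighbor specification of the sets $N(\ell_{i}^{*})$ can be sketched in a few lines of Python. The function name `neighbor_sets`, the Euclidean metric and the five reference points are assumptions chosen for the example; the brute-force search shown here costs $O(r^{2})$ distance evaluations and would be replaced by a spatial index for large $r$.

```python
import math

def neighbor_sets(ref, m):
    """For each reference point i, return the indices of its (at most) m nearest
    points among its history set {0, ..., i-1}, using Euclidean distance."""
    N = [[]]  # the first point has an empty history, hence no neighbors
    for i in range(1, len(ref)):
        hist = sorted(range(i), key=lambda j: math.dist(ref[i], ref[j]))
        N.append(sorted(hist[:m]))
    return N

# Hypothetical reference set of five points in the plane.
ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.4)]
N = neighbor_sets(ref, m=2)
```

Because every neighbor index is smaller than the index of the point itself, the resulting graph is acyclic by construction, whatever the value of `m`.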

To construct the NNGP we extend the above model to arbitrary locations. We define neighbor sets $N(\ell)$ for any $\ell\in{\cal L}$ as the set of $m$ nearest neighbors of $\ell$ in ${\cal R}$ . Thus, $N(\ell)\subseteq{\cal R}$ and the process can be derived from $p(w_{{\cal R}},w(\ell)\,|\,\theta)=N(w_{{\cal R}}\,|\,0,\tilde{K}_{\theta})\times p\left(w(\ell)\,|\,w_{N(\ell)},\theta\right)$ or, equivalently, by writing

$$w(\ell)=\sum_{i=1}^{r}a_{i}(\ell)w(\ell_{i}^{*})+\eta(\ell)\;,\qquad(30)$$

where $a_{i}(\ell)=0$ whenever $\ell_{i}^{*}\notin N(\ell)$ , $\eta(\ell)\stackrel{{\scriptstyle ind}}{{\sim}}N(0,\delta^{2}(\ell))$ is a process independent of $w(\ell)$ , $\mbox{Cov}\{\eta(\ell),\eta(\ell^{\prime})\}=0$ for any two distinct points in ${\cal L}$ , and

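$$\delta^{2}(\ell)=K_{\theta}(\ell,\ell)-K_{\theta}(\ell,N(\ell))\,K_{\theta}(N(\ell),N(\ell))^{-1}K_{\theta}(N(\ell),\ell)\;.$$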

Taking conditional expectations in (30) yields $\mbox{E}[w(\ell)\,|\,w_{N(\ell)}]=\sum_{i:\ell_{i}^{*}\in N(\ell)}a_{i}(\ell)w(\ell_{i}^{*})$, which implies that for each $\ell$ the nonzero $a_{i}(\ell)$’s are obtained by solving an $m\times m$ linear system. The above construction ensures that $w(\ell)$ is a legitimate Gaussian process whose realizations over any finite collection of arbitrary points in ${\cal L}$ will have a multivariate normal distribution with a sparse precision matrix. More formal developments and technical details in the spatial and spatiotemporal settings can be found in Datta et al. (2016a) and Datta et al. (2016b), respectively.
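The $m\times m$ system for the weights at a new location can be sketched as follows. This is an illustration only: the names `kernel` and `nngp_weights`, the exponential covariance with unit variance and decay, and the four reference points are assumptions rather than part of the article. The conditional variance is computed as the usual Gaussian conditional variance of $w(\ell)$ given its neighbors.

```python
import math

def solve(B, b):
    """Solve B x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(B[r]) + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def kernel(s, t, sigma2=1.0, phi=1.0):
    """Illustrative exponential covariance; any valid K_theta could be used."""
    return sigma2 * math.exp(-phi * math.dist(s, t))

def nngp_weights(loc, ref, m):
    """Return (neighbor indices, weights a(loc), conditional variance delta^2(loc))."""
    nb = sorted(range(len(ref)), key=lambda j: math.dist(loc, ref[j]))[:m]
    Knn = [[kernel(ref[p], ref[q]) for q in nb] for p in nb]
    kn = [kernel(loc, ref[p]) for p in nb]
    a = solve(Knn, kn)  # an m x m linear system
    delta2 = kernel(loc, loc) - sum(k * w for k, w in zip(kn, a))
    return nb, a, delta2

# Hypothetical reference set and new location.
ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
nb, a, delta2 = nngp_weights((0.2, 0.1), ref, m=2)
```

With a single neighbor the weight reduces to the correlation between the new location and that neighbor, which offers a quick sanity check on the code.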

One point worth considering is the definition of “neighbors.” There is some flexibility here. In the spatial setting, the correlation functions usually decay with increasing inter-site distance, so the set of nearest neighbors based on the inter-site distances represents locations exhibiting highest correlation with the given location. For example, on the plane one could simply use the Euclidean metric to construct neighbor sets, although Stein et al. (2004) recommend including a few points that are farther apart. The neighbor sets can be fixed before the model fitting exercise.

In spatiotemporal settings, matters are more complicated. Spatiotemporal covariances between two points typically depend on the spatial as well as the temporal lag between the points. Non-separable isotropic spatiotemporal covariance functions can be written as $K_{\theta}((s_{1},t_{1}),(s_{2},t_{2}))=K_{\theta}(h,u)$ where $h=\|s_{1}-s_{2}\|$ and $u=|t_{1}-t_{2}|$. This often precludes defining any universal distance function $d:({\cal S}\times{\cal T})^{2}\rightarrow\Re^{+}$ such that $K_{\theta}((s_{1},t_{1}),(s_{2},t_{2}))$ will be monotonic with respect to $d((s_{1},t_{1}),(s_{2},t_{2}))$ for all choices of $\theta$. This makes it difficult to define universal nearest neighbors in spatiotemporal domains. To obviate this hurdle, Datta et al. (2016b) define “nearest neighbors” in a spatiotemporal domain using the spatiotemporal covariance function itself as a proxy for distance. This can work for arbitrary domains. For any three points $\ell_{1}$, $\ell_{2}$ and $\ell_{3}$, we say that $\ell_{1}$ is nearer to $\ell_{2}$ than to $\ell_{3}$ if $K_{\theta}(\ell_{1},\ell_{2})>K_{\theta}(\ell_{1},\ell_{3})$. Subsequently, this definition of “distance” is used to find $m$ nearest neighbors for any location. Prediction at any arbitrary location $\ell\notin{\cal R}$ is performed by sampling from the posterior predictive distribution. However, for every point $\ell_{i}$, its neighbor set $N_{\theta}(\ell_{i})$ will now depend on $\theta$ and can change from iteration to iteration in the estimation algorithm. If $\theta$ were known, one could have simply evaluated the pairwise correlations between any point $\ell_{i}^{*}$ in ${\cal R}$ and all points in its history set $H(\ell_{i}^{*})$ to obtain $N_{\theta}(\ell_{i}^{*})$, the set of $m$ true nearest neighbors. In practice, however, $\theta$ is unknown and for every new value of $\theta$ in an iterative algorithm, we need to search for the neighbor sets within the history sets. 
Since the history sets are quite large, searching the entire space for nearest neighbors in each iteration will be computationally unfeasible. Datta et al. (2016b) offer some smart strategies for selecting spatiotemporal neighbors. They propose restricting the search for the neighbor sets to carefully constructed small subsets of the history sets. These small eligible sets $E(\ell_{i}^{*})$ are constructed in such a manner that, despite being much smaller than the history sets, they are guaranteed to contain the true nearest neighbor sets. This strategy works when we choose $m$ to be a perfect square and the original nonseparable covariance function $K_{\theta}(h,u)$ satisfies natural monotonicity, i.e. $K_{\theta}(h,u)$ is decreasing in $h$ for fixed $u$ and decreasing in $u$ for fixed $h$. All Matérn-based space-time separable covariances and many non-separable classes of covariance functions possess this property (Stein, 2013; Omidi and Mohammadzadeh, 2015).
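The dependence of covariance-based neighbors on $\theta$ can be seen in a toy example. The sketch below searches the entire history set by brute force; it does not implement the eligible-set construction of Datta et al. (2016b). The separable covariance in `cov`, the function name `cov_neighbors`, the parameter names `phi_s` and `phi_t`, and the three space-time points are assumptions made for illustration.

```python
import math

def cov(p, q, phi_s, phi_t, sigma2=1.0):
    """Illustrative separable space-time covariance, decreasing in both lags."""
    (s1, t1), (s2, t2) = p, q
    return sigma2 * math.exp(-phi_s * math.dist(s1, s2) - phi_t * abs(t1 - t2))

def cov_neighbors(points, i, m, phi_s, phi_t):
    """Indices of the m history points (j < i) having the largest covariance with point i."""
    hist = sorted(range(i), key=lambda j: -cov(points[i], points[j], phi_s, phi_t))
    return sorted(hist[:m])

# Point 0 is one unit away in space, point 1 is one unit away in time.
points = [((1.0, 0.0), 0.0), ((0.0, 0.0), 1.0), ((0.0, 0.0), 0.0)]
spatial_first = cov_neighbors(points, 2, 1, phi_s=1.0, phi_t=2.0)
temporal_first = cov_neighbors(points, 2, 1, phi_s=2.0, phi_t=1.0)
print(spatial_first, temporal_first)
```

When temporal decay is faster than spatial decay, the spatially displaced point is the nearest neighbor, and the roles reverse when the decay rates are swapped, so the neighbor set must be recomputed whenever $\theta$ changes.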

3.3 Hierarchical NNGP models

We briefly turn to model fitting and estimation. For the approximation in (29) to be effective, the size of the reference set, $r$ , needs to be large enough to represent the spatial domain. However, this does not impede computations involving NNGP models because the storage and number of floating point operations are always linear in $r$ . The reference set ${\cal R}$ can, in principle, be any finite set of locations in the study domain. A particularly convenient choice, in practice, is to simply take ${\cal R}$ to be the set of observed locations in the dataset. Datta et al. (2016a) demonstrate through extensive simulation experiments and a real application that this simple choice seems to be very effective.

Since the NNGP is a proper Gaussian process, we can use it as a prior for the spatial random effects in any hierarchical model. We write $w(\ell)\sim NNGP(0,\tilde{K}_{\theta}(\cdot,\cdot))$ , where $\tilde{K}_{\theta}(\ell,\ell^{\prime})$ is the covariance function for the NNGP (see Datta et al., 2016a, for a closed form expression). For example, with $r=n$ and ${\cal R}$ the set of observed locations, one can build a scalable Bayesian hierarchical model exactly as with a usual spatial process, but assigning an NNGP to the spatial random effects. Here is a simple NNGP-based spatial model with a first stage exponential family model:

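$$\begin{aligned}
(\theta,\beta,\tau)&\sim p(\theta,\beta,\tau)\;;\quad w(\ell)\,|\,\theta\sim NNGP(0,\tilde{K}_{\theta}(\cdot,\cdot))\;;\\
y(\ell_{i})\,|\,\beta,w(\ell_{i}),\tau&\stackrel{ind}{\sim}P_{\tau}\;,\quad g(\mbox{E}[y(\ell_{i})\,|\,\beta,w(\ell_{i}),\tau])=x^{\top}(\ell_{i})\beta+w(\ell_{i})\;,\quad i=1,2,\ldots,n\;,\qquad(34)
\end{aligned}$$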

where $P_{\tau}$ is an exponential family distribution with link function $g(\cdot)$. Posterior sampling from (34) is customarily performed using Gibbs sampling with Metropolis steps. Computational benefits emerge because the full conditional distribution satisfies $p(w(\ell_{i})\,|\,w_{{\cal R}},\theta,\beta,\tau)=p(w(\ell_{i})\,|\,w_{N(\ell_{i})},\theta,\beta,\tau)$, where $w_{N(\ell_{i})}$ is an $m\times 1$ subvector of $w_{{\cal R}}$. Prediction at any arbitrary location $\ell\notin{\cal R}$ is performed by sampling from the posterior predictive distribution. For each draw of $\{w_{{\cal R}},\beta,\theta,\tau\}$ from $p(w_{{\cal R}},\beta,\tau,\theta\,|\,y)$, we draw a $w(\ell)$ from $N(a^{\top}(\ell)w_{N(\ell)},\delta^{2}(\ell))$ and $y(\ell)$ from $p(y(\ell)\,|\,\beta,w(\ell),\tau)$, where $y$ is the vector of observed outcomes and $a(\ell)$ is a vector of the nonzero $a_{j}(\ell)$’s in (30).
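The composition step in the predictive sampler can be sketched for one simple special case. The code assumes a Gaussian first stage with identity link and nugget variance `tau2`, which is only one member of the exponential family covered above; the function name `predict_draw` and all numerical values are hypothetical stand-ins for quantities that would come from a posterior draw.

```python
import random

def predict_draw(a, w_nb, delta2, xb, tau2, rng):
    """One composition draw at a new location.

    a      : nonzero weights a(l) for the m neighbors
    w_nb   : current posterior draw of w at those neighbors
    delta2 : conditional variance delta^2(l)
    xb     : linear predictor x(l)^T beta for the current draw of beta
    tau2   : nugget variance (Gaussian first stage, an assumption of this sketch)
    """
    mean_w = sum(ai * wi for ai, wi in zip(a, w_nb))
    w_new = rng.gauss(mean_w, delta2 ** 0.5)    # w(l) | w_N(l), theta
    y_new = rng.gauss(xb + w_new, tau2 ** 0.5)  # y(l) | beta, w(l), tau
    return w_new, y_new

rng = random.Random(1)
# Hypothetical values standing in for a single posterior draw.
a = [0.55, 0.30]
w_nb = [0.8, -0.2]
draws = [predict_draw(a, w_nb, delta2=0.25, xb=1.0, tau2=0.1, rng=rng)
         for _ in range(1000)]
```

In an actual analysis this function would be called once per retained posterior draw, with `a` and `delta2` recomputed from the current value of $\theta$.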

Another, even simpler, example could be modeling a continuous outcome itself as an NNGP. Let the desired full GP specification be $Y(\ell)\sim GP(x^{\top}(\ell)\beta,K_{\theta}(\cdot,\cdot))$ . We derive the NNGP from this $K_{\theta}$ and obtain

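$$(\theta,\beta)\sim p(\theta,\beta)\;;\quad Y(\ell)\,|\,\theta,\beta\sim NNGP(x^{\top}(\ell)\beta,\tilde{K}_{\theta}(\cdot,\cdot))\;.\qquad(35)$$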

The above model is extremely fast. The likelihood is of the form $y\sim N(X\beta,\tilde{K}_{\theta})$ , where $\tilde{K}^{-1}_{\theta}=(I-A^{\top})D^{-1}(I-A)$ is sparse and $A$ and $D$ are obtained from (28) efficiently in parallel. The parameter space of interest is $\{\theta,\beta\}$ , which is much smaller than for (34) where the latent spatial process also was unknown. While (35) does not separate the residuals into a spatial process and a measurement error process, one can still include measurement error variance, or the nugget, in (35). Here, one would absorb the nugget into $\theta$ . For example, we could write the likelihood in (1) as $N(y\,|\,X\beta,K_{\theta})$ , where $K_{\theta}=\sigma^{2}R_{\phi}+\tau^{2}I_{n}$ , $R_{\phi}$ is a spatial correlation matrix and $\theta=\{\sigma^{2},\phi,\tau^{2}\}$ . These will also feature in the derived NNGP covariance matrix $\tilde{K}_{\theta}$ . We can predict the outcome at an arbitrary point $\ell$ by sampling from the posterior predictive distribution as follows: for each draw of $\{\beta,\theta\}$ from $p(\beta,\theta\,|\,y)$ , we draw a $y(\ell)$ from $N(y(\ell)\,|\,x^{\top}(\ell)\beta,\delta^{2}(\ell))$ . Note, however, that there is no latent smooth process $w(\ell)$ in (35) and inference on the latent spatial process is precluded.

Likelihood computations in NNGP models usually involve $O(nm^{3})$ flops. One does not need to store $n\times n$ matrices, only $m\times m$ matrices, which leads to storage $\sim nm^{2}$. Substantial computational savings accrue because $m$ is usually very small. Datta et al. (2016a) demonstrate that fitting NNGP models to the simulated data in Figure 2 with as few as $m=10$ neighbors produces posterior estimates of the spatial surface indistinguishable from Figures 2(a) and 2(b). In fact, simulation experiments in Datta et al. (2016a) and Datta et al. (2016b) also affirm that $m$ can usually be taken to be very small compared to $r$; there seems to be no inferential advantage to taking $m$ to exceed 15, even for datasets with over $10^{5}$ spatial locations. For example, Figure 5 shows the $95\%$ posterior credible intervals for a series of 10 simulation experiments where the true effective range was fixed at values from 0.1 to 1.0 in increments of 0.1. Each dataset comprised $2500$ points. Even with $m=10$ neighbors, the credible intervals for the effective spatial range from the NNGP model were very consistent with those from the full GP model. Datta et al. (2016a) present simulations using the Matérn and other covariance functions revealing very similar behavior.

Another important point to note is that $\tilde{K}_{\theta}$ is not invariant to the order in which we define $H(\ell_{1})\subseteq H(\ell_{2})\subseteq\cdots\subseteq H(\ell_{r})$ (i.e., the topological order). Vecchia (1988) and Stein et al. (2004) both assert that the approximation in (29) is not sensitive to this ordering. This is corroborated by simulation experiments by Datta et al. (2016a), but a recent manuscript by Guinness (2016) has indicated sensitivity to the ordering in terms of model deviance. We conducted some preliminary experiments to examine the effect of the topological order. In one simple experiment we generated data from the “true” model in (1) for $6400$ spatial locations arranged over an $80\times 80$ grid. The parameter $\beta$ in (1) was set to [math], the covariance function was specified as $K_{\theta}(\ell_{i},\ell_{j})=\sigma^{2}\exp(-\phi\|\ell_{i}-\ell_{j}\|)$, and $\epsilon(\ell_{i})\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\tau^{2})$ with the true values of $\sigma^{2}$, $\phi$ and $\tau^{2}$ given in the second column of Table 2. Four different NNGP models corresponding to (35), with $\tilde{K}_{\theta}$ derived from $K_{\theta}=\sigma^{2}R_{\phi}+\tau^{2}I$ and $R_{\phi}$ having elements $\exp(-\phi\|\ell_{i}-\ell_{j}\|)$, were fitted to the simulated data. Each of these models was constructed with $m=10$ nearest neighbors, but with a different ordering of the points $\ell=(x,y)$. The points were ordered according to the sum of the coordinates $x+y$, a maximum-minimum distance (MMD) criterion proposed by Guinness (2016), the $x$ coordinate, and the $y$ coordinate. Table 2 presents a comparison of these NNGP models. Irrespective of the ordering of the points, the inference with respect to parameter estimates and predictive performance is extremely robust, and the four models are effectively indistinguishable from each other. 
However, the posterior mean of the Kullback-Leibler divergence of these models from the true generating model revealed that the divergence under the MMD ordering proposed by Guinness (2016) is indeed smaller than under the other three orderings. Further explorations are currently being conducted to see how this behavior changes for more complex nonstationary models and in more general settings.
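The four orderings can be sketched on a small grid. The coordinate-based orderings are simple sorts. For the maximum-minimum distance ordering, the code uses a greedy rule in which each new point maximizes its minimum distance to the points already chosen; this greedy form, the choice of starting point, and the function names `order_by` and `maxmin_order` are assumptions of the sketch and may differ in detail from the procedure in Guinness (2016).

```python
import math

def order_by(points, key):
    """Indices of points sorted by a scalar key (ties keep their original order)."""
    return sorted(range(len(points)), key=lambda i: key(points[i]))

def maxmin_order(points, start=0):
    """Greedy ordering: each new point maximizes its minimum distance to those chosen."""
    order = [start]
    mind = [math.dist(p, points[start]) for p in points]
    while len(order) < len(points):
        remaining = (i for i in range(len(points)) if i not in order)
        nxt = max(remaining, key=lambda i: mind[i])
        order.append(nxt)
        mind = [min(mind[i], math.dist(points[i], points[nxt]))
                for i in range(len(points))]
    return order

# Hypothetical 3 x 3 grid; point with index 3*x + y has coordinates (x, y).
grid = [(x, y) for x in range(3) for y in range(3)]
by_sum = order_by(grid, lambda p: p[0] + p[1])
by_x = order_by(grid, lambda p: p[0])
by_y = order_by(grid, lambda p: p[1])
mmd = maxmin_order(grid, start=4)  # start from the center of the grid
```

The coordinate orderings sweep across the domain, whereas the greedy ordering visits far-apart points first, so early points in the sequence have neighbors spread over the whole domain.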

4 Discussion and future directions

The article has attempted to provide some insight into constructing highly scalable Bayesian hierarchical models for very large spatiotemporal datasets using low-rank and sparsity-inducing processes. Such models are increasingly being employed to answer complex scientific questions and analyze massive spatiotemporal datasets in the natural and environmental sciences. Any standard Bayesian estimation algorithm, such as Markov chain and Hamiltonian Monte Carlo (see, e.g., Robert and Casella, 2004; Brooks et al., 2011; Gelman et al., 2013; Neal, 2011; Hoffman and Gelman, 2014), Integrated Nested Laplace Approximations (Rue et al., 2009), and Variational Bayes (see, e.g., Bishop, 2006) can be used for fitting these models. The models ensure that the algorithmic complexity is $\sim n$ floating point operations (flops) per iteration, where $n$ is the number of spatial locations. Storage requirements are also linear in $n$. Methods such as the multiresolution predictive process (Katzfuss, 2017) and the NNGP (Datta et al., 2016a) can scale up to datasets in the order of $\sim 10^{6}$ spatial and/or temporal points without sacrificing richness in the model.

While the NNGP certainly seems to have an edge in scalability over the more conventional low-rank or fixed-rank models, it is premature to say whether its inferential performance will always excel over low-rank or fixed-rank models. For example, analyzing complex nonstationary random fields may pose challenges regarding construction of neighbor sets, as a simple distance-based definition of neighbors may prove to be inadequate. Multiresolution basis functions may be more adept at capturing nonstationarity, but may struggle with massive datasets. Dynamic neighbor selection for nonstationary fields, where neighbors will be chosen based upon the covariance kernel itself, analogous to Datta et al. (2016b) for space-time covariance functions, may be an option worth exploring. Multiresolution NNGPs, where the residual from the NNGP approximation is modeled hierarchically (analogous to Katzfuss, 2017, for the predictive process) may also be promising in terms of full Bayesian inference at massive scales.

There remain other challenges in high-dimensional geostatistics. Here, we have considered geostatistical settings where we have very large numbers of locations and/or time-points, but restricted our discussion to univariate outcomes. In practice, we often observe a $q\times 1$ variate response $y(\ell)$ along with a set of explanatory variables $X(\ell)$, and a $q\times 1$ variate GP, $w(\ell)$, is used to capture the spatial patterns beyond the observed covariates. We seek to capture associations among the variables as well as the strength of spatiotemporal association for each outcome. One specific geostatistical problem in ecology that currently lacks a satisfying solution is a joint species distribution model, where we seek to model a large collection of species (say, order $10^{3}$) over a large collection of spatial sites (again, say, order $10^{3}$).

The linear model of coregionalization (LMC) proposed by Matheron (1982) is among the most general models for multivariate spatial data analysis. Here, the spatial behavior of the outcomes is assumed to arise from a linear combination of independent latent processes operating at different spatial scales (Chilés and Delfiner, 1999). The idea resembles latent factor analysis (FA) models for multivariate data analysis (e.g., Anderson, 2003) except that in the LMC the number of latent processes is usually taken to be the same as the number of outcomes. Then, a $q\times q$ covariance matrix has to be estimated for each spatial scale (see, e.g., Lark and Papritz, 2003; Castrignanó et al., 2005; Zhang, 2007), where $q$ is the number of outcomes. When $q$ is large (e.g., $q\geq 5$ and 300 spatial locations), obtaining such estimates is expensive. Schmidt and Gelfand (2003) and Gelfand et al. (2004) associate only a $q\times q$ triangular matrix with the latent processes. However, high dimensional outcomes are still computationally prohibitive for these models.

Spatial factor models (see, e.g., Lopes and West, 2004; Lopes et al., 2008; Wang and Wall, 2003) have been used to handle high dimensional outcomes but with modest number of spatial locations. Dimension reduction is needed in two aspects: (i) the length of the vector of outcomes, and (ii) the very large number of spatial locations. Latent variable (factor) models are usually used to address the former, while low-rank spatial processes offer a rich and flexible modeling option for dealing with a large number of locations. Ren and Banerjee (2013) have exploited these two ideas to propose a class of hierarchical low-rank spatial factor models and also explored stochastic selection of the latent factors without resorting to complex computational strategies (such as reversible jump algorithms) by utilizing certain identifiability characterizations for the spatial factor model. Their model was designed to capture associations among the variables as well as the strength of spatial association for each variable. In addition, they reckoned with the common setting where not all the variables have been observed over all locations, which leads to spatial misalignment. The fully Bayesian approach effectively deals with spatial misalignment, but is likely to suffer from the limited ability of low-rank models to scale to a very large number of locations. Promising ideas include using the multiresolution predictive process or the NNGP as a prior on the spatial factors.

Computational developments with regard to Markov chain Monte Carlo (MCMC) algorithms (see, e.g., Robert and Casella, 2004) have contributed enormously to the dissemination of Bayesian hierarchical models in a wide array of disciplines. Spatial modeling is no exception. However, the challenges for automated implementation of geostatistical model fitting and inference are substantial. First, expensive matrix computations are required that can become prohibitive with large datasets. Second, routines to fit unmarginalized models are less suited for direct updating using a Gibbs sampler and result in slower convergence of the chains. Third, investigators often encounter multivariate spatial datasets with several spatially dependent outcomes, whose analysis requires multivariate spatial models that involve demanding matrix computations. These issues have, however, started to wane with the delivery of relatively simpler software packages in the R statistical computing environment via the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org). Several packages that automate Bayesian methods for point-referenced data and diagnose convergence of MCMC algorithms are easily available from CRAN. Packages that fit Bayesian models include geoR, geoRglm, spTimer, spBayes, spate, and ramps.

In terms of the hierarchical geostatistical models presented in this article, spBayes offers users a suite of Bayesian hierarchical models for Gaussian and non-Gaussian univariate and multivariate spatial data as well as dynamic Bayesian spatio-temporal models. It focuses upon performance issues for full Bayesian inference: improving sampler convergence rate and efficiency using a collapsed Gibbs sampler, decreasing sampler run-time by avoiding expensive matrix computations, and increasing scalability to large datasets by implementing predictive process models. Beyond these general computational improvements for existing models, it analyzes data indexed both in space and time using a class of dynamic spatiotemporal models, and their predictive process counterparts, for settings where space is viewed as continuous and time is taken as discrete. Finally, we have modeling environments such as Nimble (de Valpine et al., 2017), which gives users enormous flexibility to choose algorithms for fitting their models, and Stan (Carpenter et al., 2017), which estimates Bayesian hierarchical models using Hamiltonian dynamics. The NNGP and the predictive process can also be coded in Nimble and Stan fairly easily.

Acknowledgments

The author wishes to thank Professor Bruno Sansó and two anonymous reviewers for very constructive and insightful feedback. In addition, the author also wishes to thank Dr. Abhirup Datta, Dr. Andrew O. Finley and Ms. Lu Zhang for useful discussions. The work of the author was supported in part by NSF DMS-1513654 and NSF IIS-1562303.

Bibliography


Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis. New York, NY: Wiley, third edition.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman & Hall/CRC, second edition.
Banerjee, S., Finley, A. O., Waldmann, P., and Ericcson, T. (2010). “Hierarchical Spatial Process Models for Multiple Traits in Large Genetic Trials.” Journal of the American Statistical Association, 105: 506–521.
Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). “Gaussian Predictive Process Models for Large Spatial Datasets.” Journal of the Royal Statistical Society, Series B, 70: 825–848.
Barry, R. and Ver Hoef, J. (1996). “Blackbox kriging: Spatial prediction without specifying variogram models.” Journal of Agricultural, Biological and Environmental Statistics, 1: 297–322.
Bevilacqua, M. and Gaetan, C. (2014). “Comparing Composite Likelihood Methods Based on Pairs for Spatial Gaussian Random Fields.” Statistics and Computing, 1–16.
Bishop, C. (2006). Pattern Recognition and Machine Learning. New York, NY: Springer-Verlag.
Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. Boca Raton, FL: CRC Press.