
TL;DR
This paper reviews two scalable Bayesian spatiotemporal modeling approaches—low-rank processes and Nearest-Neighbor Gaussian Processes—that enable efficient analysis of large geostatistical datasets by reducing computational complexity.
Contribution
It introduces and compares two novel methods for constructing scalable Bayesian spatiotemporal priors suitable for large datasets, addressing computational challenges in hierarchical models.
Findings
Both methods achieve linear computational complexity in the number of locations.
The approaches facilitate full Bayesian inference for large-scale spatiotemporal data.
Comparison provides insights into their methodological differences and applications.
Abstract
With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatiotemporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models unfeasible for large data sets. This article offers a focused review of two methods for constructing well-defined highly scalable spatiotemporal stochastic processes. Both these processes can be used as "priors" for…
| RMSPE | ||||
|---|---|---|---|---|
| True | 1 | 1 | 1 | |
| PP | 1.37 (0.29,2.61) | 1.37 (0.65,2.37) | 1.18 (1.07,1.23) | 1.21 |
| MPP | 1.36 (0.51,2.39) | 1.04 (0.52,1.92) | 0.94 (0.68.1,14) | 1.20 |
| PP | 1.36 (0.52,2.32) | 1.39 (0.76,2.44) | 1.09 (0.96, 1.24) | 1.17 |
| MPP | 1.33 (0.50,2.24) | 1.14 (0.64,1.78) | 0.93 (0.76,1.22) | 1.17 |
| PP | 1.31 (0.23, 2.55) | 1.12 (0.85,1.58) | 0.99 (0.85,1.16) | 1.17 |
| MPP | 1.31 (0.23,2.63) | 1.04 (0.76,1.49) | 0.98 (0.87,1.21) | 1.17 |
| NNGP from different topological orders | |||||
|---|---|---|---|---|---|
| True | Sorted coord(x+y) | MMD | Sorted x | Sorted y | |
| 1 | 0.79 (0.69, 1.04 ) | 0.80 (0.69, 1.02) | 0.80 (0.70, 1.05) | 0.83 (0.69, 1.08) | |
| 0.45 | 0.45 (0.44, 0.46) | 0.45 (0.44, 0.47) | 0.45 (0.44, 0.46 ) | 0.45 (0.44, 0.47) | |
| 5 | 8.11 (4.42, 11.10) | 7.63 (4.58, 10.97) | 8.01 (4.26, 11.18) | 7.12 (4.06, 11.03) | |
| KL-D | – | 24.04022 | 13.88847 | 22.30667 | 21.59174 |
| RMSPE | – | 0.5278996 | 0.5278198 | 0.527912 | 0.527807 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
High-dimensional Bayesian Geostatistics
Sudipto Banerjeelabel=e1][email protected] [ UCLA Department of Biostatistics
650 Charles E. Young Drive South
Los Angeles, CA 90095-1772.
Abstract
With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatiotemporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models unfeasible for large data sets. This article offers a focused review of two methods for constructing well-defined highly scalable spatiotemporal stochastic processes. Both these processes can be used as “priors” for spatiotemporal random fields. The first approach constructs a low-rank process operating on a lower-dimensional subspace. The second approach constructs a Nearest-Neighbor Gaussian Process (NNGP) that ensures sparse precision matrices for its finite realizations. Both processes can be exploited as a scalable prior embedded within a rich hierarchical modeling framework to deliver full Bayesian inference. These approaches can be described as model-based solutions for big spatiotemporal datasets. The models ensure that the algorithmic complexity has floating point operations (flops), where the number of spatial locations (per iteration). We compare these methods and provide some insight into their methodological underpinnings.
Bayesian statistics,
Gaussian process,
Low rank Gaussian process,
Nearest Neighbor Gaussian process (NNGP),
Predictive process,
Sparse Gaussian process,
Spatiotemporal statistics,
keywords:
\startlocaldefs\endlocaldefs
1 Introduction
The increased availability of inexpensive, high speed computing has enabled the collection of massive amounts of spatial and spatiotemporal datasets across many fields. This has resulted in widespread deployment of sophisticated Geographic Information Systems (GIS) and related software, and the ability to investigate challenging inferential questions related to geographically-referenced data. See, for example, the books by Cressie (1993), Stein (1999), Moller and Waagepetersen (2003), Schabenberger and Gotway (2004), Gelfand et al. (2010), Cressie and Wikle (2011) and Banerjee et al. (2014) for a variety of statistical methods and applications.
This article will focus only on point-referenced data, which refers to data referenced by points with coordinates (latitude-longitude, Easting-Northing etc.). Modeling typically proceeds from a spatial or spatiotemporal process that introduces dependence among any finite collection of random variables from an underlying random field. For our purposes, we will consider the stochastic process as an uncountable set of random variables, say , over a domain of interest , which is endowed with a probability law specifying the joint distribution for any finite sample from that set. For example, in spatial modeling is often assumed to be a subset of points in the Euclidean space (usually or ) or, perhaps, a set of geographic coordinates over a sphere or ellipsoid. In spatiotemporal settings , where is the spatial region, is the time domain and is a space-time coordinate with spatial location and time point (see, e.g., Gneiting and Guttorp, 2010, for details).
Such processes are specified with a covariance function that gives the covariance between and for any two points and in . For any finite collection in , let be the realizations of the process over . Also, for two finite sets and containing and points in , respectively, we define the matrix , where the covariances are evaluated using . When or contains a single point, is a row or column vector, respectively. A valid spatiotemporal covariance function ensures that is positive definite for any finite set . In geostatistics, we usually deal with a fixed set of points and, if the context is clear, we write simply as . A popular specification assumes is a zero-centered Gaussian process written as , which implies that the vector is distributed as , where is the covariance matrix with -th element . Various characterizations and classes of valid spatial (and spatiotemporal) covariance functions can be found in Gneiting and Guttorp (2010), Cressie (1993), Stein (1999), Gelfand et al. (2010), Cressie and Wikle (2011) and Banerjee et al. (2014) and numerous references therein. The more common assumptions are of stationarity and isotropy. The former assumes that depends upon the coordinates only through their separation vector, while isotropy goes a step further and assumes the covariance is a function of the distance between them.
Spatial and spatiotemporal processes are conveniently embedded within Bayesian hierarchical models. The most common geostatistical setting assumes a response or dependent variable observed at a generic point along with a () vector of spatially referenced predictors . Model-based geostatistical data analysis customarily envisions a spatial regression model,
[TABLE]
where is the vector of slopes, and the residual from the regression is the sum of a spatial or spatiotemporal process, capturing spatial and/or temporal association, and an independent process, modeling measurement error or fine scale variation attributed to disturbances at distances smaller than the minimum observed separations in space and time. A Bayesian spatial model can now be constructed from (1) as
[TABLE]
where is the vector of observed outcomes, is the matrix of regressors with -th row and the noise covariance matrix represents measurement error or micro-scale variation and depends upon a set of variance parameters . A common specification is , where is called the “nugget.” The hierarchy is completed by assigning prior distributions to , and .
Bayesian inference can proceed by sampling from the joint posterior density in (2) using, for example, Markov chain Monte Carlo (MCMC) methods (see, e.g., Robert and Casella, 2004). A major computational bottleneck emerges from the size of in computing (2). Since is unknown, each iteration of the model fitting algorithm will involve decomposing or factorizing , which typically requires floating point operations (flops). Memory requirements are of the order . These become prohibitive for large values of when has no exploitable structure. Evidently, multivariate process settings, where is a vector of outcomes, exacerbate the computational burden by a factor of . For Gaussian likelihoods, one can integrate out the random effects from (2). This reduces the parameter space to , but one still needs to work with , which is again . These settings are referred to as “big-n” or “high-dimensional” problems in geostatistics and are widely encountered in environmental sciences today.
As modern data technologies are acquiring and exploiting massive amounts of spatiotemporal data, modeling and inference for large spatiotemporal datasets are receiving increased attention. In fact, it is impossible to provide a comprehensive review of all existing methods for geostatistical models for massive spatial data sets; Sun et al. (2011) offers an excellent review for a number of methods for high-dimensional geostatistics. The ideas at the core of fitting models for large spatial and spatiotemporal data concern effectively solving positive definite linear systems such as , where is a covariance matrix. Thus one can use probability models to build computationally efficient covariance matrices. One approach is to approximate or model with a covariance structure that can significantly reduce the computational burden. An alternative is to model itself with an exploitable structure so that the solution is available without computing the inverse. For full Bayesian inference, one also needs to ensure that the determinant of is available easily.
We remark that when inferring about stochastic processes, it is also possible to work in the spectral domain. This rich, and theoretically attractive, option has been advocated by Stein (1999) and Fuentes (2007) and completely avoids expensive matrix computations. The underlying idea is to transform to the space of frequencies, construct a periodogram (an estimate of the spectral density), and exploit the Whittle likelihood (see, e.g., Whittle, 1954; Guyon, 1995) in the spectral domain as an approximation to the data likelihood in the original domain. The Whittle likelihood requires no matrix inversion so, as a result, computation is very rapid. In principle, inversion back to the original space is straightforward. However, there are practical impediments. First, there is discretization to implement a fast Fourier transform whose performance can be tricky over large irregular domains. Predictive inference at arbitrary locations also will not be straightforward. Other issues include arbitrariness to the development of a periodogram. Empirical experience is employed to suggest how many low frequencies should be discarded. Also, there is concern regarding the performance of the Whittle likelihood as an approximation to the exact likelihood. While this approximation is reasonably well centered, it does an unsatisfactory job in the tails (thus leading to poor estimation of model variances). Lastly, modeling non-Gaussian first stages will entail unobservable random spatial effects, making the implementation impossible. In summary, use of the spectral domain with regard to handling large , while theoretically attractive, has limited applicability.
Broadly speaking, model-based approaches for large spatial datasets proceeds from either exploiting “low-rank” models or exploiting “sparsity”. The former attempts to construct Gaussian processes on a lower-dimensional subspace (see, e.g., Wikle and Cressie, 1999; Higdon, 2002a; Kammann and Wand, 2003; Quinoñero and Rasmussen, 2005; Stein, 2007; Gramacy and Lee, 2008; Stein, 2008; Cressie and Johannesson, 2008; Banerjee et al., 2008; Crainiceanu et al., 2008; Sansó et al., 2008; Finley et al., 2009a; Lemos and Sansó, 2009; Cressie et al., 2010) in spatial, spatiotemporal and more general Gaussian process regression settings. Sparse approaches include covariance tapering (see, e.g., Furrer et al., 2006; Kaufman et al., 2008; Du et al., 2009; Shaby and Ruppert, 2012) using compactly supported covariance functions. This is effective for parameter estimation and interpolation of the response (“kriging”), but it has not been fully evaluated for fully Bayesian inference on residual or latent processes. Introducing sparsity in is prevalent in approximating Gaussian process likelihoods using Markov random fields (e.g., Rue and Held, 2005), products of lower dimensional conditional distributions (Vecchia, 1988, 1992; Stein et al., 2004), or composite likelihoods (e.g., Bevilacqua and Gaetan, 2014; Eidsvik et al., 2014).
This article aims to provide a focused review of some massively scalable Bayesian hierarchical models for spatiotemporal data. The aim is not to provide a comprehensive review of all existing methods. Instead, we focus upon two fully model-based approaches that can be easily embedded within hierarchical models and deliver full Bayesian inference. These are low-rank processes and sparsity-inducing processes. Both these processes can be used as “priors” for spatiotemporal random fields. Here is a brief outline of the paper. Section 2 discusses a Bayesian hierarchical framework for low-rank models and their implementation. Section 3 discusses some recent developments in sparsity-inducing Gaussian processes, especially nearest-neighbor Gaussian processes, and their implementation. Finally, Section 4 provides a brief account of outstanding issues for future research.
2 Hierarchical low-rank models
A popular way of dealing with large spatial datasets is to devise models that bring about dimension reduction (Wikle and Cressie, 1999). A low rank or reduced rank specification is typically based upon a representation or approximation in terms of the realizations of some latent process over a smaller set of points, often referred to as knots. To be precise,
[TABLE]
where is a well-defined process and is a family of basis functions possibly depending upon some parameters . The collection of locations are the knots, and are vectors with components and , respectively. For any collection of points, the vector is represented as , where is with -th element . Irrespective of how big is, we now have to work with the (instead of ) ’s and the matrix . Since we anticipate , the consequential dimension reduction is evident and, since we will write the model in terms of the ’s (with the ’s being deterministic from the ’s, given ), the associated matrices we work with will be . Evidently, as defined in (3) spans only an -dimensional space. When , the joint distribution of is singular. However, we do create a valid stochastic process with covariance function
[TABLE]
where is the variance-covariance matrix (also depends upon parameter ) for . From (4), we see that, even if is stationary, the induced covariance function is not. If the ’s are Gaussian, then is a Gaussian process. Every choice of basis functions yields a process and there are too many choices to enumerate here. Wikle (2010) offers an excellent overview of low rank models.
Different families of spatial models emerge from different specifications for the process and the basis functions . In fact, (3) can be used to construct classes of rich and flexible processes. Furthermore, such constructions need not be restricted to low rank models. If dimension reduction is not a concern, then full rank models can be constructed by taking basis functions in (3). A very popular specification for is a white noise process so that , whereupon (4) simplifies to . A natural choice for the basis functions is a kernel function, say , which puts more weight on near . Variants of this form have been called “moving average” models and explored by Barry and Ver Hoef (1996), while the term “kernel convolution” has been used in a series of papers by Higdon and collaborators (Higdon, 1998; Higdon et al., 1999; Higdon, 2002b) to not only achieve dimension reduction, but also model nonstationary and multivariate spatial processes. The kernel (which induces a parametric covariance function) can depend upon parameters and might even be spatially varying (Higdon, 2002b; Paciorek and Schervish, 2006). Sansó et al. (2008) use discrete kernel convolutions of independent processes to construct two different class of computationally efficient spatiotemporal processes.
Some choices of basis functions can be more computationally efficient than others depending upon the specific application. For example, Cressie and Johannesson (2008) (also see Shi and Cressie (2007)) discuss “Fixed Rank Kriging” (FRK) by constructing using very flexible families of non-stationary covariance functions to carry out high-dimensional kriging, Cressie et al. (2010) extend FRK to spatiotemporal settings calling the procedure “Fixed Rank Filtering” (FRF), Katzfuss and Cressie (2012) provide efficient constructions for for massive spatiotemporal datasets, and Katzfuss (2013) uses spatial basis functions to capture medium to long range dependence and tapers the residual to capture fine scale dependence. Multiresolution basis functions (see, e.g., Nychka et al., 2002, 2015) have been shown to be effective in building computationally efficient nonstationary models. These papers amply demonstrate the versatility of low-rank approaches using different basis functions.
A different approach is to specify the as a spatial process model having a selected covariance function. This process is called the parent process and one can derive a low-rank process from the parent process. For example, one could use the Karhunen-Loeve (infinite) basis expansion for a Gaussian process (see, e.g., Rasmussen and Williams, 2005; Banerjee et al., 2014) and truncate it to a finite number of terms to obtain a low-rank process. Another example is to project the realizations of the parent process onto a lower-dimensional subspace, which yields the predictive process and its variants; see Section 2.2 for details.
The idea underlying low-rank dimension reduction is not dissimilar to Bayesian linear regression. For example, consider a simplified version of the hierarchical model in (2), where and the process parameters are fixed. A low rank version of (2) is obtained by replacing with , so the joint distribution is
[TABLE]
where is , is , and are positive definite matrices of sizes and , respectively, and is . The low rank specification is accommodated using and the prior on , while (usually diagonal) has the residual variance components. By computing the marginal covariance matrix in two ways (Lindley and Smith, 1972), one arrives at the well-known Sherman-Woodbury-Morrison formula
[TABLE]
The above formula reveals dimension reduction in terms of the marginal covariance matrix for . If is easily invertible (e.g., diagonal), then the inverse of an covariance matrix of the form can be computed efficiently using the right-hand-side which only involves inverses of matrices and . A companion formula for (6) is that for the determinant,
[TABLE]
which shows that the determinant of the matrix can be computed as a product of the determinants of two matrices and that of .
In practical Bayesian computations, however, it is less efficient to directly use the formulas in (6) and (7). Since both the inverse and the determinant are needed, it is more useful to compute the Cholesky decomposition of the covariance matrix. In fact, one can avoid (6) completely and resort to a common trick in hierarchical models (see, e.g., Gelman et al., 2013) and smoothed ANOVA (Hodges, 2013) that expresses (5) as the linear model
[TABLE]
and are matrix square roots of of and , respectively. For example, in practice is diagonal so is simply the square root of the diagonal elements of , while is the triangular (upper or lower) Cholesky factor of the matrix . The marginal density of after integrating out now corresponds to the linear model , where is the ordinary least-square estimate of . Such computations are easily conducted in statistical programming environments such as R by applying the chol function to obtain the Cholesky factor , a backsolve function to efficiently obtain in constructing (10), and an lm function to compute the least squares estimate of using the QR decomposition of the design matrix . We discuss implementation of low rank hierarchical models in a more general contexts in Section 2.3.
2.1 Biases in low-rank models
Irrespective of the precise specifications, low-rank models tend to underestimate uncertainty (since they are driven by a finite number of random variables), hence, overestimate the residual variance (i.e., the nugget). Put differently, this arises from systemic over-smoothing or model under-specification by the low-rank model when compared to the parent model. For example, if , where is the parent process and is a low-rank approximation, then ignoring the residual can result in loss of uncertainty and oversmoothing. In settings where the spatial signal is weak compared to the noise, such biases will be less pronounced. Also, it is conceivable that in certain specific case studies proper choices of basis functions (e.g., multiresolution basis functions) will be able to capture much of the spatial behavior and the effect of the bias will be mitigated. However, in general it will be preferable to develop models that will be able to compensate for the overestimation of the nugget.
This phenomenon, in fact, is not dissimilar to what is seen in linear regression models and is especially transparent from writing the parent likelihood and low-rank likelihood as mixed linear models. To elucidate, suppose, without much loss of generality, that is a set with points of which the first act as the knots. Let us write the Gaussian likelihood with the parent process as , where is the lower-triangular Cholesky factor of ( depends on , but we suppress this here) and is now an vector such that . Writing , where has columns, suppose we derive a low-rank model by truncating to only the first basis functions. The corresponding likelihood is , where is an vector whose components are independently and identically distributed variables. Customary linear model calculations reveal that the magnitude of the residual vector from the parent model is given by , while that from the low-rank model is given by , where denotes the orthogonal projector matrix onto the column space of any matrix . Using the fact that , which is a standard result in linear model theory, we find the excess residual variability in the low-rank likelihood is summarized by which can be substantial when is much smaller than .
In practical data analysis, the above phenomenon is usually manifested by an overestimation of the nugget variance as it absorbs the residual variation from the low-rank approximation. Consider the following simple experiment. We simulated a spatial dataset using the spatial regression model in (1) with fixed spatial locations, say , within the unit square, and setting , , , where with and . We then fit the low rank model (5) with , , and as the matrix with -th row , where is a set of knots, is the vector with -th element and is the inverse of the lower-triangular Cholesky factor of the matrix with elements . This emerges from using low-rank radial basis functions in (3); (see, e.g., Ruppert and Carroll, 2003). We fit such models increasing from to in steps of . Figure 1 presents the 95% posterior credible intervals for . Even with knots for a dataset with just spatial locations, the estimate of the nugget was significantly different from the true value of the parameter. This indicates that low rank processes may be unable to accurately estimate the nugget from the true process. Also, they will likely produce oversmoothed interpolated maps of the underlying spatial process and impair predictive performance. As one specific example, Table 4 in Banerjee et al. (2008) report less than optimal posterior predictive coverage from a predictive process model (see Section 2.2) with over 500 knots for a dataset comprising 15,000 locations.
Although this excess residual variability can be quantified as above (for any given value of the covariance parameters ), it is less clear how the low-rank likelihood could be modified to compensate for this oversmoothing without adding significantly to the computational burden. Matters are complicated by the fact that expressions for the excess variability will involve the unknown process parameters , which must be estimated. In fact, not all low-rank models deliver a straightforward quantification for this bias. For instance, low-rank models based upon kernel convolutions approximate with , where is some kernel function and , assumed to arise from a Brownian motion on . The difference does not, in general, render a closed form and may be difficult to approximate efficiently.
2.2 Predictive process models and variants
One particular class of low-rank processes have been especially useful in providing easy tractability to the residual process. Let and let be the vector of ’s over a set of knots. The usual spatial interpolant (that leads to “kriging”) at an arbitrary site is
[TABLE]
This single site interpolator, in fact, is a well-defined process with covariance function, . We refer to as the predictive process derived from the parent process . The realizations of are precisely the kriged predictions conditional upon a realization of over . The process is completely specified given the covariance function of the parent process and the set of knots, . The corresponding basis functions in (3) are given by . These methods have are referred to as subset of regressors in Gaussian process regressions for large data sets in machine learning (Quinoñero and Rasmussen, 2005; Rasmussen and Williams, 2005). Banerjee et al. (2008) coined the term predictive process (as the process could be derived from kriging equations) and developed classes of scalable Bayesian hierarchical spatial process models by replacing the parent process with its predictive process counterpart. An alternate derivation is available by truncating the Karhunen-Loeve (infinite) basis expansion for a Gaussian process to a finite number of terms and solving (approximately) the integral eigen-system equation for by an approximate linear system over the set of knots (see, e.g., Rasmussen and Williams, 2005; Sang and Huang, 2012; Banerjee et al., 2014).
Exploiting elementary properties of conditional expectations, we obtain
[TABLE]
which implies that and the variance of is simply the difference of the variances. For Gaussian processes, we get the following closed form for ,
[TABLE]
Therefore, , which we denote as .
Perhaps the simplest way to remedy the bias in the predictive process is to approximate the residual process with a heteroskedastic process . We construct a modified or bias-adjusted predictive process as
[TABLE]
where is independent of . It is easy to see that , so the variance of the two processes are the same. Also, the remedy is computationally efficient – adding an independent space-varying nugget does not incur substantial computational expense. Finley et al. (2009b) offer computational details for the modified predictive process, while Banerjee et al. (2010) show the effectiveness of the bias adjustment in mitigating the effect exhibited in Figure 1 and in estimating multiple variance components in the presence of different structured random effects.
We present a brief simulation example revealing the benefits of the modified predictive process. We generate 2000 locations within a square and then generate the outcomes from (1) using only an intercept as the regressor, an exponential covariance function with range parameter (i.e., such that the spatial correlation is at 50 distance units), scale for the spatial process, and with nugget variance . We then fit the predictive process and modified predictive process models derived from (1) using a hold out set of randomly selected sites, along with a separate set of regular lattices for the knots (, and ). Table 1 shows the posterior estimates and the square roots of MSPE based on the predictions for the hold-out data. The overestimation of by the predictive process is apparent and we also see how the modified predictive process is able to adjust for the . Not surprisingly, the RMSPE is essentially the same under either process model.
Further enhancements to the modified predictive process are possible. Since the modified predictive process adjusts only the variance, information in the covariance induced by the residual process is lost. One alternative is to use the so called “full scale approximation” proposed by Sang et al. (2011) and Sang and Huang (2012), where is approximated by a tapered process, say . The covariance function for is of the form , where is as in (13) and is a compactly supported covariance function that equals [math] beyond a distance (see, e.g., Furrer et al., 2006, for some practical choices.). This full scale approximation is also able to more effectively capture small scale dependence. Katzfuss (2013) extended some of these ideas by modeling the spatial error as a combination of a low-rank component designed to capture medium to long-range dependence and a tapered component to capture local dependence.
Perhaps the most promising use of the predictive process, at least in terms of scalability to massive spatial datasets, is the recent multiresolution approximation proposed by Katzfuss (2017). Instead of approximating the residual process in one step, the idea here is to partition the spatial domain recursively and construct a sequence of approximations. We start by partitioning the domain of interest into non-intersecting subregions, say , such that . We call the ’s level-1 subregions. We fix a set of knots in and write the parent process as , where is the predictive process as in (11) and is the residual Gaussian process with covariance function given by (13). At resolution 1, we replace with a block-independent process such that if and are not in the same subregion and is equal to (13) if and are in the same subregion.
At the second resolution, each is partitioned into a set of disjoint subregions . We call these the level-2 subregions and choose a set of knots within each. We approximate , where is the predictive process derived from using the knots in if and is the analogous block-independent approximation across the subregions within each . Thus, if and are not in the same level-2 subregion and will equal when and are in the same level-2 subregion. At resolution 3 we partition each of the level-2 subregions into level-3 subregions and continue the approximation of the residual process from the predictive process. At the end of resolutions, we arrive at the mult-resolution predictive process , which, by construction, is a valid Gaussian process. The computational complexity with the multi-resolution predictive process is , where is the number of resolutions and is the number of knots chosen within each subregion.
To summarize, we do not recommend the use of just a reduced/low rank model. To improve performance, it is necessary to approximate the residual process and, in this regard, the predictive process is especially attractive since the residual process is available explicitly.
2.3 Bayesian implementation for low-rank models
A very rich and flexible class of spatial and spatiotemporal models emerge from the hierarchical linear mixed model
[TABLE]
where is an vector of possibly irregularly located observations, is a known matrix of regressors (), and are families of and covariance matrices depending on unknown process parameters and , respectively, and is with . The low-rank models in (3) emerge when and is the matrix obtained by evaluating the basis functions. Proper prior distributions and for and , respectively, complete the hierarchical specification.
Bayesian inference proceeds, customarily, by sampling from (15) using Markov chain Monte Carlo (MCMC) methods. For faster convergence, we integrate out from the model and first sample from , where . Working directly with will be expensive. Usually is diagonal or sparse, so the expense is incurred from the matrix . Assuming that and are computationally inexpensive to construct for each and , requires flops. Using the Sherman-Woodbury-Morrison formula in (6) will avoid constructing or inverting any matrix. However, in practice it is better to not directly compute the right hand side of (6) as it involves some redundant matrix multiplications. Furthermore, we wish to obtain the determinant of cheaply. These are efficiently accomplished as outlined below.
The primary computational bottleneck lies in evaluating the multivariate Gaussian likelihood which is required for updating the parameters (e.g., using random-walk Metropolis or Hamiltonian Monte Carlo steps). We can accomplish this effectively using two functions: which computes the Cholesky factorization for any positive definite matrix , where is lower-triangular, and which solves the triangular system for a triangular (lower or upper) matrix . We first compute
[TABLE]
where is obtained by first computing , then the Cholesky factorization , and finally solve the triangular system . Having obtained , we compute , , , and obtain . The log-target density for is then computed as
[TABLE]
where ’s and ’s are the diagonal elements of and , respectively. The total number of flops required for evaluating the target is (since ) which is considerably cheaper than the flops that would have been required for the analogous computations in a full Gaussian process model. In practice, Gaussian proposal distributions are employed for the Metropolis algorithm and all parameters with positive support are transformed to their logarithmic scale. Therefore, the necessary Jacobian adjustments are made to (17) by adding some scalar quantities with negligible computational costs.
Starting with initial values for all parameters, each iteration of the MCMC executes the above calculations to provide a sample for . The regression parameter is then sampled from its full conditional distribution. Writing as in (16), the full conditional distribution for is , where and . These are efficiently computed as , and setting and . We then compute , where is a conformable vector of independent variables.
We repeat the above computations for each iteration of the MCMC algorithm using the current values of the process parameters in . The algorithm described above will produce, after convergence, posterior samples for . We then sample from the posterior distribution , where with and . For each drawn from we will need to draw a corresponding from . This will involve . Since the number of knots is usually fixed at a value much smaller than , obtaining is and not as expensive. However, it will involve the inverse of , which is computed using and can be numerically unstable for certain smoother covariance functions such as the Gaussian or the Matérn with large . A numerically more stable algorithm exploits the relation , where . For each sampled from , we compute , and . We generate an vector and set . Repeating this for each drawn from produces a sample of ’s from .
Finally, we seek predictive inference for at any arbitrary space-time coordinate . Given , we draw for every posterior sample of and . This yields the corresponding posterior predictive samples for and . Posterior predictive samples of the latent processes can also be easily computed as for each posterior sample of the and . Posterior predictive distributions at any of the observed ’s yield replicated data (see, e.g., Gelman et al., 2013) that can be used for model assessment and comparisons. Finley et al. (2015) provide more extensive implementation details for models such as (15) in the context of the spBayes package in R.
3 Sparsity-inducing nearest-neighbor Gaussian processes
Low-rank models have been, and continue to be, widely employed for analyzing spatial and spatiotemporal data. The algorithmic cost for fitting low-rank models typically decrease from to flops since . However, when is large, empirical investigations suggest that must be fairly large to adequately approximate the parent process and the flops become exorbitant. Furthermore, low-rank models can perform poorly depending upon the smoothness of the underlying process or when neighboring observations are strongly correlated and the spatial signal dominates the noise (Stein, 2014).
As an example, consider part of the simulation experiment presented in Datta et al. (2016a), where a spatial random field was generated over a unit square using a Gaussian process with fixed spatial process parameters over a set of locations. We then fit a full Gaussian process model and a predictive process model with knots. Figure 2 presents the results (see, e.g., Datta et al., 2016a, for details.) While the estimated random field from the full Gaussian process is almost indistinguishable from the true random field, the surface obtained from the predictive process with locations substantially oversmooths. This oversmoothing can be ameliorated by using a larger number of knots, but that adds to the computational burden.
Figure 2 serves to reinforce findings that low-rank models may be limited in their ability to produce accurate representation of the underlying process at massive scales. They will need a considerably larger number of basis functions to capture the features of the process and will require substantial computational resources for emulating results from a full GP. As the demands for analyzing large spatial datasets increase from the order of to locations, low-rank models may struggle to deliver acceptable inference. In this regard, enhancements such as the multi-resolution predictive process approximations referred to in Section 2.2 are highly promising.
An alternative is to develop full rank models that can exploit sparsity. Instead of deriving basis approximations for , one could achieve computational gains by modeling either its covariance function or its inverse as sparse. Covariance tapering does the former by modeling , where is a sparse covariance matrix formed from a compactly supported, or tapered, covariance function with tapering parameter and denotes the element wise (or Hadamard) product of two matrices. The Hadamard product of two positive definite matrices is again a positive definite matrix, so is positive definite. Furthermore, is sparse because a tapered covariance function is equal to [math] for all pairs of locations separated by a distance beyond a threshold . We refer the reader to Furrer et al. (2006), Kaufman et al. (2008) and Du et al. (2009) for further computational and theoretical details on covariance tapering. Covariance tapering is undoubtedly an attractive approach for constructing sparse covariance matrices, but its practical implementation for full Bayesian inference will generally require efficient sparse Cholesky decompositions, numerically stable determinant computations and, perhaps most importantly, effective memory management. These issues are yet to be tested for truly massive spatiotemporal datasets with or more.
Another way to exploit sparsity is to model the inverse of as a sparse matrix. For finite-dimensional distributions conditional and simultaneous autoregressive (CAR and SAR) models (see, e.g., Cressie, 1993; Banerjee et al., 2014, and references therein) adopt this approach for areally referenced datasets. More generally, Gaussian Markov random fields or GMRFs (see, e.g., Rue and Held, 2005) are widely used tools for constructing sparse precision matrices and have led to computational algorithms such as the Integrated Nested Laplace Approximation (INLA) developed by Rue et al. (2009). A subsequent article by Lindgren et al. (2011) show how Gaussian processes can be approximated by GMRFs using computationally efficient sparse representations. Thus, a Gaussian process model with a dense covariance function is approximated by a GMRF with a sparse precision matrix. The approach is very computationally efficient for certain classes of covariance functions generated by a certain class of stochastic partial differential equations (including the versatile Matérn class), but their inferential performance on unobservable spatial, spatiotemporal or multivariate Gaussian processes (perhaps specified through more general covariance or cross-covariance functions) embedded within Bayesian hierarchical models is yet to be assessed.
Rather than working with approximations to the process, one could also construct massively scalable sparsity-inducing Gaussian processes that can be conveniently embedded within Bayesian hierarchical models and deliver full Bayesian inference for random fields at arbitrary resolutions. Section 3.1 describes how sparsity is introduced in the precision matrices for graphical Gaussian models by exploiting the relationship between the Cholesky decomposition of a positive definite matrix and conditional independence. These sparse Gaussian models (i.e., normal distributions with sparse precision matrices) can be used prior models for a finite number of spatial random effects. Section 3.2 shows the construction of a process from these graphical Gaussian models. This process will be a Gaussian process whose finite-dimensional realizations will have sparse precision matrices. We call them Nearest Neighbor Gaussian Processes (NNGP). Finally, Section 3.3 outlines how the process can be embedded within hierarchical models and presents some brief simulation examples demonstrating certain aspects of inference from NNGP models.
3.1 Sparse Gaussian graphical models
Consider the hierarchical model (2) and, in particular, the expensive prior density . From the dense covariance matrix , we wish to obtain a covariance matrix such that is sparse and, importantly, its determinant is available cheaply. What would be an effective way of achieving this? One approach would be to consider modeling the Cholesky decomposition of the precision matrix so that it is sparse. For example, forcing some elements in the dense half of the triangular Cholesky factor to be zero will introduce sparsity in the precision matrix. To precisely set out which elements should be made zero in the Cholesky factor, we borrow some fundamental notions of sparsity from graphical (Gaussian) models.
The underlying idea is, in fact, ubiquitous in graphical models or Bayesian networks (see, e.g., Lauritzen, 1996; Bishop, 2006; Murphy, 2012). The joint distribution for a random vector can be looked upon as a directed acyclic graph (DAG) where each node is a random variable . We write the joint distribution as
[TABLE]
where is the empty set and for is the set of parent nodes with directed edges to . This model is specific to the ordering (sometimes called “topological ordering”) of the nodes. The DAG corresponding to this factorization is shown in Figure 3(a) for nodes. One can refer to this as the full graphical model since comprises all nodes preceding in the topological order. Shrinking from the set of all nodes preceding to a smaller subset of parent nodes yields a different, but still valid, joint distribution. In spatial settings, each of the nodes in the DAG have associated spatial coordinates. Thus, the parents for any node can be chosen to include a certain fixed number of “nearest neighbors”, say based upon their distance from node . For example, Figure 3(b) shows the DAG when some of the edges are deleted so as to retain at most nearest neighbors in the conditional probabilities. The resulting joint density is
[TABLE]
The above model posits that any node , given its parents, is conditionally independent of any other node that is neither its parent nor its child.
Applying the above notion to multivariate Gaussian densities evinces the connection between conditional independence in DAGs and sparsity. Consider an random vector distributed as . Writing as is equivalent to the following set of linear models,
[TABLE]
or, more compactly, simply , where is strictly lower-triangular with elements whenever and and is diagonal with diagonal entries and for .
From the structure of it is evident that is nonsingular and . The possibly nonzero elements of and are completely determined by the matrix . Let a[i,j], d[i,j] and K[i,j] denote the -th entries of , and , respectively. Note that d[1,1] = K[1,1] and the first row of is [math]. A pseudo-code to compute the remaining elements of and is:
[TABLE]
Here a[i+1,1:i] is the row vector comprising the possibly nonzero elements of the i+1-th row of , K[1:i,1:i] is the leading principal submatrix of , K[1:i, i] is the row vector formed by the first i elements in the i-th column of , K[i, 1:i] is the row vector formed by the first i elements in the i-th row of , solve(B,b) computes the solution for the linear system Bx = b, and dot(u,v) provides the inner product between vectors u and v. The determinant of is obtained with almost no additional cost: it is simply .
The above pseudocode provides a way to obtain the Cholesky decomposition of . If is the Cholesky decomposition, then . There is, however, no apparent gain to be had from the preceding computations since one will need to solve increasingly larger linear systems as the loop runs into higher values of i. Nevertheless, it immediately shows how to exploit sparsity if we set some of the elements in the lower triangular part of to be zero. For example, suppose we set at most m elements in each row of to be nonzero. Let N[i] be the set of indices such that . We can compute the nonzero elements of and the diagonal elements of much more efficiently as:
[TABLE]
In (28) we solve n-1 linear systems of size at most . This can be performed in flops, whereas the earlier pseudocode in (22) for the dense model required flops. These computations can be performed in parallel as each iteration of the loop is independent of the others.
The above discussion provides a very useful strategy for introducing sparsity in a precision matrix. Let and both be dense positive definite matrices. Suppose we use the pseudocode in (28) with to construct a sparse strictly lower-triangular matrix with no more than non-zero entries in each row, where is considerably smaller than , and the diagonal matrix . The resulting matrix is a covariance matrix whose inverse is sparse. Figure 4 presents a visual representation of the sparsity. While need not be sparse, the density is cheap to compute since is sparse and is calculated from (28). Therefore, one way to achieve massive scalability for models such as (2) is to assume that has prior instead of .
3.2 From distributions to processes
If we are interested in estimating the spatial or spatiotemporal process parameters from a finite collection of random variables, then we can use the approach in Section 3.1 with . In spatial settings, matters are especially convenient as we can delete the edges in the DAG based upon the distances among ’s. In fact, one can decide to retain at most of the nearest neighbors for each location and delete all remaining edges. This implies that the -th element of in Section 3.1 will be nonzero only if is one of the nearest neighbors of . In fact, this idea has been effectively used to construct composite likelihoods for Gaussian process models by Vecchia (1988) and Stein et al. (2004), while Stroud et al. (2017) exploits this idea to propose preconditioned conjugate gradient algorithms for Bayesian and maximum likelihood estimates on large incomplete lattices.
Localized Gaussian process regression based on few nearest neighbors has also been used to obtain fast kriging estimates. Emery (2009) provides fast updates for kriging equations after adding a new location to the input set. Iterative application of their algorithm yields a localized kriging estimate based on a small set of locations (including few nearest neighbors). The local estimate often provides an excellent approximation to the global kriging estimate which uses data observed at all the locations to predict at a new location. However, this assumes that the parameters associated with the mean and covariance of the GP are known or already estimated. Local Approximation GP, or LAGP (Gramacy and Apley, 2015; Gramacy and Haaland, 2016; Gramacy, 2016), extends this further to estimate the parameters at each new location, essentially providing a non-stationary local approximation to a Gaussian Process at every predictive location and can be used to interpolate or smooth the observed data.
If, however, posterior predictive inference is sought at arbitrary spatiotemporal resolutions, i.e., for the entire process , then the ideas in Section 3.1 need to be extended to process-based models. Recently, Datta et al. (2016a) proposed a Nearest Neighbor Gaussian Process (NNGP) for modeling large spatial data. NNGP is a well defined Gaussian Process over a domain and yields finite dimensional Gaussian densities with sparse precision matrices. This has been also extended to a dynamic NNGP with dynamic neighbor selection for massive spatiotemporal data (Datta et al., 2016b). The NNGP delivers massive scalability both in terms of parameter estimation and kriging. Unlike low rank processes, it does not oversmooth and accurately emulates the inference from full rank GPs.
We will construct the NNGP in two steps. First, we specify a multivariate Gaussian distribution over a fixed finite set points in , say , which we call the reference set. The reference set can be very large. It can be a fine grid of points over or one can simply take and let be the set of observed points in . We require that the inverse of the covariance matrix be sparse and computationally efficient. Therefore, we specify that , where is the vector with elements and is a covariance matrix such that is sparse. The matrix is constructed from a dense covariance matrix as described in Section 3.1. This provides a highly effective approximation (Vecchia, 1988; Stein et al., 2004) as below:
[TABLE]
where history sets so that is the empty set and for and we have much smaller neighbor sets for each in . We have legitimate probability models for any choice of ’s as long as . One easy specification is to define as the set of nearest neighbors of among the points in . Therefore,
[TABLE]
If denotes the limiting size of the neighbor sets , then has at most non-zero elements. Hence, the approximation in (29) produces a sparsity-inducing proper prior distribution for random effects over that closely approximates the realizations from a .
To construct the NNGP we extend the above model to arbitrary locations. We define neighbor sets for any as the set of nearest neighbors of in . Thus, and the process can be derived from or, equivalently, by writing
[TABLE]
where whenever , is a process independent of , for any two distinct points in , and
[TABLE]
Taking conditional expectations in (30) yields which implies that for each the nonzero ’s are obtained by solving an linear system. The above construction ensures that is a legitimate Gaussian process whose realizations over any finite collection of arbitrary points in will have a multivariate normal distribution with a sparse precision matrix. More formal developments and technical details in the spatial and spatiotemporal settings can be found in Datta et al. (2016a) and Datta et al. (2016b), respectively.
One point worth considering is the definition of “neighbors.” There is some flexibility here. In the spatial setting, the correlation functions usually decay with increasing inter-site distance, so the set of nearest neighbors based on the inter-site distances represents locations exhibiting highest correlation with the given locations. For example, on the plane one could simply use the Euclidean metric to construct neighbor sets, although Stein et al. (2004) recommends including a few points that are farther apart. The neighbor sets can be fixed before the model fitting exercise.
In spatiotemporal settings, matters are more complicated. Spatiotemporal covariances between two points typically depend on the spatial as well as the temporal lag between the points. Non-separable isotropic spatiotemporal covariance functions can be written as where and . This often precludes defining any universal distance function such that will be monotonic with respect to for all choices of . This makes it difficult to define universal nearest neighbors in spatiotemporal domains. To obviate this hurdle, Datta et al. (2016b) define “nearest neighbors” in a spatiotemporal domain using the spatiotemporal covariance function itself as a proxy for distance. This can work for arbitrary domains. For any three points , and , we say that is nearer to than to if . Subsequently, this definition of “distance” is used to find nearest neighbors for any location. Prediction at any arbitrary location is performed by sampling from the posterior predictive distribution. However, for every point , its neighbor set will now depend on and can change from iteration to iteration in the estimation algorithm. If were known, one could have simply evaluated the pairwise correlations between any point in and all points in its history set to obtain — the set of true nearest neighbors. In practice, however, is unknown and for every new value of in an iterative algorithm, we need to search for the neighbor sets within the history sets. Since the history sets are quite large, searching the entire space for nearest neighbors in each iteration will be computationally unfeasible. Datta et al. (2016b) offer some smart strategies for selecting spatiotemporal neighbors. They propose restricting the search for the neighbor sets to carefully constructed small subsets of the history sets. These small eligible sets are constructed in such a manner that, despite being much smaller than the history sets, they are guaranteed to contain the true nearest neighbor sets. This strategy works when we choose to be a perfect square and the original nonseparable covariance function satisfies natural monotonicity, i.e. is decreasing in for fixed and decreasing in for fixed . All Matèrn-based space-time separable covariances and many non-separable classes of covariance functions possess this property (Stein, 2013; Omidi and Mohammadzadeh, 2015).
3.3 Hierarchical NNGP models
We briefly turn to model fitting and estimation. For the approximation in (29) to be effective, the size of the reference set, , needs to be large enough to represent the spatial domain. However, this does not impede computations involving NNGP models because the storage and number of floating point operations are always linear in . The reference set can, in principle, be any finite set of locations in the study domain. A particularly convenient choice, in practice, is to simply take to be the set of observed locations in the dataset. Datta et al. (2016a) demonstrate through extensive simulation experiments and a real application that this simple choice seems to be very effective.
Since the NNGP is a proper Gaussian process, we can use it as a prior for the spatial random effects in any hierarchical model. We write , where is the covariance function for the NNGP (see Datta et al., 2016a, for a closed form expression). For example, with and the set of observed locations, one can build a scalable Bayesian hierarchical model exactly as with a usual spatial process, but assigning an NNGP to the spatial random effects. Here is a simple NNGP-based spatial model with a first stage exponential family model:
[TABLE]
where is an exponential family distribution with link function . Posterior sampling from (34) is customarily performed using Gibbs sampling with Metropolis steps. Computational benefits emerge from the fact that the full conditional distribution and since is an subset of . Prediction at any arbitrary location is performed by sampling from the posterior predictive distribution. For each draw of from , we draw a from and from , where is the vector of observed outcomes and is a vector of the nonzero ’s in (30).
Another, even simpler, example could be modeling a continuous outcome itself as an NNGP. Let the desired full GP specification be . We derive the NNGP from this and obtain
[TABLE]
The above model is extremely fast. The likelihood is of the form , where is sparse and and are obtained from (28) efficiently in parallel. The parameter space of interest is , which is much smaller than for (34) where the latent spatial process also was unknown. While (35) does not separate the residuals into a spatial process and a measurement error process, one can still include measurement error variance, or the nugget, in (35). Here, one would absorb the nugget into . For example, we could write the likelihood in (1) as , where , is a spatial correlation matrix and . These will also feature in the derived NNGP covariance matrix . We can predict the outcome at an arbitrary point by sampling from the posterior predictive distribution as follows: for each draw of from , we draw a from . Note, however, that there is no latent smooth process in (35) and inference on the latent spatial process is precluded.
Likelihood computations in NNGP models usually involve flops. One does not need to store matrices, only matrices which leads to storage . Substantial computational savings accrue because is usually very small. Datta et al. (2016a) demonstrate that fitting NNGP models to the simulated data in Figure 2 with number of neighbors as less as produce posterior estimates of the spatial surface indistinguishable from Figures 2(a) and 2(b). In fact, simulation experiments in Datta et al. (2016a) and Datta et al. (2016b) also affirm that can usually be taken to be very small compared to ; there seems to be no inferential advantage to taking to exceed 15, even for datasets with over spatial locations. For example, Figure 5 shows the posterior credible intervals for a series of 10 simulation experiments where the true effective range was fixed at values from 0.1 to 1.0 in increments of 0.1. Each dataset comprised points. Even with neighbors, the credible intervals for the effective spatial range from the NNGP model were very consistent with those from the full GP model. Datta et al. (2016a) present simulations using the Matérn and other covariance functions revealing very similar behavior.
Another important point to note is that is not invariant to the order in which we define (i.e., the topological order). Vecchia (1988) and Stein et al. (2004) both assert that the approximation in (29) is not sensitive to this ordering. This is corroborated by simulation experiments by Datta et al. (2016a), but a recent manuscript by Guinness (2016) has indicated sensitivity to the ordering in terms of model deviance. We conducted some preliminary investigations to investigate the effect of the topological order. In one simple experiment we generated data from the “true” model in (1) for spatial locations arranged over an grid. The parameter in (1) was set to [math], the covariance function was specified as , and with the true values of , and given in the second column of Table 2. Four different NNGP models corresponding to (35) with derived from and having elements , were fitted to the simulated data. Each of these models were constructed with nearest neighbors, but with different ordering of the points . These were performed according to the sum of the coordinates , a maximum-minimum distance (MMD) proposed by Guinness (2016), the coordinate, and the coordinate. Table 2 presents a comparison of these NNGP models. Irrespective of the ordering of the points, the inference with respect to parameter estimates and predictive performance is extremely robust and effectively indistinguishable from each other. However, the posterior mean of the Kullback-Leibler divergence of these models from the true generating model revealed that the metric proposed by Guinness (2016) is indeed less than the other three. Further explorations are currently being conducted to see how this behavior changes for more complex nonstationary models and in more general settings.
4 Discussion and future directions
The article has attempted to provide some insight into constructing highly scalable Bayesian hierarchical models for very large spatiotemporal datasets using low-rank and sparsity-inducing processes. Such models are increasingly being employed to answer complex scientific questions and analyze massive spatiotemporal datasets in the natural and environmental sciences. Any standard Bayesian estimation algorithm, such as Markov chain and Hamiltonian Monte Carlo (see, e.g., Robert and Casella, 2004; Brooks et al., 2011; Gelman et al., 2013; Neal, 2011; Hoffman and Gelman, 2014), Integrated Nested Laplace Approximations (Rue et al., 2009), and Variational Bayes (see, e.g., Bishop, 2006) can be used for fitting these models. The models ensure that the algorithmic complexity has floating point operations (flops), where the number of spatial locations (per iteration). Storage requirements are also linear in . Methods such as the multiresolution predictive process (Katzfuss, 2017) and the NNGP (Datta et al., 2016a) can scale up to datasets in the order of spatial and/or temporal points without sacrificing richness in the model.
While the NNGP certainly seem to have an edge in scalability over the more conventional low-rank or fixed rank models, it is premature to say whether its inferential performance will always excel over low rank of fixed rank models. For example, analyzing complex nonstationary random fields may pose challenges regarding construction of neighbor sets as simple distance-based definition of neighbors may prove to be inadequate. Multiresolution basis functions may be more adept at capturing nonstationary, but may struggle with massive datasets. Dynamic neighbor selection for nonstationary fields, where neighbors will be chosen based upon the covariance kernel itself, analogous to Datta et al. (2016b) for space-time covariance functions, may be an option worth exploring. Multiresolution NNGPs, where the residual from the NNGP approximation is modeled hierarchically (analogous to Katzfuss, 2017, for the predictive process) may also be promising in terms of full Bayesian inference at massive scales.
There remain other challenges in high-dimensional geostatistics. Here, we have considered geostatistical settings where we have very large numbers of locations and/or time-points, but restricted our discussion to univariate outcomes. In practice, we often observe a variate response along with a set of explanatory variables and variate GP, , is used to capture the spatial patterns beyond the observed covariates. We seek to capture associations among the variables as well as the strength of spatiotemporal association for each outcome. One specific geostatistical problem in ecology that currently lacks a satisfying solution is a joint species distribution model, where we seek to model a large collection of species (say, order ) over a large collection of spatial sites (again, say, order ).
The linear model of coregionalization (LMC) proposed by Matheron (1982) is among the most general models for multivariate spatial data analysis. Here, the spatial behavior of the outcomes is assumed to arise from a linear combination of the independent latent processes operating at different spatial scales (Chilés and Delfiner, 1999). The idea resembles latent factor analysis (FA) models for multivariate data analysis (e.g., Anderson, 2003) except that in the LMC the number of latent processes is usually taken to be the same as the number of outcomes. Then, an covariance matrix has to be estimated for each spatial scale (see, e.g., Lark and Papritz, 2003; Castrignanó et al., 2005; Zhang, 2007), where is the number of outcomes. When is large (e.g., and 300 spatial locations), obtaining such estimates is expensive. Schmidt and Gelfand (2003) and Gelfand et al. (2004) associate only a triangular matrix with the latent processes. However, high dimensional outcomes are still computationally prohibitive for these models.
Spatial factor models (see, e.g., Lopes and West, 2004; Lopes et al., 2008; Wang and Wall, 2003) have been used to handle high dimensional outcomes but with modest number of spatial locations. Dimension reduction is needed in two aspects: (i) the length of the vector of outcomes, and (ii) the very large number of spatial locations. Latent variable (factor) models are usually used to address the former, while low-rank spatial processes offer a rich and flexible modeling option for dealing with a large number of locations. Ren and Banerjee (2013) have exploited these two ideas to propose a class of hierarchical low-rank spatial factor models and also explored stochastic selection of the latent factors without resorting to complex computational strategies (such as reversible jump algorithms) by utilizing certain identifiability characterizations for the spatial factor model. Their model was designed to capture associations among the variables as well as the strength of spatial association for each variable. In addition, they reckoned with the common setting where not all the variables have been observed over all locations, which leads to spatial misalignment. The fully Bayesian approach effectively deals with spatial misalignment, but is likely to suffer from the limited ability of low-rank models to scale to a very large number of locations. Promising ideas include using the multiresolution predictive process or the NNGP as a prior on the spatial factors.
Computational developments with regard to Markov chain Monte Carlo (MCMC) algorithms (see, e.g., Robert and Casella, 2004) have contributed enormously to the dissemination of Bayesian hierarchical models in a wide array of disciplines. Spatial modeling is no exception. However, the challenges for automated implementation of geostatistical model fitting and inference are substantial. First, expensive matrix computations are required that can become prohibitive with large datasets. Second, routines to fit unmarginalized models are less suited for direct updating using a Gibbs sampler and result in slower convergence of the chains. Third, investigators often encounter multivariate spatial datasets with several spatially dependent outcomes, whose analysis requires multivariate spatial models that involve demanding matrix computations. These issues have, however, started to wane with the delivery of relatively simpler software packages in the R statistical computing environment via the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org). Several packages that automate Bayesian methods for point-referenced data and diagnose convergence of MCMC algorithms are easily available from CRAN. Packages that fit Bayesian models include geoR, geoRglm, spTimer, spBayes, spate, and ramps.
In terms of the hierarchical geostatistical models presented in this article, spBayes offers users a suite of Bayesian hierarchical models for Gaussian and non-Gaussian univariate and multivariate spatial data as well as dynamic Bayesian spatio-temporal models. It focuses upon performance issues for full Bayesian inference, sampler convergence rate and efficiency using a collapsed Gibbs sampler, decreasing sampler run-time by avoiding expensive matrix computations, and increased scalability to large datasets by implementing predictive process models. Beyond these general computational improvements for existing models, it analyzes data indexed both in space and time using a class of dynamic spatiotemporal models, and their predictive process counterparts, for settings where space is viewed as continuous and time is taken as discrete. Finally, we have modeling environments such as Nimble (de Valpine et al., 2017) that gives users enormous flexibility to choose algorithms for fitting their models, and Stan (Carpenter et al., 2017) that estimates Bayesian hierarchical models using Hamiltonian dynamics. The NNGP and the predictive process can be also coded in Nimble and Stan fairly easily.
Acknowledgments
The author wishes to thank Professor Bruno Sansó and two anonymous reviewers for very constructive and insightful feedback. In addition, the author also wishes to thank Dr. Abhirup Datta, Dr. Andrew O. Finley and Ms. Lu Zhang for useful discussions. The work of the author was supported in part by NSF DMS-1513654 and NSF IIS-1562303.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Anderson (2003) Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis . New York, NY: Wiley, third edition.
- 2Banerjee et al. (2014) Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data . Boca Raton, FL: Chapman & Hall/CRC, second edition.
- 3Banerjee et al. (2010) Banerjee, S., Finley, A. O., Waldmann, P., and Ericcson, T. (2010). “Hierarchical Spatial Process Models for Multiple Traits in Large Genetic Trials.” Journal of the American Statistical Association , 105: 506–521.
- 4Banerjee et al. (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). “Gaussian Predictive Process Models for Large Spatial Datasets.” Journal of the Royal Statistical Society, Series B , 70: 825–848.
- 5Barry and Ver Hoef (1996) Barry, R. and Ver Hoef, J. (1996). “Blackbox kriging: Spatial prediction without specifying variogram models.” Journal of Agricultural, Biological and Environmental Statistics , 1: 297–322.
- 6Bevilacqua and Gaetan (2014) Bevilacqua, M. and Gaetan, C. (2014). “Comparing Composite Likelihood Methods Based on Pairs for Spatial Gaussian Random Fields.” Statistics and Computing , 1–16.
- 7Bishop (2006) Bishop, C. (2006). Pattern Recognition and Machine Learning . New York, NY: Springer-Verlag.
- 8Brooks et al. (2011) Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo . Boca Raton, FL: CRC Press.
