Vecchia-Laplace approximations of generalized Gaussian processes for big   non-Gaussian spatial data

Daniel Zilber; Matthias Katzfuss

arXiv:1906.07828·stat.ME·December 22, 2020·Comput. Stat. Data Anal.

Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

Daniel Zilber, Matthias Katzfuss

PDF

2 Repos

TL;DR

This paper introduces a scalable and accurate approximation method for generalized Gaussian processes that efficiently handles large non-Gaussian spatial datasets by combining Laplace and Vecchia approximations.

Contribution

It proposes a novel Vecchia-Laplace approximation for GGPs, enabling scalable inference for large non-Gaussian spatial data sets.

Findings

01

The method is computationally efficient for large datasets.

02

It provides accurate inference compared to existing methods.

03

Implemented in an accessible R package.

Abstract

Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. We propose a Vecchia-Laplace approximation for GGPs, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in a simple, general, scalable, and accurate methodology. We provide numerical studies and comparisons on simulated and real spatial data. Our methods are implemented in a freely available R package.

Tables3

Table 1. Table 1 : Examples of popular likelihoods, together with the Gaussian pseudo-data and pseudo-variances implied by the Laplace approximation. The non-canonical logarithmic link function is used for the Gamma likelihood to ensure that the second parameter, a e − y 𝑎 superscript 𝑒 𝑦 ae^{-y} , is positive.

distribution	likelihood $g (z \| y)$	pseudo-data $t_{y}$	pseudo-variance $d (y)$
Gaussian	$𝒩 (y, τ^{2})$	$z$	$τ^{2}$
Bernoulli	$ℬ ({logit}^{- 1} (y))$	$y + \frac{{(1 + e^{y})}^{2}}{e^{y}} (z - \frac{e^{y}}{1 + e^{y}})$	$(1 + e^{- y}) (1 + e^{y})$
Poisson	$𝒫 (e^{y})$	$y + e^{- y} (z - e^{y})$	$e^{- y}$
Gamma	$𝒢 (a, a e^{- y})$	$y + (1 - z^{- 1} e^{y})$	$a z e^{- y}$

Table 2. Table 2 : For 100 simulated Poisson datasets at n = 625 𝑛 625 n=625 locations in the unit square, RMSE for parameter estimates based on different approximations to the integrated likelihood. Both range and smoothness parameters were bounded to the interval [ 0.001 , 20 ] 0.001 20 [0.001,20] , but LowRank estimation still failed repeatedly.

	Range			Smoothness
	$m = 10$	$m = 20$	$m = 40$	$m = 10$	$m$ =20	$m$ =40
LowRank	0.107	0.407	0.098	9.17	$8.40$	$12.60$
VL	0.293	0.040	0.023	1.01	$0.78$	$0.47$
Laplace			0.023			$0.51$

Table 3. Table 3 : For the MODIS data, comparison of prediction scores (lower is better) between VL and LowRank

Method	MSE	CRPS
VL	0.0149	0.144
LowRank	0.0528	0.170
Ratio	3.54 $\times$	1.18 $\times$

Equations45

p (y ∣ z) = \frac{N _{n} ( y ∣ μ , K ) \prod _{i = 1}^{n} g _{i} ( z _{i} ∣ y _{i} )}{p ( z )},

p (y ∣ z) = \frac{N _{n} ( y ∣ μ , K ) \prod _{i = 1}^{n} g _{i} ( z _{i} ∣ y _{i} )}{p ( z )},

\textstyle u_{i}(y_{i})=\frac{\partial}{\partial y_{i}}\log g_{i}(z_{i}|y_{i})\qquad\text{and}\qquad d_{i}(y_{i})=-\big{(}\frac{\partial^{2}}{\partial y_{i}^{2}}\log g_{i}(z_{i}|y_{i})\big{)}^{-1},\qquad i=1,\ldots,n.

\textstyle u_{i}(y_{i})=\frac{\partial}{\partial y_{i}}\log g_{i}(z_{i}|y_{i})\qquad\text{and}\qquad d_{i}(y_{i})=-\big{(}\frac{\partial^{2}}{\partial y_{i}^{2}}\log g_{i}(z_{i}|y_{i})\big{)}^{-1},\qquad i=1,\ldots,n.

p_{L} (y ∣ z) = N_{n} (y ∣ α, W_{α}^{- 1}) .

p_{L} (y ∣ z) = N_{n} (y ∣ α, W_{α}^{- 1}) .

\textstyle\mathbf{h}(\mathbf{y})=\mathbf{y}-\big{(}\frac{\partial^{2}}{\partial\mathbf{y}\partial\mathbf{y}^{\prime}}\log p(\mathbf{y}|\mathbf{z})\big{)}^{-1}\big{(}\frac{\partial}{\partial\mathbf{y}}\log p(\mathbf{y}|\mathbf{z})\big{)}.

\textstyle\mathbf{h}(\mathbf{y})=\mathbf{y}-\big{(}\frac{\partial^{2}}{\partial\mathbf{y}\partial\mathbf{y}^{\prime}}\log p(\mathbf{y}|\mathbf{z})\big{)}^{-1}\big{(}\frac{\partial}{\partial\mathbf{y}}\log p(\mathbf{y}|\mathbf{z})\big{)}.

h (y) = μ + W_{y}^{- 1} D_{y}^{- 1} (t_{y} - μ) = E (y ∣ t_{y}),

h (y) = μ + W_{y}^{- 1} D_{y}^{- 1} (t_{y} - μ) = E (y ∣ t_{y}),

p_{L} (z ∣ y) = p (t_{α} ∣ y) = N_{n} (t_{α} ∣ y, D_{α}),

p_{L} (z ∣ y) = p (t_{α} ∣ y) = N_{n} (t_{α} ∣ y, D_{α}),

p_{L} (y ∣ z) = p (y ∣ t_{α}) = N_{n} (y ∣ α, W_{α}^{- 1}) .

p_{L} (y ∣ z) = p (y ∣ t_{α}) = N_{n} (y ∣ α, W_{α}^{- 1}) .

p (x) = \prod_{i = 1}^{2 n} p (x_{i} ∣ x_{c (i)}),

p (x) = \prod_{i = 1}^{2 n} p (x_{i} ∣ x_{c (i)}),

p_{I W} (x) = i = 1 \prod n p (t_{i} ∣ y_{i}) p (y_{i} ∣ y_{q_{y} (i)}, t_{q_{t} (i)}) .

p_{I W} (x) = i = 1 \prod n p (t_{i} ∣ y_{i}) p (y_{i} ∣ y_{q_{y} (i)}, t_{q_{t} (i)}) .

p_{R F} (x) = i = 1 \prod n p (t_{i}) p (y_{i} ∣ y_{q_{y} (i)}, t_{q_{t} (i)}) .

p_{R F} (x) = i = 1 \prod n p (t_{i}) p (y_{i} ∣ y_{q_{y} (i)}, t_{q_{t} (i)}) .

p_{V L} (y ∣ z) = N_{n} (y ∣ α_{V}, W_{V}^{- 1}) .

p_{V L} (y ∣ z) = N_{n} (y ∣ α_{V}, W_{V}^{- 1}) .

L (θ) = p (z ∣ θ) = \int p (z ∣ y, θ) p (y ∣ θ) d y .

L (θ) = p (z ∣ θ) = \int p (z ∣ y, θ) p (y ∣ θ) d y .

L_{L} (θ) = p_{L} (z) = \frac{p ( y , t )}{p ( y ∣ t )} \cdot \frac{p ( z ∣ y )}{p ( t ∣ y )} = p (t) \cdot \frac{p ( z ∣ y )}{p ( t ∣ y )},

L_{L} (θ) = p_{L} (z) = \frac{p ( y , t )}{p ( y ∣ t )} \cdot \frac{p ( z ∣ y )}{p ( t ∣ y )} = p (t) \cdot \frac{p ( z ∣ y )}{p ( t ∣ y )},

- 2 lo g p_{I W} (t) = - 2 \sum_{i} lo g U_{ii} + 2 \sum_{i} lo g V_{ii} + \tilde{t}^{'} \tilde{t} - \overset{˘}{t}^{'} \overset{˘}{t} + n lo g (2 π),

- 2 lo g p_{I W} (t) = - 2 \sum_{i} lo g U_{ii} + 2 \sum_{i} lo g V_{ii} + \tilde{t}^{'} \tilde{t} - \overset{˘}{t}^{'} \overset{˘}{t} + n lo g (2 π),

L_{V L} (θ) = p_{V L} (z ∣ θ) = p_{I W} (t) i = 1 \prod n \frac{g _{i} ( z _{i} ∣ y _{i} )}{N ( t _{i} ∣ y _{i} , d _{i} )} .

L_{V L} (θ) = p_{V L} (z ∣ θ) = p_{I W} (t) i = 1 \prod n \frac{g _{i} ( z _{i} ∣ y _{i} )}{N ( t _{i} ∣ y _{i} , d _{i} )} .

μ (A_{i}) = \int_{A_{i}} λ (s) d s \approx ∣ A_{i} ∣ λ (a_{i}) = ∣ A_{i} ∣ e^{y (a_{i})} .

μ (A_{i}) = \int_{A_{i}} λ (s) d s \approx ∣ A_{i} ∣ λ (a_{i}) = ∣ A_{i} ∣ e^{y (a_{i})} .

z (s_{i}) ∣ y (s_{i}) \sim in d . G (a, a e^{- y (s_{i})}), y (\cdot) \sim N (μ, K),

z (s_{i}) ∣ y (s_{i}) \sim in d . G (a, a e^{- y (s_{i})}), y (\cdot) \sim N (μ, K),

\textstyle\mathbf{h}(\mathbf{y})=\mathbf{y}-\big{(}\frac{\partial^{2}}{\partial\mathbf{y}\mathbf{y}^{\prime}}\log p(\mathbf{y}|\mathbf{z})\big{)}^{-1}\big{(}\frac{\partial}{\partial\mathbf{y}}\log p(\mathbf{y}|\mathbf{z})\big{)}.

\textstyle\mathbf{h}(\mathbf{y})=\mathbf{y}-\big{(}\frac{\partial^{2}}{\partial\mathbf{y}\mathbf{y}^{\prime}}\log p(\mathbf{y}|\mathbf{z})\big{)}^{-1}\big{(}\frac{\partial}{\partial\mathbf{y}}\log p(\mathbf{y}|\mathbf{z})\big{)}.

h (y)

h (y)

\displaystyle=\mathbf{W}^{-1}\big{(}(\mathbf{K}^{-1}+\mathbf{D}^{-1})\mathbf{y}-\mathbf{K}^{-1}\mathbf{y}+(\mathbf{K}^{-1}\bm{\mu}+\mathbf{D}^{-1}\bm{\mu})-\mathbf{D}^{-1}\bm{\mu}+\mathbf{D}^{-1}\mathbf{D}\mathbf{u}\big{)}

\displaystyle=\bm{\mu}+\mathbf{W}^{-1}\big{(}\mathbf{D}^{-1}(\mathbf{y}+\mathbf{D}\mathbf{u}-\bm{\mu})\big{)}

= μ + W^{- 1} D^{- 1} (t - μ),

E (y ∣ t) = (K^{- 1} + D^{- 1})^{- 1} (K^{- 1} μ + D^{- 1} t) = μ + W^{- 1} D^{- 1} (t - μ) .

E (y ∣ t) = (K^{- 1} + D^{- 1})^{- 1} (K^{- 1} μ + D^{- 1} t) = μ + W^{- 1} D^{- 1} (t - μ) .

U_{j i} = ⎩ ⎨ ⎧ r_{i}^{- 1/2}, - b_{i}^{(j)} r_{i}^{- 1/2}, 0, i = j, j \in c (i), otherwise,

U_{j i} = ⎩ ⎨ ⎧ r_{i}^{- 1/2}, - b_{i}^{(j)} r_{i}^{- 1/2}, 0, i = j, j \in c (i), otherwise,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

Daniel Zilber Department of Statistics, Texas A&M University

Matthias Katzfuss11footnotemark: 1 Corresponding author: [email protected]

Abstract

Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. We propose a Vecchia-Laplace approximation for GGPs, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in simple, general, scalable, and accurate methodology. We provide numerical studies and comparisons on simulated and real spatial data. Our methods are implemented in a freely available R package.

Keywords: Exponential family; Geostatistics; Kriging; Nearest Neighbor; Sparse inverse Cholesky; Spatial Generalized Linear Mixed Model

1 Introduction

Dependent non-Gaussian data are ubiquitous in time series, geospatial applications, and more generally in nonparametric regression and classification. A popular model for such data is obtained by combining a latent Gaussian process (GP) with conditionally independent, potentially non-Gaussian likelihoods from the exponential family. This is traditionally referred to as a spatial generalized linear mixed model (SGLMM) in the spatial statistics literature (Diggle et al., , 1998), but the same model has more recently also been referred to as a generalized GP (GGP; e.g., Chan and Dong, , 2011); we will use the latter, more concise term throughout. GGPs are highly flexible, interpretable, and allow for natural, probabilistic uncertainty quantification. However, inference for GGPs can be analytically intractable, and large datasets pose additional computational challenges due to the inversion of the GP covariance matrix.

Popular techniques to numerically perform the intractable marginalization necessary for inference are, in order of increasing speed: Markov chain Monte Carlo (MCMC), expectation propagation, variational methods, and Laplace approximations. See Shang and Chan, (2013) for a recent review of deterministic techniques and Filippone and Girolami, (2014) for a comparison of MCMC and expectation propagation. Rue et al., (2009) argues that variational methods and expectation propagation suffer from underestimated and overestimated posterior variances, respectively. Here, we consider the Laplace approximation (e.g., Tierney and Kadane, , 1986; Williams and Barber, , 1998), a classic technique for integral evaluation based on second-order Taylor expansion. Bonat and Ribeiro Jr, (2016) show numerically that the Laplace approximation can be a practical and accurate method for GGP inference.

It has long been recognized that the computational cost for GPs is cubic in the data size. Much work has been done on GP approximations that address this problem in the context of Gaussian noise (as recently reviewed in Heaton et al., , 2019). Low-rank approaches (e.g., Higdon, , 1998; Quiñonero-Candela and Rasmussen, , 2005; Banerjee et al., , 2008; Cressie and Johannesson, , 2008; Katzfuss and Cressie, , 2011) have limitations in the presence of fine-scale structure (Stein, , 2014), but they have proved popular due to their simplicity. Approximations relying on sparsity in covariance matrices (Furrer et al., , 2006; Kaufman et al., , 2008) by definition can only capture local, short-range dependence and cannot guarantee low computation cost. Approaches that take advantage of Toeplitz or Kronecker structure (e.g., Dietrich and Newsam, , 1997; Flaxman et al., , 2015; Guinness and Fuentes, , 2017) can be extremely efficient but are not as generally applicable. Recently, methods relying on sparsity in precision matrices (Rue and Held, , 2005; Lindgren et al., , 2011; Nychka et al., , 2015) have gained popularity due to their accuracy and flexibility. In particular, a class of highly promising GP approximations (Vecchia, , 1988; Stein et al., , 2004; Datta et al., , 2016; Guinness, , 2018; Katzfuss and Guinness, , 2019; Katzfuss et al., 2020a, ) rely on ordered conditional independence that can guarantee linear scaling in the data size while resolving dependence at all scales.

There are also a number of existing methods for large non-Gaussian datasets modeled using GGPs. A popular approach is to combine a low-rank GP with an approximation of the non-Gaussian likelihood, as the dimension reduction inherent in the low-rank approximation carries through even to the non-Gaussian case. Sengupta and Cressie, (2013) estimate parameters using an expectation-maximization algorithm with low-rank and Laplace approximations. Sheth et al., (2015) use variational inference to obtain the posterior and select a set of conditioning points for their low-rank approximation. Some methods of dimension reduction, including random sketching (e.g., Yang et al., , 2017) and projection, offer theoretical guarantees and can be combined with MCMC methods for the analysis of non-Gaussian data (e.g., Hughes and Haran, , 2013; Guan and Haran, , 2018), but are still subject to the limitations of low-rank methods. Nickisch et al., (2018) develop state-space models for one-dimensional non-Gaussian time series, which can perform inference in linear time and memory using a set of knots in time, a form of low-rank approximation. Alternate priors (e.g., log gamma priors for count data, Bradley et al., , 2018) are an interesting but specialized approach to completely avoid the intractability issues with GGPs.

Similar to what we shall propose, some authors have combined a sparse-precision approach with a non-Gaussian approximation. A prominent example is Lindgren et al., (2011), in which an integrated nested Laplace approximation (INLA) is combined with a sparse-precision approximation of the GP using its representation as the solution to a stochastic partial differential equation. Datta et al., (2016) proposed to apply the GP approximation of Vecchia, (1988) to a latent GP, but did not provide an explicit algorithm for large non-Gaussian data. While both Lindgren et al., (2011) and Datta et al., (2016) limit the number of nonzero entries per row or column in the precision matrix to a small constant, the computational complexity for decomposing this sparse $n\times n$ matrix is not linear in $n$ , but rather $\mathcal{O}(n^{3/2})$ in two dimensions (Lipton et al., , 1979, Thm. 6), and at least $\mathcal{O}(n^{2})$ in higher dimensions. In the Gaussian setting, this scaling problem can be overcome by applying a Vecchia approximation to the observed data (Vecchia, , 1988) or to the joint distribution of the observed data and the latent GP (Katzfuss and Guinness, , 2019).

To handle both scaling and intractability, we propose a Vecchia-Laplace (VL) approximation for GGPs. The posterior mode necessary for the Laplace approximation is found using the Newton-Raphson algorithm, which can be viewed as iterative GP inference based on Gaussian pseudo-data. At each iteration of our VL algorithm, the joint Gaussian distribution of the pseudo-data and the latent GP realizations is approximated using the general Vecchia framework (Katzfuss and Guinness, , 2019; Katzfuss et al., 2020a, ). By modeling the joint distribution of pseudo-data and GP realizations at each iteration, our VL approach can ensure sparsity and guarantee linear scaling in $n$ for any dimension, overcoming the scaling issues of the sparse-matrix approaches mentioned above.

To our knowledge, we provide the first explicit algorithm extending and applying the highly promising class of general-Vecchia GP approximations to large non-Gaussian data. We believe it to be a useful addition to the literature due to its speed, simplicity, guaranteed numerical performance, and wide applicability (e.g., binary, count, right-skewed, and point-pattern data). In addition, as shown in Katzfuss and Guinness, (2019), the general Vecchia approximation includes many popular GP approximations (e.g., Vecchia, , 1988; Snelson and Ghahramani, , 2007; Finley et al., , 2009; Sang et al., , 2011; Datta et al., , 2016; Katzfuss, , 2017; Katzfuss and Gong, , 2019) as special cases, and so our VL methodology also directly provides extensions of these GP approximations to non-Gaussian data.

The remainder of this document is organized as follows. In Section 2, we review the Laplace approximation and general Vecchia. In Section 3, we introduce and examine our VL method, including parameter inference and predictions at unobserved locations. In Sections 4 and 5, we study and compare the performance of VL on simulated and real data, respectively. Some details are left to the appendix. A separate Supplementary Material document contains Sections LABEL:supp:vllikelihood–LABEL:supp:MODIS with additional derivations, simulations, and discussion. The methods and algorithms proposed here are implemented in the R package GPvecchia (Katzfuss et al., 2020c, ) with sensible default settings, so that a wide audience of practitioners can immediately use the code with little background knowledge. Our results and figures can be reproduced using the code and data at https://github.com/katzfuss-group/GPvecchia-Laplace.

2 Review of existing results

2.1 Generalized Gaussian processes

Let $y(\cdot)\sim GP(\mu,K)$ be a latent Gaussian process with mean function $\mu$ and kernel or covariance function $K$ on a domain $\mathcal{D}\subset\mathbb{R}^{d}$ , $d\in\mathbb{N}^{+}$ . Observations $\mathbf{z}=(z_{1},\ldots,z_{n})^{\prime}$ at locations $\mathbf{s}_{i}\in\mathcal{D}$ are assumed to be conditionally independent, $z_{i}|\mathbf{y}\stackrel{{\scriptstyle ind.}}{{\sim}}g_{i}(z_{i}|y_{i})$ , where $\mathbf{y}=(y_{1},\ldots,y_{n})^{\prime}$ and $y_{i}=y(\mathbf{s}_{i})$ . We assume that the observation densities or likelihoods $g_{i}$ are from the exponential family. Parameters $\bm{\theta}$ in $\mu$ , $K$ , or the $g_{i}$ will be assumed fixed and known for now; for example, $\bm{\theta}$ may contain regression coefficients determining the mean function $\mu$ , or variance, smoothness, and range parameters determining a Matérn covariance $K$ .

Our goal is to obtain an approximation of the posterior of $\mathbf{y}$ , which takes the form

[TABLE]

where $\bm{\mu}=(\mu(\mathbf{s}_{1}),\ldots,\mu(\mathbf{s}_{n}))^{\prime}$ , and $\mathbf{K}$ is an $n\times n$ covariance matrix with $(i,j)$ entry $(\mathbf{K})_{i,j}=K(\mathbf{s}_{i},\mathbf{s}_{j})$ . Once an approximation of the posterior (1) has been obtained, it is conceptually straightforward to extend this result to other quantities of interest, such as the integrated likelihood for inference on parameters $\bm{\theta}$ (see Section 3.2), and prediction of $y(\cdot)$ at unobserved locations (see Section 3.3).

2.2 Review of the Laplace approximation

The normalizing constant $p(\mathbf{z})$ in (1) is not available in closed form for non-Gaussian likelihoods. A popular approach to this issue is the Laplace approximation (e.g., Williams and Barber, , 1998; Rasmussen and Williams, , 2006, Section 3.4), which approximates $p(\mathbf{z})=\int\exp(\log p(\mathbf{z}|\mathbf{y}))p(\mathbf{y})d\mathbf{y}$ via a second-order Taylor expansion of $\log p(\mathbf{z}|\mathbf{y})$ at the mode of the posterior density $p(\mathbf{y}|\mathbf{z})$ . As this results in an exponentiated quadratic form in $\mathbf{y}$ , it is equivalent to a Gaussian approximation of the likelihood. The mode of $\log p(\mathbf{y}|\mathbf{z})$ does not depend on the normalizing constant, and so it can be obtained using standard optimization procedures such as the Newton-Raphson algorithm. The crucial observation for our later developments is that each Newton-Raphson update in the GGP setting is equivalent to computing the posterior mean of $\mathbf{y}$ given Gaussian pseudo-data (e.g., Rasmussen and Williams, , 2006, Sect. 3.4.1). Upon convergence of the algorithm, we have a Laplace approximation for the normalizing constant and a Gaussian approximation for the likelihood, which gives us a Gaussian posterior.

We now go into the details of this approximation. Based on the first and second derivative of $\log g_{i}$ , we define

[TABLE]

Stacking these quantities as $\mathbf{u}_{\mathbf{y}}=\big{(}u_{1}(y_{1}),\ldots,u_{n}(y_{n})\big{)}^{\prime}$ and $\mathbf{D}_{\mathbf{y}}=\operatorname{diag}\big{(}d_{1}(y_{1}),\ldots,d_{n}(y_{n})\big{)}$ , it is easy to see that $\frac{\partial}{\partial\mathbf{y}}\log p(\mathbf{y}|\mathbf{z})=-\mathbf{K}^{-1}(\mathbf{y}-\bm{\mu})+\mathbf{u}_{\mathbf{y}}$ and $-\frac{\partial^{2}}{\partial\mathbf{y}\partial\mathbf{y}^{\prime}}\log p(\mathbf{y}|\mathbf{z})=\mathbf{K}^{-1}+\mathbf{D}_{\mathbf{y}}^{-1}\equalscolon\mathbf{W}_{\mathbf{y}}$ . When given the posterior mode $\bm{\alpha}=\operatorname*{arg\,max}_{\mathbf{y}\in\mathbb{R}^{n}}\log p(\mathbf{y}|\mathbf{z})$ , the combined Gaussian/Laplace approximation of the posterior is

[TABLE]

The subscript $\bm{\alpha}$ in $\mathbf{W}^{-1}_{\bm{\alpha}}$ implies evaluation of $\mathbf{W}_{\mathbf{y}}$ at the mode $\bm{\alpha}$ , rather than at an arbitrary $\mathbf{y}$ . To obtain the mode $\bm{\alpha}$ with the Newton-Raphson algorithm, we start with an initial value $\mathbf{y}^{(0)}$ , and update the current guess for $\ell=0,1,2,\ldots$ until convergence as $\mathbf{y}^{(\ell+1)}=\mathbf{h}(\mathbf{y}^{(\ell)})$ , where

[TABLE]

This Newton-Raphson update is equivalent to computing the posterior mean of $\mathbf{y}$ given Gaussian pseudo-data $\mathbf{t}_{\mathbf{y}}=\mathbf{y}+\mathbf{D}_{\mathbf{y}}\mathbf{u}_{\mathbf{y}}$ with noise covariance matrix $\mathbf{D}_{\mathbf{y}}$ . Specifically, we can write the Newton-Raphson update in (3) as:

[TABLE]

which is the conditional mean of $\mathbf{y}$ given Gaussian pseudo-data $\mathbf{t}_{\mathbf{y}}|\mathbf{y}\sim\mathcal{N}_{n}(\mathbf{y},\mathbf{D}_{\mathbf{y}})$ . The derivation of (4) is straightforward and included in Appendix A for completeness. This means we can obtain the mode $\bm{\alpha}$ by iterating between (a) computing pseudo-data $\mathbf{t}_{\mathbf{y}^{(\ell)}}$ with $i$ th entry $y_{i}^{(\ell)}+d_{i}(y_{i}^{(\ell)})u_{i}(y_{i}^{(\ell)})$ , and (b) obtaining the posterior mean $\mathbf{y}^{(\ell+1)}$ of $\mathbf{y}$ given $\mathbf{t}_{\mathbf{y}^{(\ell)}}$ assuming independent Gaussian noise with variances $d_{1}(y_{1}^{(\ell)}),\ldots,d_{n}(y_{n}^{(\ell)})$ .

Some examples of popular likelihoods and the corresponding pseudo-data and pseudo-variances are summarized in Table 1. The Bernoulli and Poisson cases are also illustrated in Figure 1.

Once the algorithm has converged (i.e., $\bm{\alpha}:=\mathbf{y}^{(\ell+1)}=\mathbf{y}^{(\ell)}$ ), we can use the second-order expansion of the loglikelihood at the mode as a Gaussian approximation of the likelihood based on pseudo-data,

[TABLE]

or combine it with the Laplace approximation to get a Gaussian approximation of the posterior conditional on pseudo-data,

[TABLE]

For conciseness, we henceforth refer to (6) as the “Laplace approximation,” rather than the more precise “combined Gaussian and Laplace approximation.”

2.3 Review of the general Vecchia approximation

The Laplace approximation described in Section 2.2 allows us to deal with non-Gaussian likelihoods, but it still requires decomposing the $n\times n$ matrix $\mathbf{K}$ and thus scales as $\mathcal{O}(n^{3})$ . To achieve computational feasibility even for data sizes $n$ in the tens of thousands or more, we also apply a general Vecchia approximation (Katzfuss and Guinness, , 2019), which we will briefly review here.

Assume that $\mathbf{y}\sim\mathcal{N}_{n}(\bm{\mu},\mathbf{K})$ is a vector of GP realizations and $\mathbf{t}|\mathbf{y}\sim\mathcal{N}_{n}(\mathbf{y},\mathbf{D})$ a vector of noisy data, where $\mathbf{D}$ is diagonal. Then, consider a vector $\mathbf{x}=\mathbf{y}\cup\mathbf{t}$ consisting of the $2n$ elements of $\mathbf{y}$ and $\mathbf{t}$ in some ordering (more details below). It is well known that the density function, $p(\mathbf{x})$ , can be factored into a product of univariate conditional densities, $p(\mathbf{x})=\prod_{i=1}^{2n}p(x_{i}|\mathbf{x}_{1:i-1}).$ The general Vecchia framework extends the approximation of Vecchia, (1988) to the vector $\mathbf{x}$ consisting of latent GP realizations and noisy data, resulting in the approximate density

[TABLE]

where $c(i)\subset\{1,\ldots,i-1\}$ is a conditioning index set of size $m$ (or of size $i-1$ for $i\leq m$ ). A small $m$ can lead to enormous computational savings and good approximations; Schäfer et al., (2020) show that under some settings, the approximation error can be bounded when $m$ increases only polylogarithmically with $n$ . Connections between Vecchia and composite likelihood (e.g., Varin et al., , 2011) are discussed in Katzfuss and Guinness, (2019, Sect. 3.8).

As $y_{i}=y(\mathbf{s}_{i})$ is indexed by location and $t_{i}$ is the corresponding noisy observation, the ordering within $\mathbf{y}$ and within $\mathbf{t}$ is determined by an ordering of the observed locations, $\mathbf{s}_{1},\ldots,\mathbf{s}_{n}$ . We will use a coordinate-based (left-to-right) ordering in one spatial dimension. In higher-dimensional spaces, we recommend a maxmin ordering (Guinness, , 2018; Schäfer et al., , 2017), which sequentially chooses each location in the ordering to maximize the minimum distance to previous locations in the ordering.

By straightforward extension of the proof of Prop. 1 in Katzfuss and Guinness, (2019) to the case $\bm{\mu}\neq\mathbf{0}$ , it can be shown that the approximation in (7) implies a multivariate normal joint distribution, $\widehat{p}(\mathbf{x})=\mathcal{N}(\bm{\mu}_{x},\mathbf{Q}^{-1})$ , where $\bm{\mu}_{x,i}=\mu(\mathbf{s}_{j})$ if $x_{i}=y_{j}$ or $x_{i}=t_{j}$ , $\mathbf{Q}=\mathbf{U}\mathbf{U}^{\prime}$ , and $\mathbf{U}$ is the sparse upper triangular Cholesky factor based on a reverse row-column ordering of $\mathbf{Q}$ . We write this as $\mathbf{U}=\operatorname{rchol}(\mathbf{Q})\colonequals\operatorname{rev}(\operatorname{chol}(\operatorname{rev}(\mathbf{Q})))$ , where $\operatorname{rev}(\cdot)$ reverse-orders the rows and columns of its matrix argument. The nonzero entries of $\mathbf{U}$ are computed directly based on the covariance function $K$ as described in Appendix B.

Let $\mathbf{U}_{y}$ and $\mathbf{U}_{t}$ be the submatrices of $\mathbf{U}$ consisting of the rows of $\mathbf{U}$ corresponding to $\mathbf{y}$ and $\mathbf{t}$ , respectively. Then, the sparse matrix $\mathbf{W}=\mathbf{U}_{y}\mathbf{U}_{y}^{\prime}$ is the general Vecchia approximation to the posterior precision matrix of $\mathbf{y}$ given $\mathbf{t}$ . Defining $\mathbf{V}\colonequals\operatorname{rchol}(\mathbf{W})$ , we can obtain the posterior mean of $\mathbf{y}$ as $\operatorname{E}(\mathbf{y}|\mathbf{t})=\bm{\mu}-(\mathbf{V}^{\prime})^{-1}\mathbf{V}^{-1}\mathbf{U}_{y}\mathbf{U}_{t}^{\prime}(\mathbf{t}-\bm{\mu})$ .

2.4 Ordering in Vecchia approximations

We now describe two specific approximations within the general Vecchia framework, which are based on how the elements of $\mathbf{y}$ and $\mathbf{t}$ are ordered in the vector $\mathbf{x}$ in (7): Interweaved (IW) ordering and response-first (RF) ordering. While other ordering and conditioning schemes can also be used in the Vecchia-Laplace methodology to be introduced in Section 3, we recommend these specific schemes to achieve high accuracy while ensuring linear complexity.

2.4.1 Interweaved (IW) ordering

Vecchia-Interweaved (IW) is the sparse general Vecchia approach proposed for likelihood inference in Katzfuss and Guinness, (2019), reviewed briefly here. It is a special case of general Vecchia in (7), in which $\mathbf{x}=(y_{1},t_{1},y_{2},t_{2},\ldots,y_{n},t_{n})^{\prime}$ is specified using an interweaved ordering of the latent $\mathbf{y}$ and responses $\mathbf{t}$ . We consider the following specific expression for (7):

[TABLE]

If $x_{j}=t_{i}$ , we only condition on $y_{i}$ , because $\mathbf{D}$ is diagonal and therefore $t_{i}$ is conditionally independent of all other variables in $\mathbf{y}$ and $\mathbf{t}$ given $y_{i}$ . If $x_{j}=y_{i}$ , we condition on $\mathbf{y}_{q_{y}(i)}$ and $\mathbf{t}_{q_{t}(i)}$ , where $q(i)=q_{y}(i)\cup q_{t}(i)$ is the conditioning index vector consisting of the indices of the nearest $m$ locations previous to $i$ in the ordering. For splitting $q(i)$ into $q_{y}(i)$ and $q_{t}(i)$ , we attempt to maximize $q_{y}(i)$ while ensuring linear complexity (Katzfuss and Guinness, , 2019). Specifically, for $i=1,\ldots,n$ , we set $q_{y}(i)=(k_{i})\cup(q_{y}(k_{i})\cap q(i))$ , where $k_{i}\in q(i)$ is the index whose latent-conditioning set has the most overlap with $q(i)$ : $k_{i}=\arg\max_{j\in q(i)}|q_{y}(j)\cap q(i)|$ , choosing the closest $k_{i}$ in space to $\mathbf{s}_{i}$ in case of a tie. In one-dimensional space with coordinate ordering, this results in $q_{y}(i)=q(i)=(\max(1,i-m),\ldots,i-1)$ and $q_{z}(i)=\emptyset$ . In higher-dimensional space, we may not be able to condition entirely on $\mathbf{y}$ , so the remaining conditioning indices are assigned to $q_{t}(i)=q(i)\setminus q_{y}(i)$ . These conditioning rules guarantee that $\mathbf{U}$ and $\mathbf{V}$ are both highly sparse with at most $m$ nonzero off-diagonal elements per column. Katzfuss and Guinness, (2019) showed that these matrices, and the resulting posterior mean and precision matrix, can be obtained in $\mathcal{O}(nm^{3})$ time.

2.4.2 Response-first (RF) ordering

For approximating predictions at observed locations in Algorithm 1 in more than one dimension, we recommend the new RF-full method described in Katzfuss et al., 2020a , reviewed briefly here. RF-full orders first all response variables, then all latent variables: $\mathbf{x}=(\mathbf{t}^{\prime},\mathbf{y}^{\prime})^{\prime}=(t_{1},\ldots,t_{n},y_{1},\ldots,y_{n})^{\prime}$ . We consider the following specific expression for (7):

[TABLE]

The responses $t_{i}$ do not condition on anything and are considered independent; this implies a poor approximation to $p(\mathbf{t})$ , but it does not affect the posterior distribution $p(\mathbf{y}|\mathbf{t})$ , which is the relevant quantity for our purposes. We now assume $q(i)=q_{y}(i)\cup q_{t}(i)$ to be set of indices corresponding to the $m$ locations closest to $\mathbf{s}_{i}$ (including $\mathbf{s}_{i}$ ), not considering the ordering. For any $j\in q(i)$ , we then let $y_{i}$ condition on $y_{j}$ if it is ordered previously in $\mathbf{x}$ ; otherwise, we condition on $t_{j}$ . More precisely, we set $q_{y}(i)=\{j\in q(i):j<i\}$ and $q_{t}(i)=\{j\in q(i):j\geq i\}$ . Similar to IW, RF-full inference can be carried out in $\mathcal{O}(nm^{3})$ time (Katzfuss et al., 2020a, ).

3 Vecchia-Laplace methods

We now introduce our Vecchia-Laplace (VL) approximation, which allows fast inference for large datasets modeled using GGPs, by combining the Laplace and general Vecchia approximations reviewed in Section 2.

3.1 The VL algorithm

To apply a Laplace approximation, it is first necessary to find the mode of the posterior density of $\mathbf{y}$ . Rapid convergence to the mode can be achieved using a Newton-Raphson algorithm, which can be viewed as iteratively computing a new value $\mathbf{y}^{(l+1)}$ as the posterior mean of the latent GP realization $\mathbf{y}$ based on Gaussian pseudo-data $\mathbf{t}=\mathbf{t}_{\mathbf{y}^{(l)}}$ , as discussed in Section 2.2. Our VL algorithm applies a general Vecchia approximation $\widehat{p}(\mathbf{x})$ to the joint distribution of $\mathbf{x}=\mathbf{y}\cup\mathbf{t}$ at each iteration $l$ , and computes the posterior mean of $\mathbf{y}$ given $\mathbf{t}$ under this approximate distribution. We recommend IW ordering (Section 2.4.1) in one spatial dimension, and RF ordering (Section 2.4.2) when working in more than one dimension. The resulting VL algorithm is presented as Algorithm 1. After convergence, we obtain the approximation

[TABLE]

Once the algorithm has converged and the posterior mean $\bm{\alpha}_{V}$ and precision $\mathbf{W}_{V}$ have been obtained, the posterior distribution in (9) can be used for estimation of the integrated likelihood (Section 3.2) and for prediction at unobserved locations (Section 3.3). As we will see in our simulation studies later, even for moderate $m$ , the VL procedure in Algorithm 1 essentially finds the exact mode of the posterior.

3.2 Integrated likelihood for parameter inference

In the case of unknown parameters $\bm{\theta}$ in $\mu$ , $K$ , or in the $g_{i}$ , we would like to carry out parameter inference based on the integrated likelihood,

[TABLE]

However, this quantity is exactly the unknown normalizing constant in the denominator of (1), and the integral can generally not be carried out analytically. Instead, we will base parameter inference on the integrated likelihood implied by our VL approximation. In the following, we will again suppress dependence on $\bm{\theta}$ for ease of notation.

First, rearranging terms in (1), we have $p(\mathbf{z})=p(\mathbf{z}|\mathbf{y})p(\mathbf{y})/p(\mathbf{y}|\mathbf{z})$ . The Laplace approximation approximates the posterior in the denominator as $\widehat{p}_{L}(\mathbf{y}|\mathbf{z})=p(\mathbf{y}|\mathbf{t}_{\bm{\alpha}})$ (see (6)). Noting that rearranging the definition of a conditional density gives $p(\mathbf{y})=p(\mathbf{y},\mathbf{t})/p(\mathbf{t}|\mathbf{y})$ , we obtain the Laplace approximation of the integrated likelihood:

[TABLE]

where the terms are evaluated at $\mathbf{y}=\bm{\alpha}$ and $\mathbf{t}=\mathbf{t}_{\bm{\alpha}}$ . In this form, the approximation of the integrated likelihood of the data $\mathbf{z}$ can be interpreted as a product of the integrated likelihood of the Gaussian pseudo-data $p(\mathbf{t})$ , times a correction term given by the ratio of the true likelihood to the Gaussian likelihood of the pseudo-data: $p(\mathbf{z}|\mathbf{y})/p(\mathbf{t}|\mathbf{y})=\prod_{i=1}^{n}g_{i}(z_{i}|y_{i})/\mathcal{N}(t_{i}|y_{i},d_{i})$ .

To achieve scalability, we approximate the density $p(\mathbf{t})=p(\mathbf{x})/p(\mathbf{y}|\mathbf{t})$ as implied by the IW approximation $\widehat{p}_{IW}(\mathbf{x})$ in (8). The resulting expression for $\widehat{p}_{IW}(\mathbf{t})$ is derived in Katzfuss and Guinness, (2019) for the case of $\bm{\mu}=\mathbf{0}$ . We show in Section LABEL:supp:vllikelihood that the approximate density essentially has the same form if the prior mean is not zero:

[TABLE]

where $\tilde{\mathbf{t}}=\mathbf{U}_{t}^{\prime}(\mathbf{t}-\bm{\mu})$ and $\breve{\mathbf{t}}=\mathbf{V}^{-1}\mathbf{U}_{y}\tilde{\mathbf{t}}$ .

Thus, for a specific parameter value $\bm{\theta}$ , we run Algorithm 1 based on $\bm{\theta}$ to obtain $\bm{\alpha}_{V}$ , set $\mathbf{y}=\bm{\alpha}_{V}$ , $\mathbf{t}=\mathbf{t}_{\bm{\alpha}_{V}}$ , and $d_{i}=(\mathbf{D}_{\bm{\alpha}_{V}})_{ii}$ , and then evaluate the VL integrated likelihood as

[TABLE]

We can plug $\mathcal{L}_{VL}(\bm{\theta})$ into any numerical likelihood-based inference procedure, such as numerical optimization for finding the maximum likelihood estimator of $\bm{\theta}$ , or sampling-based algorithms for finding the posterior of $\bm{\theta}$ . In an iterative inference procedure, we recommend initializing $\mathbf{y}^{(0)}$ in Algorithm 1 at the mode $\bm{\alpha}_{V}$ obtained for the previous parameter value. Our integrated likelihood can also be used directly to evaluate the posterior of $\bm{\theta}$ over a grid of high-probability points (Rue et al., , 2009, Sect. 3.1). An extension to the integrated nested Laplace approximation (INLA) that improves the accuracy of the marginal posteriors of the $y_{i}$ (Rue et al., , 2009, Sect. 3.2) is straightforward.

3.3 Predictions at unobserved locations

We now consider making predictions at $n^{\star}$ unobserved locations, $\mathcal{S}^{\star}=\{\mathbf{s}_{1}^{*},\ldots,\mathbf{s}_{n^{\star}}^{*}\}$ , by obtaining the posterior distribution of $\mathbf{y}^{\star}=(y_{1}^{\star},\ldots,y_{n^{\star}}^{\star})^{\prime}$ with $y_{i}^{\star}=y(\mathbf{s}_{i}^{\star})$ . Using the Laplace approximation as expressed in (5), GGP predictions are approximated as GP predictions given Gaussian pseudo-data $\mathbf{t}_{\bm{\alpha}}$ with noise covariance matrix $\mathbf{D}_{\bm{\alpha}}$ .

Hence, to obtain scalable predictions at unobserved locations, we use the recommended prediction methods in Katzfuss et al., 2020a that apply Vecchia approximations to the multivariate normal vector $\widetilde{\mathbf{x}}=\mathbf{t}\cup\mathbf{y}\cup\mathbf{y}^{\star}$ . For one-dimensional space, we use an extension of IW called LF-auto in Katzfuss et al., 2020a , and for higher-dimensional space we use the RF-full method of Katzfuss et al., 2020a . In both cases, the pseudo-data $\mathbf{t}=\mathbf{t}_{\bm{\alpha}_{V}}$ and the noise variances $\mathbf{D}=\mathbf{D}_{\bm{\alpha}_{V}}$ are evaluated at the approximate mode $\bm{\alpha}_{V}$ obtained using Algorithm 1. Based on this approximation, we can compute the implied posterior distribution of $\widetilde{\mathbf{y}}=\mathbf{y}\cup\mathbf{y}^{\star}$ as described in Section 2.3: $\widehat{p}(\widetilde{\mathbf{y}}|\mathbf{t})\sim\mathcal{N}(\widetilde{\bm{\mu}},(\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\prime})^{-1})$ . Katzfuss et al., 2020a describe how to efficiently extract quantities of interest from this distribution, including the posterior mean and variances at unobserved locations. Finally, summaries or samples from the posterior of $\widetilde{\mathbf{y}}$ can be transformed to the data scale using the likelihood function $g(z|y)$ , if desired. Sometimes it is difficult to compute certain predictive summaries at the data scale analytically, but it is always possible to approximate them via sampling.

Algorithm LABEL:alg:practical in Section LABEL:app:algo provides pseudo-code for maximum-likelihood estimation of parameters and for prediction.

3.4 Properties

3.4.1 Complexity

Inference for GPs with independent Gaussian noise using the Vecchia approximations considered here requires $\mathcal{O}(nm^{3})$ time, where $m$ is the maximum size of the conditioning sets $q(i)$ , and can be easily parallelized (Katzfuss and Guinness, , 2019; Katzfuss et al., 2020a, ). Our VL Algorithm 1 iteratively computes the Vecchia approximation multiple times until convergence, only adding $\mathcal{O}(n)$ cost at each iteration for computing the pseudo-data $\mathbf{t}_{\mathbf{y}^{(l)}}$ . Hence, the VL algorithm requires $\mathcal{O}(knm^{3})$ time, where $k$ , the number of iterations required until convergence, can be very small (often, $k<10$ ).

Once $\bm{\alpha}_{V}$ has been determined using Algorithm 1, evaluating the integrated likelihood (11) for parameter inference requires $\mathcal{O}(nm^{3})$ time (Katzfuss and Guinness, , 2019), and prediction at $n^{\star}$ unobserved locations requires $\mathcal{O}((n+n^{\star})m^{3})$ time (Katzfuss et al., 2020a, ). Thus, all computational costs are linear in $n$ for fixed $m$ .

3.4.2 Approximation errors

Our VL approximation $\widehat{p}_{VL}(\mathbf{y}|\mathbf{z})=\mathcal{N}_{n}(\mathbf{y}|\bm{\alpha}_{V},\mathbf{W}_{V}^{-1})$ in (9) has two sources of error relative to the true posterior $p(\mathbf{y}|\mathbf{z})$ : the Vecchia approximation and the Laplace approximation. Both errors are difficult to quantify in general, but our numerical experiments in Section 4 show that our approximation can be very accurate. The error due to the Vecchia approximation can always be reduced by increasing $m$ (e.g., Katzfuss et al., 2020a, ).

The error of the Laplace approximation is known to depend on the likelihood being approximated. Laplace is exact for Gaussian likelihoods, in which case the VL approximation reverts to the general Vecchia approximation. For non-Gaussian spatial data, theoretical error bounds are difficult to obtain (e.g., Rue et al., , 2009, Sect. 4.1). From an empirical point of view, Fong et al., (2010) affirm the non-spatial results of Breslow and Clayton, (1993), showing that INLA, an extension of the Laplace approximation, generally performs well for GGPs, with the exception of some types of binomial data. Bonat and Ribeiro Jr, (2016) provide a thorough simulation study comparing Laplace to MCMC methods for parameter estimation in the case of binomial, Poisson, and negative-binomial spatial data; they conclude that the Laplace approximation is “a safe option” that is computationally practical.

3.4.3 Convergence

For GGPs as described in Section 2.1, the log-posterior in (1) is concave under appropriate parameterizations. Existing results show that the Newton-Raphson algorithm used in the Laplace approximation is then theoretically guaranteed to converge to its mode (e.g., Boyd and Vandenberghe, , 2004, Section 9.5.2). In our VL Algorithm 1, the distribution $\widehat{p}(\mathbf{y})$ implied by the general Vecchia approximation changes at each iteration, which makes it difficult to theoretically guarantee convergence, except in special cases. Fortunately, empirical testing of Algorithm 1 under different parameter and data settings showed that convergence can always be expected when machine precision is not an issue.

4 Simulations and comparisons

We compared our VL approaches to other methods using simulated data. Throughout Section 4, unless specified otherwise, we simulated realizations $\mathbf{y}$ on a grid of size $\sqrt{n}\times\sqrt{n}$ on the unit square from a GP with mean zero and a Matérn covariance function with variance 1, smoothness $\nu$ , and range parameter $\lambda=0.05$ . Gridded locations allow us to carry out simulations for large $n$ using Fourier methods. The data were then generated conditional on $\mathbf{y}$ using the four likelihoods in Table 1, with $a=2$ in the Gamma case.

As low-rank approximations are very popular for large spatial data, we also considered a fully independent conditional or modified-predictive-process approximation to Laplace with $m$ knots (abbreviated as LowRank here), which is equivalent to VL-IW except that each conditioning set $q_{y}(i)=(1,\ldots,m)$ simply consists of the first $m$ latent variables in maxmin ordering. This equivalence allowed us to run VL and LowRank using the same code base, thus avoiding differences solely due to programming.

Criteria used for comparison are the run time (on a 2017 MacBook Pro), the relative root mean square error (RRMSE) and the difference in log scores (dLS). Results are averaged over 100 simulated datasets, unless noted otherwise. The RRMSE is the root mean square error of the posterior mean of $\mathbf{y}$ obtained by one of the approximation methods relative to the true simulated $\mathbf{y}$ , divided by the RMSE of the Laplace approximation. The log score is computed as the negative logarithm of the approximated posterior density of $\mathbf{y}$ evaluated at the true $\mathbf{y}$ , with low values corresponding to well calibrated and sharp posterior distributions (e.g., Gneiting and Katzfuss, , 2014, Sect. 3). The dLS is the log score of an approximation method minus the log score for the Laplace approximation. When averaged over a sufficient number of simulated data, the dLS can be shown to approximate the difference between the Kullback-Leibler (KL) divergence of the exact posterior distribution and the considered approximation, minus the KL divergence between the exact distribution and the Laplace approximation.

4.1 Comparison to MCMC

Non-Gaussian spatial models are often fitted using Markov chain Monte Carlo (MCMC), which under mild regularity conditions is “exact approximate,” converging to the true posterior as the number of iterations approaches infinity. For finite computation time and large $n$ , however, MCMC results can be very poor relative to the Laplace approximation. We demonstrate this with a single simulated dataset consisting of $n=625$ Bernoulli observations based on a GP with smoothness parameter $\nu=0.5$ on the unit square. We compared Laplace and VL-RF to Hamiltonian Monte Carlo (HMC; Neal et al., , 2011), a MCMC method well suited to sampling correlated variables. As shown in Figure 2, VL quickly achieved the same accuracy as Laplace as $m$ increased, but at a fraction of the computing time. In contrast, HMC took orders of magnitude longer to achieve similar accuracy. Even with 1 million iterations, the RMSE for HMC was slightly higher than for VL; this is in line with existing simulation studies suggesting that the Laplace approximation error may be negligible in many GGP settings (see Section 3.4.2). We expect the relative performance of HMC to degrade further as $n$ increases. More details and results can be found in Section LABEL:supp:sim.

4.2 Computational scaling of Laplace approximations

While the Laplace approximation is very useful for moderate data sizes $n$ , we now briefly illustrate the computational infeasibility for large $n$ due to its cubic scaling. In Figure 3, we show the average computation time for observations with smoothness $\nu=0.5$ in the setting described later in Section 4.4. Clearly, Laplace using Newton-Raphson quickly became infeasibly slow as $n$ increased. In contrast, VL and LowRank scaled roughly linearly.

4.3 VL accuracy in one-dimensional space

We now compare the accuracy of the VL and LowRank approximations. Both approaches scale linearly in $n$ for fixed $m$ , and both approaches converge to the Laplace approximation as $m$ increases, with equivalence guaranteed for $m=n-1$ .

Figure 4 shows the average results for 100 simulated datasets of size $n=2{,}500$ each, on the unit interval. For the Gaussian likelihood, the noise variance was $\tau^{2}=0.1^{2}$ . Clearly, VL-IW was extremely accurate and delivered essentially equivalent results to the Laplace approximation, even for very small $m$ . For exponential covariance (i.e., Matérn with smoothness $\nu=0.5$ ), an exact screening effect holds in one-dimensional space, and so VL-IW is exactly equal to Laplace for any $m\geq 1$ . LowRank required much larger $m$ to achieve equivalence to Laplace.

4.4 VL accuracy in two-dimensional space

Figure 5 shows results for the same simulation study as in Section 4.3, except that the data were simulated on the two-dimensional unit square, with noise variance $\tau^{2}=0.1$ for the Gaussian likelihood. While all methods are again equivalent to Laplace for $m=n-1$ , the two-dimensional problem is considerably more difficult, and higher values of $m$ were required for accurate approximations. As we can see, the recommended VL-RF had roughly equivalent performance to Laplace once $m$ reached 20, and it was more accurate than VL-IW for $m>10$ . LowRank performed considerably worse than the VL methods, and further simulations (not shown) showed that in some cases LowRank approached the accuracy of Laplace only when $m$ was almost as large as $n$ . A simulation with larger range parameter $\lambda=0.2$ is shown in Section LABEL:supp:sim2D_rangep2 of the supplement; while VL-RF was still more accurate than LowRank for all settings, the larger range reduced the amount of fine-scale variation, thus reducing the advantage of VL over LowRank relative to Figure 5, especially for logistic models. The relative performance of the methods was similar in higher dimensions; plots for 3 and 4 dimensions are shown in Section LABEL:sec:simhigherdim.

For larger $n$ , the differences between LowRank and VL became even more pronounced. Figure 6 shows the RMSE for simulations with increasing sample size $n$ but fixed $m$ . VL-RF improved in accuracy under this asymptotic in-fill scenario almost as fast as Laplace, while LowRank failed to improve.

4.5 Simulations for log-Gaussian Cox processes

Point patterns are sets of points or locations $\mathbf{s}_{1},\ldots,\mathbf{s}_{N}$ in a domain $\mathcal{D}$ . A popular model for point patterns is the log-Gaussian Cox process (LGCP), a doubly stochastic Poisson process whose intensity function $\lambda(\cdot)$ is modeled as random, $\log\lambda(\cdot)=y(\cdot)\sim GP(\mu,C)$ . Inference for LGCPs is difficult due to stochastic integrals.

A natural approximation (e.g., Diggle et al., , 2013) for LGCPs relies on partitioning the domain $\mathcal{D}$ into $n$ grid cells $A_{1},\ldots,A_{n}$ with center points $\mathbf{a}_{1},\ldots,\mathbf{a}_{n}$ , respectively. The number of observed points falling into $A_{i}$ is treated as the data, $z_{i}=z(A_{i})=\sum_{j=1}^{N}1_{\mathbf{s}_{j}\in A_{i}}$ . These gridded data conditionally follow a Poisson distribution, $z_{1},\ldots,z_{n}\,|\,y(\cdot)\,\stackrel{{\scriptstyle ind.}}{{\sim}}\mathcal{P}(\mu(A_{i}))$ , where

[TABLE]

This model falls under the GGP framework, so we can apply our VL methods to obtain fast inference for point patterns.

Figure 7 shows a LGCP whose log-intensity is modeled as a GP with Matérn covariance with range parameter 2.5 on a spatial domain $\mathcal{D}=[0,50]^{2}$ , discretized into $n=2{,}500=50\times 50$ unit-square grid cells. This is equivalent to the simulation in Section 4.4, as the domain can be scaled to a unit-square domain with range 0.05, but on the original scale the grid induces areal regions with unit area, $|A_{i}|=1$ , with intensity function $\mu(A_{i})=\exp(y(\mathbf{a}_{i}))$ . Thus, the averaged results for fitting repeatedly simulated datasets from this LGCP are equivalent to the Poisson results shown in the third column of Figure 5, indicating that VL can be used to obtain virtually equivalent inference to that using a Laplace algorithm, albeit at much lower computational cost for large $n$ .

4.6 Parameter estimation

We also explored parameter estimation based on each method’s integrated-likelihood approximation. Specifically, we considered Poisson data at $n=625$ locations in the unit square, based on a GP with true smoothness $\nu=0.5$ and range $\lambda=0.05$ .

First, we simulated a single realization of the spatial data. Holding the variance fixed at the true value of one, we sequentially evaluated the integrated likelihood on a grid of values for the range and smoothness parameters, using the Laplace approximation in (10), and the VL-RF approximation with $m=20$ in (11). The exact integrated likelihood is intractable. As shown in Figure 8, the integrated likelihoods as approximated by Laplace and by VL were almost identical, while the LowRank approximation was quite poor. These likelihood approximations are equivalent to approximations to the posterior distribution $p(\bm{\theta}|\mathbf{z})$ assuming flat priors for $\bm{\theta}$ . This indicates that Bayesian inference for GGPs can be carried out quickly and accurately using the VL approximation.

We then simulated 100 different realizations of the spatial Poisson data and examined the parameter estimates obtained by maximizing the different approximations to the integrated likelihood. The scatter plot in Figure 8 shows the parameter estimates, using $m=20$ conditioning points for VL and LowRank. While the estimates using Laplace and VL were similar, LowRank had significant outliers that increased the RMSE of the parameter estimates (see Table 2). The LowRank parameter estimation frequently diverged due to the rough likelihood surface, and for those cases we repeated the optimization with bounds $(0.001,20)$ for both range and smoothness, but LowRank still failed repeatedly.

4.7 Interpretation of simulation results

In our simulations, VL provided similar accuracy as Laplace with a considerably smaller number $m$ of conditioning points compared to LowRank. The time required per iteration for VL approaches is $\mathcal{O}(nm^{3})$ . At the expense of fully parallel computation, LowRank can be carried out in $\mathcal{O}(nm^{2})$ time by computing the decomposition of the covariance of the conditioning set once at the beginning of the procedure. However, as VL with any given $m$ , say $m=\widetilde{m}$ , was substantially more accurate than LowRank with $m=\widetilde{m}^{3/2}$ , we conclude that VL is more computationally efficient than LowRank for a given approximation accuracy, except for very smooth posteriors. The improvement in accuracy for VL relative to LowRank became even more pronounced as we increased the sample size under in-fill asymptotics.

5 Application to satellite data

We applied our methodology to a large, spatially correlated, non-Gaussian dataset of column water vapor. These data were collected by NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS), which is mounted on the NASA Aqua satellite (Borbas et al., , 2017). We considered a Level-2 near-infrared dataset of total precipitable water at a $1354\times 2030=2{,}746{,}820$ grid of 1km pixels. We used up to 500,000 of these data points for our demonstration. Our dataset was observed between 13:45 and 13:50 on March 28, 2019 over a rectangular region off the coast of west Africa with west, north, east, and south bounding coordinates -42.707, 67.476, 4.443, and 45.126, respectively and was found on the NASA Earthdata website, https://earthdata.nasa.gov.

Precipitable water amounts are continuous and strictly positive, with values near 0 corresponding to clear skies and larger values implying more water. Exploratory plots showed a right-skewed density, so we assumed that the data can be modeled using a spatial generalized GP with a Gamma likelihood:

[TABLE]

where $E(z(\mathbf{s})|y(\mathbf{s}))=\exp(y(\mathbf{s}))$ , $\mu(\mathbf{s})=\beta_{1}+\beta_{2}\,\text{lat}(\mathbf{s})$ is a linear trend consisting of an intercept and a latitudinal gradient, and $K$ is an isotropic Matérn covariance function with variance $\sigma^{2}$ , smoothness $\nu$ , and range parameter $\rho$ . We estimated the parameter values $\beta_{1}=-1.515$ , $\beta_{2}=0.000766$ , $a=0.89$ , $\sigma^{2}=.25$ , $\rho=31$ km, and $\nu=3$ as described in Section LABEL:supp:MODIS.

We again compared our VL approach to a LowRank method. We randomly sampled $n=250{,}000$ observations $\mathbf{z}$ of the full dataset as training data, and $250{,}000$ of the remaining observations as test data $\mathbf{z}^{\star}$ at locations $\mathcal{S}^{\star}$ . For VL, we set $m=20$ following our recommendations in Section 4.4 and further justified in Section LABEL:supp:MODIS. For LowRank, we used $m=89\approx(20)^{3/2}$ for a computationally fair comparison. On an Intel Xeon E5-2690 CPU with 64GB RAM, Algorithm 1 for VL required 10 iterations with a total run time of about 18 minutes (1.8 minutes per iteration). Taking advantage of an implementation that achieves the $\mathcal{O}(nm^{2})$ scaling, each iteration for LowRank required 1.3 minutes on average across 6 iterations. Note that, based on our numerical experiments, we estimate that Laplace without further approximation would take months of computing time, while HMC-based approaches would take years to achieve the same accuracy as VL.

Figure 9(a) shows prediction maps of the posterior mean $E(\mathbf{z}^{\star}|\mathbf{z})=E(\exp(\mathbf{y}^{\star})|\mathbf{z})$ with $i$ th entry $\exp(E(y_{i}^{\star}|\mathbf{z})+var(y_{i}^{\star}|\mathbf{z})/2)$ . Clearly, much of the fine-scale structure was lost when using LowRank. To further illustrate this issue, we made predictions on a $200\times 200$ grid over a small subregion. As shown in Figure 9(b), the LowRank predictions were virtually useless at this scale, while VL was able to recover much of the important spatial structure from the noisy and incomplete training data.

Table 3 quantifies the improvement in predictions using VL over LowRank. We computed the MSE based on the posterior mean $E(\mathbf{z}^{\star}|\mathbf{z})$ . To compare the accuracy of the uncertainty quantification, we also computed the continuous ranked probability score (CRPS; e.g., Gneiting and Katzfuss, , 2014), which encourages well calibrated and sharp predictive distributions. Table 3 shows that VL strongly outperformed LowRank for comparable computational complexity.

6 Conclusions and future work

In this work, we presented a novel combination of techniques that allow for efficient analysis of large, spatially correlated, non-Gaussian datasets or point patterns. The key idea is to apply a Vecchia approximation to the Gaussian (and hence tractable) joint distribution of GP realizations and pseudo-data at each iteration of a Newton-Raphson algorithm, leading to a Gaussian Laplace approximation. This allows us to carry out inference for non-Gaussian data by iteratively applying existing Vecchia approximations for Gaussian pseudo-data, which are updated at each iteration. Our Vecchia-Laplace (VL) techniques guarantee linear complexity in the data size while capturing spatial dependence at all scales. Compared to alternative methods such as low-rank approximations or sampling-based approaches, our VL approximations can achieve higher accuracy at a fraction of the computation time.

Vecchia approximations require specification of an ordering of the model variables and of a conditioning set for each variable, and these two issues also play a critical role in the performance of our VL approaches. Through simulation studies, we showed that, in one-dimensional space, interweaving the GP realizations and the pseudo-data (Katzfuss and Guinness, , 2019) can provide results that are virtually indistinguishable from Laplace, even for very small conditioning sets. For two-dimensional space, we recommend the response-first Vecchia approximation (Katzfuss et al., 2020a, ). Due to the computational efficiency of our approach, it is also possible to use a VL approximation of the integrated likelihood for parameter inference, for which we recommend the interweaved ordering in any dimension.

The methods and algorithms proposed here are implemented in the R package GPvecchia (Katzfuss et al., 2020c, ). The default settings of the package functions reflect the recommendations in the previous paragraph. The tuning parameter $m$ , which controls a trade-off between accuracy and computation cost, can be set by the user. In practice, a useful strategy is to start with a relatively small value of $m$ and gradually increase it until the inference converges or the computational resources are exhausted.

Our methods and code are applicable in more than two dimensions, but a thorough investigation of their properties in this context will be carried out in future work. For example, Katzfuss et al., 2020b show that Vecchia-based approximations with appropriate extensions can be highly accurate for computer-model emulation in up to ten dimensions; a combination with our VL methods could allow emulation of non-Gaussian computer-model output. Other potential future work includes extending the Laplace approximation in our methods to an integrated nested Laplace approximation (INLA) that improves the accuracy of the marginal posteriors of the latent variables (Rue et al., , 2009, Sect. 3.2); the use of conjugate-gradient (Zhang et al., , 2019) or incomplete-Cholesky (Schäfer et al., , 2020) methods that allow the computation of the latent posterior mean in linear time even for completely latent Vecchia approximations; or extensions to spatio-temporal filtering using Vecchia approximations based on domain partitioning (Jurek and Katzfuss, , 2018, 2020).

Acknowledgments

Katzfuss’ research was partially supported by National Science Foundation (NSF) grant DMS–1521676 and NSF CAREER grant DMS–1654083. The authors would like to thank Joe Guinness, Jianhua Huang, Leonid Koralov, Boris Vainberg, and several anonymous reviewers for helpful comments and suggestions. Wenlong Gong, Joe Guinness, Marcin Jurek, and Jingjie Zhang contributed to the R package GPvecchia, and Florian Schäfer provided C code for the exact maxmin ordering.

Appendix A Newton-Raphson update using pseudo-data

The desired Newton-Raphson update has the form

[TABLE]

As shown in Section 2.2, we have $\frac{\partial}{\partial\mathbf{y}}\log p(\mathbf{y}|\mathbf{z})=\mathbf{K}^{-1}(\bm{\mu}-\mathbf{y})+\mathbf{u}_{\mathbf{y}}$ and $-\frac{\partial^{2}}{\partial\mathbf{y}\mathbf{y}^{\prime}}\log p(\mathbf{y}|\mathbf{z})=\mathbf{K}^{-1}+\mathbf{D}_{\mathbf{y}}^{-1}=\mathbf{W}_{\mathbf{y}}$ . Using an idea similar to iterative weighted least squares (Section 2.5, McCullagh and Nelder, , 1989), we can premultiply the variable $\mathbf{y}$ by the Hessian to combine terms, and then rearrange and pull out the prior mean. Dropping the iteration subscript of $\mathbf{y}$ for ease of notation, we can write (12) as

[TABLE]

where $\mathbf{t}=\mathbf{y}+\mathbf{D}\mathbf{u}$ .

Now consider the posterior mean in the case of a Gaussian likelihood $\mathbf{t}|\mathbf{y}\sim\mathcal{N}_{n}(\mathbf{y},\mathbf{D})$ with a conjugate Gaussian prior, $\mathbf{y}\sim\mathcal{N}_{n}(\bm{\mu},\mathbf{K})$ . Employing a well-known formula, we have

[TABLE]

Thus, we have $\mathbf{h}(\mathbf{y})=\operatorname{E}(\mathbf{y}|\mathbf{t})$ , the posterior mean under the assumption of Gaussian pseudo-data $\mathbf{t}$ .

Appendix B Computing $\mathbf{U}$

Consider a general Vecchia approximation of the form (7). To obtain $\mathbf{U}$ , define $C(x_{i},x_{j})$ as the covariance between $x_{i}$ and $x_{j}$ implied by the true model; that is, $C(y_{i},y_{j})=C(t_{i},y_{j})=K(\mathbf{s}_{i},\mathbf{s}_{j})$ and $C(t_{i},t_{j})=K(\mathbf{s}_{i},\mathbf{s}_{j})+\mathbbm{1}_{i=j}d_{i}$ . Then, the $(j,i)$ th element of $\mathbf{U}$ can be calculated as

[TABLE]

where $\mathbf{b}_{i}^{\prime}=C(x_{i},\mathbf{x}_{c(i)})C(\mathbf{x}_{c(i)},\mathbf{x}_{c(i)})^{-1}$ , $r_{i}=C(x_{i},x_{i})-\mathbf{b}_{i}^{\prime}C(\mathbf{x}_{c(i)},x_{i})$ , and $b_{i}^{(j)}$ denotes the $k$ th element of $\mathbf{b}_{i}$ if $j$ is the $k$ th element in $c(i)$ (i.e., $b_{i}^{(j)}$ is the element of $\mathbf{b}_{i}$ corresponding to $x_{j}$ ).

Bibliography59

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Banerjee et al., (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B , 70(4):825–848.
2Bonat and Ribeiro Jr, (2016) Bonat, W. H. and Ribeiro Jr, P. J. (2016). Practical likelihood analysis for spatial generalized linear mixed models. Environmetrics , 27(2):83–89.
3Borbas et al., (2017) Borbas, E., Menzel, P., and Gao, B. C. (2017). MODIS Atmosphere L 2 Water Vapor Product.
4Boyd and Vandenberghe, (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization . Cambridge University Press.
5Bradley et al., (2018) Bradley, J. R., Holan, S. H., and Wikle, C. K. (2018). Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data (with discussion). Bayesian Analysis , 13(1):253–310.
6Breslow and Clayton, (1993) Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association , 88(421):9–25.
7Chan and Dong, (2011) Chan, A. B. and Dong, D. (2011). Generalized Gaussian process models. In CVPR , pages 2681–2688.
8Cressie and Johannesson, (2008) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B , 70(1):209–226.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

Abstract

1 Introduction

2 Review of existing results

2.1 Generalized Gaussian processes

2.2 Review of the Laplace approximation

2.3 Review of the general Vecchia approximation

2.4 Ordering in Vecchia approximations

2.4.1 Interweaved (IW) ordering

2.4.2 Response-first (RF) ordering

3 Vecchia-Laplace methods

3.1 The VL algorithm

3.2 Integrated likelihood for parameter inference

3.3 Predictions at unobserved locations

3.4 Properties

3.4.1 Complexity

3.4.2 Approximation errors

3.4.3 Convergence

4 Simulations and comparisons

4.1 Comparison to MCMC

4.2 Computational scaling of Laplace approximations

4.3 VL accuracy in one-dimensional space

4.4 VL accuracy in two-dimensional space

4.5 Simulations for log-Gaussian Cox processes

4.6 Parameter estimation

4.7 Interpretation of simulation results

5 Application to satellite data

6 Conclusions and future work

Acknowledgments

Appendix A Newton-Raphson update using pseudo-data

Appendix B Computing U\mathbf{U}U

Appendix B Computing $\mathbf{U}$