A Hierarchical Spatio-Temporal Analog Forecasting Model for Count Data
Patrick L. McDermott, Christopher K. Wikle, Joshua Millspaugh

TL;DR
This paper introduces a Bayesian hierarchical spatio-temporal analog forecasting model for count data, improving forecast accuracy and uncertainty quantification in ecological applications like waterfowl settling patterns.
Contribution
It extends previous analog forecasting methods into a formal Bayesian hierarchical framework for count data, demonstrating enhanced accuracy over traditional models.
Findings
Hierarchical analog model outperforms Poisson Bayesian models in forecast accuracy.
The Bayesian approach allows rigorous uncertainty quantification.
Application to waterfowl data shows practical utility in ecological forecasting.
Abstract
1. Analog forecasting has been successful at producing robust forecasts for a variety of ecological and physical processes. Analog forecasting is a mechanism-free nonlinear method that forecasts a system forward in time by examining how past states deemed similar to the current state moved forward. Previous work on analog forecasting has typically been presented in an empirical or heuristic context, as opposed to a formal statistical context. 2. The model presented here extends the model-based analog method of McDermott and Wikle (2016) by placing analog forecasting within a fully hierarchical statistical frame- work. In particular, a Bayesian hierarchical spatial-temporal Poisson analog forecasting model is formulated. 3. In comparison to a Poisson Bayesian hierarchical model with a latent dynamical spatio- temporal process, the hierarchical analog model consistently produced more…
| 1999 | 2009 | 2014 | ||||||
|---|---|---|---|---|---|---|---|---|
| Model | MSPE | Corr | MSPE | Corr | MSPE | Corr | ||
| HBA1 | 58.822 | 83.031% | 63.056 | 70.307% | 59.694 | 78.699% | ||
| HBA2 | 62.575 | 82.856% | 70.085 | 66.808% | 57.799 | 79.446% | ||
| PST | - | - | 73.435 | 66.103% | 69.975 | 77.780% | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
A Hierarchical Spatio-Temporal Analog Forecasting Model for Count Data
Patrick L. McDermott 22footnotemark: 2 , Christopher K. Wikle, Joshua Millspaugh Correspondence to: P.L. McDermott, Department of Statistics, University of Missouri, 146 Middlebush Hall, Columbia, MO 65211 U.S.A.. E-mail: [email protected] of Statistics, University of Missouri, Columbia, MO, 65211, U.S.A.Wildlife Biology Program, University of Montana, Missoula, MT, 59812, U.S.A.
Summary
Analog forecasting has been successful at producing robust forecasts for a variety of ecological and physical processes. Analog forecasting is a mechanism-free nonlinear method that forecasts a system forward in time by examining how past states deemed similar to the current state moved forward. Previous work on analog forecasting has typically been presented in an empirical or heuristic context, as opposed to a formal statistical context. 2. 2.
The model presented here extends the model-based analog method of McDermott and Wikle (2016) by placing analog forecasting within a fully hierarchical statistical framework. In particular, a Bayesian hierarchical spatial-temporal Poisson analog forecasting model is formulated. 3. 3.
In comparison to a Poisson Bayesian hierarchical model with a latent dynamical spatio-temporal process, the hierarchical analog model consistently produced more accurate forecasts. By using a Bayesian approach, the hierarchical analog model is able to quantify rigorously the uncertainty associated with forecasts. 4. 4.
Forecasting waterfowl settling patterns in the northwestern United States and Canada is conducted by applying the hierarchical analog model to a breeding population survey dataset. Sea Surface Temperature (SST) in the Pacific ocean is used to help identify potential analogs for the waterfowl settling patterns.
Keywords: Nonlinear forecasting; hierarchical Bayesian models; ecological forecasting; waterfowl settling patterns
Introduction
Contemporary issues in natural resource management such as climate change rely increasingly on quantitative forecasts at time scales ranging from seasonal to decadal (e.g., LeBrun et al., 2016). There are great challenges when making such forecasts in a rapidly changing environment. One of the most important challenges to policy and management is to quantify the uncertainty of the forecasts (e.g., Clark et al., 2001; Conroy et al., 2011, and references therein). There are many potential issues with quantifying uncertainty, related to the characterization of uncertainties in data, mechanistic processes, and interactions across biological and physical systems (e.g., Oliver and Roy, 2015). Perhaps surprisingly, in many cases, the best forecast models rely on non-parametric and “mechanism-free” specifications (e.g., Perretti et al., 2013; Ward et al., 2014). Bayesian models in general, and Bayeisan hierarchical models in particular, provide a comprehensive modeling framework which account for multiple sources of uncertainty in ecological models (e.g., Wikle, 2003; Royle and Dorazio, 2008; Cressie et al., 2009, to name a few); for a historical overview see Ellison (2004). To date, there have been few attempts to cast “mechanism-free” models within the Bayesian framework (McDermott and Wikle, 2016).
Quantifying uncertainty for spatial-temporal ecological processes is complicated because the evolution of these processes over time is often nonlinear. One mechanism-free solution to the spatio-temporal forecasting problem is known as “analog forecasting” (e.g., Lorenz, 1969). Analog forecasting uses past states of a system that are similar to the current state and then assumes that the current state of the system will evolve in a manner similar to how the identified past states evolved. Analog forecasting is appealing for dynamical processes governed by some underlying, but unspecified, deterministic law. Specifically, analog forecasting leverages the predictability in these types of systems by finding past trajectories similar to the current trajectory of the system.
Much of the current development of spatio-temporal analog methods utilizes the idea of embedding a dynamical system in time, similar to the simplex prediction method outlined in Sugihara and May (1990) for univariate time series. Indeed, the Sugihara and May (1990) approach was one of the first practical methods to introduce the idea of embedding a dynamical system in the context of nonlinear forecasting. Their methods utilized the state-space dynamical system reconstruction theory of Takens (1981). In complicated dynamical systems, one rarely observes all of the state variables. State-space reconstruction allows one to reconstruct a dynamical system with only a subset of the state variables, by considering those state variables at multiple lags in the past. As dynamical systems evolve in time they tend to revisit previous paths in the phase space, where these paths live on some low-dimensional manifold of the entire space (i.e., the attractor). Thus, through the use of state-space reconstruction one can recover features of past dynamical paths along the attractor. Sugihara and May (1990) recognized the utility of state-space reconstruction within the context of nonlinear forecasting. In particular, they showed how embedding vectors, created by lagging past states of a system (historical data) in time (e.g., see Chapter 3 of Cressie and Wikle, 2011), could be utilized to find robust analogs for the current state of the system. This remarkably simple forecasting method has proved successful in a multitude of time series applications (e.g., Sugihara et al., 2012; Perretti et al., 2013; Zhao and Giannakis, 2014) .
Mechanism-free and analog methods traditionally have relied on non-parametric and/or heuristic approaches that did not include a formal probabilistic error structure (although, see Tippett and DelSole, 2013). Modern non-parametric analog methods require choice of the embedding dimension of the analogs, the number of past analogs to consider, and weights for those analogs. All of these choices can significantly impact the analog forecast. For example, the question of how many past analogs to use can be thought of as a k-nearest neighbor problem, where the neighborhood consists of the analogs most similar to the current state of the system. Given the number of “neighboring” analogs, a kernel defined by a smoothing parameter is typically used to determine the weights e.g., Zhao and Giannakis, 2014; McDermott and Wikle, 2016. However, previous analog forecasting implementations have employed either some heuristic method that does not explicitly account for uncertainty associated with the choice, or a multidimensional cross-validation search (e.g., Arora et al., 2013), to choose these values. The Bayesian framework described in McDermott and Wikle (2016) allows for both the estimation and incorporation of model averaging over the various parameters in the analog model, thereby accounting for the uncertainty induced by their selection.
Once framed within the context of Bayesian modeling, analog forecasting can be placed within the rich class of models available in the space-time hierarchical Bayesian framework (e.g., Cressie and Wikle, 2011; Wikle, 2015), which allows for robust quantification of uncertainity. We present here a hierarchical analog forecasting model that extends the model developed in McDermott and Wikle (2016) to include a formal non-Gaussian data model – specifically, a Poisson model to accommodate count data. This is the first analog method that accounts explicitly for non-Gaussian data within a statistical framework. The model is applied to the problem of producing one year-ahead forecasts of waterfowl settling patterns given the state of the Pacific ocean sea surface temperature (SST). Because spatio-temporal analog forecasting can quickly become prohibitive for high-dimensional processes, we introduce an approach for spatio-temporal dimension reduction of count data known as nonnegative matrix factorization (Lee and Seung, 2001).
Materials and Methods
0.1 Waterfowl and Sea Surface Temperature Data
Migratory waterfowl settling patterns, productivity, and survival have been shown to depend strongly on climate-related habitat conditions (e.g., Hansen and McKnight, 1964; Herter, 2012; Feldman et al., 2015). It is known that changes to habitat conditions can lead to more flexible settling patterns along a lattitudinal gradient that can mitigate site philopatry, and possibly decrease productivity or recruitment (e.g., Johnson and Grier, 1988; Karanth et al., 2014; Becker, 2015). Given the well-known relationships between Pacific ocean (particularly the tropical ocean) SSTs and North American climate conditions (e.g., Philander, 1990) and the potential for these conditions to affect waterfowl settling patterns (Sorenson et al., 1998), it is reasonable to use Pacific ocean SST as a proxy for future habitat conditions. In addition, the impact of Pacific SSTs is typically nonlinear (Hoerling et al., 1997), suggesting nonlinear evolution models are appropriate. Although others (e.g., Wu et al., 2013) have successfully forecast Mallard duck (Anas platyrhyncho) settling patterns using a drought severity index, we provide a one-year forecast given the Pacific SSTs through the previous May.
Since 1955 the U.S. Fisheries and Wildlife Service (USFWS) and Canadian Wildlife Service (CWS) have jointly conducted a Breeding Population Survey (BPS) in the northern United States and Canada. Each spring (mid to late May) crews consisting of one pilot and one observer fly transect lines and record counts of various waterfowl species. For selected areas, ground crews also record counts. Each 400 m wide transect is divided into a series of segments measuring 29 km in length and an entire survey covers approximately 2.1 square kilometers. The analysis conducted here consists of the 1,067 locations between W longitude and N latitude from 1970 through 2014. The majority of survey locations north of latitude have little temporal variability, with zero counts in most years and are not considered. Although the BPS survey records counts for several species, we focus on raw indicator pair counts (i.e., counts of paired ducks and lone drakes) for mallards. The raw indicated pair counts are publicly available through the FWS Division of Migratory Management (https://migbirdapps.fws.gov/).
Monthly SST from 1970-2014 were obtained from the publicly available National Oceanic and Atmospheric Administration (NOAA) Extended Reconstruction Sea Surface Temperature (ERSST) data (http://www.esrl.noaa.gov/psd/). A subset of 3,132 locations from the ERSST data, between S-N latitude and E-E longitude with a spatial resolution of , form the SST data. We follow the common procedure from the climate science literature by creating anomalies through the subtraction of location specific monthly means calculated from a climatological average spanning the period 1970-1999 (e.g., Wilks, 2011).
0.2 Spatio-Temporal Variables
Let be a component of a dynamical system at time with spatial locations , . Suppose we have access to data from the system for time periods . The set of data at all locations for time period is defined as, . Here, we consider count valued data for . Further, we consider the use of some spatio-temporal forcing (predictor) variable, defined as, \mbox{\boldmathx\unboldmath}_{t^{\prime}}=(x_{t^{\prime}}(\mbox{\boldmathr\unboldmath}_{1}),\dots,x_{t^{\prime}}(\mbox{\boldmathr\unboldmath}_{n_{x}}))^{\prime}, for spatial locations, and time , to help forecast the process of interest (i.e., ). Note that the time indices and are separated by period(s) (i.e., ), with potentially different time scales. As discussed in more detail below, in our application, represents the number of periods the response variable is forecasted into the future. Thus, the goal here is to forecast the value of given values of for and for for . This is done by weighting the past values of based on how well corresponding past sequences of match the most recent sequence of (i.e., the most recent sequence up to time ), as described below.
Many spatio-temporal dynamical processes can be challenging to model due to the high-dimensional nature of the spatial component. Both the BPS waterfowl settling pattern data and SST data described above can be considered high-dimensional. To efficiently model such spatio-temporal processes, some form of dimension reduction is usually performed (e.g., see Chapter 7 of Cressie and Wikle, 2011). Common methods such as empirical orthogonal functions (EOFs) are not ideal for non-continuous responses such as count data because it is difficult to impose constraints (e.g., such as non-negativity). Although more general ordination methods such as principal coordinate analysis and multidimensional scaling can be useful for non-continous data (e.g., Ellison and Gotelli, 2004), these methods also do not guarantee, in general, that after dimension reduction and projection back into physical space, that the resulting process has the same support as the original data.
0.3 Response Vector Dimension Reduction
Consider the case where we have spatial locations and the -dimensional response vector at time , . We seek a -dimensional expansion coefficient vector, \mbox{\boldmath\beta\unboldmath}_{t}, associated with a set of basis functions \{\mbox{\boldmath\psi\unboldmath}_{j},\ j=1,\dots,n_{\beta}\}, where \mbox{\boldmath\psi\unboldmath}_{j}\equiv(\psi_{j}(\mathbf{s}_{1}),\dots,\psi_{j}(\mathbf{s}_{n_{y}}))^{\prime}. In particular, we seek a reduced dimension representation such that . When considering a linear basis expansion, then, we seek \mathbf{Y}_{t}\approx\mbox{\boldmath\Psi\unboldmath}\mbox{\boldmath\beta\unboldmath}_{t}, where {\mbox{\boldmath\Psi\unboldmath}}\equiv[\psi_{1},\dots,\psi_{n_{\beta}}] is a matrix. Then, the ordinary least squares estimate of the expansion coefficients is \widetilde{\mbox{\boldmath\beta\unboldmath}}_{t}=(\mbox{\boldmath\Psi\unboldmath}^{\prime}\mbox{\boldmath\Psi\unboldmath})^{-1}\mbox{\boldmath\Psi\unboldmath}^{\prime}\mathbf{Y}_{t}, assuming (\mbox{\boldmath\Psi\unboldmath}^{\prime}\mbox{\boldmath\Psi\unboldmath}) is invertible. In situations where is orthogonal, this simplifies to \widetilde{\mbox{\boldmath\beta\unboldmath}}_{t}=\mbox{\boldmath\Psi\unboldmath}^{\prime}\mathbf{Y}_{t}. As an example, derived from the scaled left singular vectors of a full data matrix, , are the EOF basis functions, and are orthogonal. A reduced rank representation of the response vectors in phase space is given by \widetilde{\mathbf{Y}}_{t}=\mbox{\boldmath\Psi\unboldmath}\widetilde{\mbox{\boldmath\beta\unboldmath}}_{t}. Typically, one then considers the expansion coefficients, \widetilde{\mbox{\boldmath\beta\unboldmath}}_{t}, as the time-varying variable of interest.
When has a constrained support, as with the count data of interest here, there is no guarantee that this back transformation (\widetilde{\mathbf{Y}}_{t}=\mbox{\boldmath\Psi\unboldmath}\widetilde{\mbox{\boldmath\beta\unboldmath}}_{t}) will result in appropriate support for the elements of (e.g., non-negative values). This issue can be important in some applications, such as the analog forecasting problem of interest here, as we specify the \mbox{\boldmath\beta\unboldmath}_{t}’s in a hierarchical model and require non-negative values upon transformation back to physical space.
We employ nonnegative matrix factorization (NMF) (e.g., Lee and Seung, 2001) to enforce non-negativity in the dimension reduction of the count data matrix. Given the data matrix , NMF gives:
[TABLE]
where is a basis function matrix and the matrix \mathbf{B}\equiv[{\mbox{\boldmath\beta\unboldmath}}_{1},\dots,{\mbox{\boldmath\beta\unboldmath}}_{T}] contains (random) projection coefficients. In reference to (1), the notation for some matrix , implies that each element of is nonnegative. NMF has been applied in a variety of disciplines because of its ability to provide efficient dimension reduction while creating nonnegative basis functions. A number of different algorithms to conduct NMF have been proposed in the literature (e.g., Berry et al., 2007), all with the goal of solving the following minimization problem:
[TABLE]
where D(\mathbf{Y},\mbox{\boldmath\Psi\unboldmath},\mathbf{B}) is a loss function and R(\mbox{\boldmath\Psi\unboldmath},\mathbf{B}) is some regularization function. Unfortunately, these NMF algorithms do not produce a unique factorization. Instead, they converge to a local minimum, thus producing different factorizations for different starting values (e.g., Boutsidis and Gallopoulos, 2008). To alleviate this non-uniqueness problem in our methodology, we use the Nonnegative Double Singular Value Decomposition (NNSVD) approach of Boutsidis and Gallopoulos (2008) to obtain starting values. Note that NNSVD was designed to produce fast convergence for sparse data structures (i.e., when contains a large number of zeros, as is the case with our BPS settling pattern data). The application to follow uses the so-called off-set NMF algorithm of Badea (2008).
0.4 Forcing Vector Dimension Reduction
The purpose of the forcing variables, \{\mbox{\boldmathx\unboldmath}_{t^{\prime}}\}, is to identify robust analogs to help predict the response variable. Further, the success of any analog forecasting model is largely determined by its ability to find robust analogs. If is large, we typically must reduce the dimension of the process by using spatial basis functions, \mbox{\boldmath\Phi\unboldmath}\equiv[\mbox{\boldmath\phi\unboldmath}_{1},\dots,\mbox{\boldmath\phi\unboldmath}_{n_{\alpha}}], where \mbox{\boldmath\phi\unboldmath}_{k}=(\phi_{k}(\mathbf{r}_{1}),\dots,\phi_{k}(\mathbf{r}_{n_{x}}))^{\prime}. As with the response vector, if we assume linear projections, we can get projection coefficients by \mbox{\boldmath\alpha\unboldmath}_{t^{\prime}}=(\mbox{\boldmath\Phi\unboldmath}^{\prime}\mbox{\boldmath\Phi\unboldmath})^{-1}\mbox{\boldmath\Phi\unboldmath}^{\prime}\mbox{\boldmathx\unboldmath}_{t^{\prime}}. McDermott and Wikle (2016) show that these projection coefficients can be combined to form time lagged embedding matrices. That is, let represent the number of periods lagged back in time, then for period we can define the following embedding matrix:
[TABLE]
These embedding matrices are critical to the success of the analog forecasting model outlined below. For example, suppose we wanted to investigate if the response variable at period was a robust analog for the response at period . One could construct an embedding matrix corresponding to period and another matrix for period . We could assess the quality of as an analog for the response at period , by examining the “distance” between and .
The selection of basis functions to obtain \mbox{\boldmath\alpha\unboldmath}_{t^{\prime}} can be flexible here and different choices of could potentially produce different sets of analogs. For example, EOFs would be an obvious choice if linearity was assumed. However, there is scientific evidence of a nonlinear relationship between precipitation (which could potentially affect habitat conditions) and SST anomalies (e.g., Hoerling et al., 1997), so we investigated several nonlinear dimension reduction techniques for the waterfowl settling pattern application.
0.5 Hierarchical Analog Forecasting Model
We now discuss the specifics of the spatio-temporal hierarchical Bayesian analog (HBA) forecasting model for count data. All of the stages of the presented HBA model are summarized in Table 1 below. Since our responses are count valued, we model the data with a Poisson distribution conditional on a spatio-temporal intensity process as:
[TABLE]
where \{\mbox{\boldmath\lambda\unboldmath}_{t}:t=1,\dots,T\} is the -dimensional intensity process at locations . Using the basis functions from the NMF approximation (1), let \mbox{\boldmath\lambda\unboldmath}_{t}=\mbox{\boldmath\Psi\unboldmath}\mbox{\boldmath\beta\unboldmath}_{t}. Recall, the NMF guarantees \mbox{\boldmath\Psi\unboldmath}\geq 0 and thus, for \mbox{\boldmath\lambda\unboldmath}_{t} to be nonnegative, the distribution for \mbox{\boldmath\beta\unboldmath}_{t} should have nonnegative support. If we denote the model parameters by \widetilde{\mbox{\boldmath\Theta\unboldmath}} (see below), then for period , the process model on {\mbox{\boldmath\beta\unboldmath}}_{t} is given by the truncated normal distribution:
[TABLE]
where, for period , we define \mathbf{B}_{-t}\equiv[{\mbox{\boldmath\beta\unboldmath}}_{1},\dots,{\mbox{\boldmath\beta\unboldmath}}_{t-1},{\mbox{\boldmath\beta\unboldmath}}_{t+1},\dots,{\mbox{\boldmath\beta\unboldmath}}_{T}] as the matrix of possible analogs and {\mbox{\boldmath\omega\unboldmath}}_{t}=(\omega(\mathbf{A}_{t},\mathbf{A}_{1},{\mathbf{\theta}}),\dots,\omega(\mathbf{A}_{t},\mathbf{A}_{t-1},{\mathbf{\theta}}),\omega(\mathbf{A}_{t},\mathbf{A}_{t+1},{\mathbf{\theta}}),\dots,\omega(\mathbf{A}_{t},\mathbf{A}_{T},{\mathbf{\theta}}))^{\prime}, as the weight associated with each of the potential analogs. Thus, a weighted prediction of the new {\mbox{\boldmath\beta\unboldmath}}_{t} is based on the linear combination of past {\mbox{\boldmath\beta\unboldmath}}_{t} values, {\mathbf{B}}_{-t}{\mbox{\boldmath\omega\unboldmath}}_{t}. Further, as described in Cangelosi and Hooten (2009), for a normal density left-truncated at zero, the mean is biased and this bias increases for values close to zero (which is the case for many elements of \mbox{\boldmath\beta\unboldmath}_{t}) since the left tail of the distribution has been distorted from the truncation at zero. In equation (5), is the bias correction function presented in Cangelosi and Hooten (2009). The need for the constant arises because as \mathbf{B}_{-t}\ {\mbox{\boldmath\omega\unboldmath}}_{t}\to 0, we have h(\ \mathbf{B}_{-t}\ {\mbox{\boldmath\omega\unboldmath}}_{t},\sigma^{2}_{\eta}\ )\to-\infty. Thus, is set to an arbitrarily small value for computational purposes. The weights ({\mbox{\boldmath\omega\unboldmath}}_{t}) in (5) are critical to the success of the analog forecasting model presented here. For example, during the training of the model, these weights determine how much each potential analog in is weighted in order to predict {\mbox{\boldmath\beta\unboldmath}}_{t}. We describe our choice of weights in the next section. It is important to note that, although the weights are applied to the potential analogs in a linear fashion (i.e., \mathbf{B}_{-t}\ {\mbox{\boldmath\omega\unboldmath}}_{t}), the resulting prediction for {\mbox{\boldmath\beta\unboldmath}}_{t} can be considered nonlinear since the weights are determined by a nonlinear function (i.e., the Gaussian kernel).
The choice of analog weights and the analog “neighborhood” are closely connected and important considerations in analog forecasting. Let denote the neighborhood of the analog for period , where the number of nearest neighbors considered is represented by . Defining as a distance metric and as a set of kernel dependent parameters, we have the following kernel weight function:
[TABLE]
for , where is a kernel smoothing parameter and is a parameter associated with the distance function (see the Appendix). Let, be the normalized version of , where the normalization is accomplished by dividing by the sum of across all potential analogs for period . Any valid distance metric can be applied here; e.g., analog forecasting methods traditionally use Euclidean distance. However, analog forecasting relies on identifying analogs that not only resemble the current state of the system, but also move forward in a similar manner. For this reason, analogs that share a similar trajectory in phase space as the current trajectory of the system will produce the most successful forecasts. Procrustes distance (e.g., see Hastie et al., 2013) is a multivariate distance metric that transforms a comparison object (i.e., ) to a target object (i.e., ), before calculating the Frobenius matrix norm between the target object and the transformed comparison object. Therefore, by defining as the Procrustes distance (see the Appendix for the full details, including the specification of ) we are able to compare the shape, and thus, the trajectory, between two embedding matrices (see Figure 1 for a visual example). In the definition of , we let represent the number of lagged time periods when forming . Since different values of will lead to different embedding matrices, and thus potentially different analogs, we estimate and give it a discrete uniform prior such that, . We also assign a discrete uniform prior to the number of neighbors parameter, . Finally, and are both assigned inverse gamma priors, and .
Sampling from the posterior distribution is accomplished with Markov Chain Monte Carlo (MCMC) methods (e.g., Robert and Casella, 2004). Due to the lack of conjugacy, all parameters are updated with a Metropolis-Hastings step (see the outline in the Appendix). During each iteration of the MCMC sampler, parameters are sampled using data from training periods, . At this stage, all prediction is “in-sample”. For period , out-of-sample forecasts are then drawn from the posterior prediction distribution, \mathbf{Y}_{T+1}^{(\ell)}\sim\text{Poi}(\mbox{\boldmath\Psi\unboldmath}\mbox{\boldmath\beta\unboldmath}_{T+1}^{(\ell)}), after each iteration, , of the sampler. By defining, \mathbf{B}_{T+1}^{(\ell)}=[{\mbox{\boldmath\beta\unboldmath}}^{(\ell)}_{1},\dots,{\mbox{\boldmath\beta\unboldmath}}^{(\ell)}_{T}] and {\mbox{\boldmath\omega\unboldmath}}^{(\ell)}_{T+1}=(\omega(\mathbf{A}_{T+1},\mathbf{A}_{1},{\mathbf{\theta}}^{(\ell)}),\dots,\omega(\mathbf{A}_{T+1},\mathbf{A}_{T},{\mathbf{\theta}}^{(\ell)}))^{\prime}, the projection coefficients for period can be forecasted for the iteration as, \mbox{\boldmath\beta\unboldmath}_{T+1}^{(\ell)}=\mathbf{B}_{T+1}^{(\ell)}{\mbox{\boldmath\omega\unboldmath}}_{T+1}^{(\ell)}. In this example, is the initial condition for which we seek matching past analogs. Then, from the definition of (3), the first element of is \mbox{\boldmath\alpha\unboldmath}_{T^{\prime}+1}, which is lagged periods behind the forecast target time, , thus leading to a -period ahead forecast of (see Figure 1 for an illustrative example).
0.6 Model Setup
We evaluate the predictive ability of the model by considering forecasts of waterfowl counts in 2009 and 2014, while also producing hindcasts for 1999. The year 2009 was chosen due to the relative lack of correlation between the mallard counts in 2009 and the prior year. Further, we choose to consider 1999 because it was a strong La Niña year, which allows us to demonstrate how the model can effectively forecast years where waterfowl patterns may change due to alternating habitat conditions. All of the data prior to the respective year is used for training 2009 and 2014, while the hindcast is implemented by training on all of the data except the counts for 1999. We make one-year ahead forecasts for all time periods by setting .
We compare the forecasting skill of the HBA model with a fairly state-of-the art hierarchical Bayesian Poisson space-time model (referred to as the PST model). The PST model is comprised of a Poisson data model, \mathbf{Y}_{t}\mid\mbox{\boldmath\lambda\unboldmath}_{t}\sim\text{Poi}(\mbox{\boldmath\lambda\unboldmath}_{t}), and process model defined as, \text{log}(\mbox{\boldmath\lambda\unboldmath}_{t})\sim\text{Gau}({\mbox{\boldmath\mu\unboldmath}}+\mbox{\boldmath\Psi\unboldmath}\mbox{\boldmath\alpha\unboldmath}_{t},\sigma^{2}_{\epsilon}\mathbf{I}). Here, is a spatially indexed mean (modeled with spatial covariates), and \mbox{\boldmath\alpha\unboldmath}_{t} are projection coefficients formed from kernel principal component analysis (see below). The projection coefficients are modeled with a reduced rank vector-autoregressive (VAR) structure such that, \mbox{\boldmath\alpha\unboldmath}_{t}\sim\text{Gau}(\mathbf{H}\mbox{\boldmath\alpha\unboldmath}_{t-1},\mbox{\boldmath\Sigma\unboldmath}_{\gamma}) (e.g., see Chapter 7 of Cressie and Wikle, 2011). Specification of the process model for the PST model can be thought of as a linear version of the regime-dependent nonlinear model presented in Wu et al. (2013). Comparison of the posterior predictions for the HBA and PST model is carried out by using mean squared prediction error (MSPE) and the correlation between the forecasted and observed values as in McDermott and Wikle (2016).
The HBA model was implemented for all forecasted years with the same tuning parameters and prior distributions. Note, as increases, the NMF basis function approximation in (1) generally becomes more accurate. Because there is a computational cost to using higher values of , we found that appropriately balanced computational efficiency with the accuracy of the approximation.
Regarding the SST basis functions, in addition to the more traditional empirical orthogonal function (EOFs; i.e., spatio-temporal principal components) linear dimension reduction, we implemented the following nonlinear dimension reduction methods: local linear embeddings (e.g., Roweis and Saul, 2000), diffusion maps (e.g., Coifman and Lafon, 2006), kernel principal component analysis (KPCA) (e.g., Scholkopf et al., 1998), and Laplacian eigenmaps (e.g., Belkin and Niyogi, 2001). Our analysis found Laplacian eigenmaps to be the most helpful of these nonlinear methods for identifying robust analogs. Therefore separate models, one with EOF basis functions (HBA1) and a second model with Laplacian eignmap basis functions (HBA2), were implemented. Approximately of the variation in the SST data was accounted for by the first 16 EOFs (i.e. ). Laplacian eigenmaps are calculated through an eigenvector decomposition of a Laplacian matrix, whose construction involves an adjacency matrix formed through either a kernel or a nearest neighbor approach (e.g., Belkin and Niyogi, 2001). We implemented the nearest neighbor approach, with again, by sampling the number of neighbors as a parameter in the MCMC sampler over the following grid: .
We used a value of for the parameter in (5); the model did not seem overly sensitive to this choice. For and , we assigned priors and , respectively. The kernel and process error parameters are given inverse-gamma priors: (which is only moderately informative in comparison to the small scale of the Gaussian kernel in (6)) and . All models were run for 20,000 iterations with the first 2,000 considered burn-in.
Results
Prediction skill of the HBA and PST models was evaluated through calculation of the MSPE, defined as the mean of the squared differences between the posterior predicted means and the observed counts averaged across all spatial locations. The correlation between the observed counts and the mean of the posterior predictions was also used to evaluate the forecasting models, as is often considered for spatio-temporal prediction (e.g., Wilks, 2006). As displayed in Table 2, the HBA model out-performed the PST model in both 2009 and 2014, in that the HBA models had higher correlations and lower MSPE values for the two forecasted years. For 1999 and 2009, the EOF based analog model (HBA1) produced the most accurate results and the Laplacian eigenmaps model (HBA2) out-performed the EOF model in 2014. The correlation and MSPE for the hindcast appear to align with the results for the two forecasted years. We applied the model to several other hold-out years (not shown here) and found similar results, with the HBA always performing as well or better than the PST model and with the HBA1 generally, but not always, outperforming HBA2.
Figure 2 shows the hindcast and prediction maps (observed, forecasted mean, site specific lower 2.5th, and upper 2.5th percentiles) from the posterior prediction distribution. The 1999 hindcast appears to correctly predict the pattern of mallards settling more heavily in the northern region of the domain. The forecasted maps for the two out-of sample periods (2009, 2014) also appear to capture the overall pattern of the observed counts. Close examination of the uncertainty maps show that a majority of the observed counts appear to fall within the displayed 95% credible intervals. Overall, these spatial maps provide resource managers with accurate forecasts and informative intervals, thus allowing them to manage for various scenarios. It is important to note that, to our knowledge, no existing spatio-temporal nonlinear analog methods produce model based quantification of forecast uncertainty.
Discussion
Overall, many of the aspects of analog forecasting that originally made it appealing to ecologists are retained by the HBA model. The model has relatively few parameters, and performs well with data from a relatively short temporal span. Unlike other analog forecasting methods, the HBA allows users to properly quantify uncertainty in a rigorous framework. With the growing number of high-dimensional spatial-temporal ecological datasets, analog forecasting in a hierarchical framework can provide ecologists with a rich framework for making accurate forecasts with principled uncertainty quantification. The count-based spatio-temporal hierarchical Bayesian analog model methodology developed here was successful at forecasting waterfowl settling counts across multiple years. For example, the model correctly forecasted how the waterfowl settled more consistently in the northern half of the region of interest in 1999 despite the lack of correlation with patterns from the previous year. Further, there is a potential scientific explanation for this pattern. Poor habitat conditions due to drought (e.g., Wu et al., 2013), possibly linked to the tropical Pacific La Niña anomaly (e.g, Philander, 1990; Hoerling et al., 1997), could help explain why many waterfowl overflew the southern region in 1999 (e.g, Hansen and McKnight, 1964; Sorenson et al., 1998).
Due to the preponderance of zeros present in the waterfowl data, the assumption of equidispersion implicit in the data model (i.e., equation (4)) is likely violated. Wu et al. (2013) attempted to deal with the potential underdispersion in such data by using a Conway-Maxwell Poisson data model. Here we deal with this potential problem by using NMF to reduce the overall underdispersion. However, a more rigorous way to deal with the underdispersion is to use a zero-inflated Poisson (ZIP) data model (e.g., Wikle and Anderson, 2003; Ver Hoef and Jansen, 2007). Importantly, any of the various methods throughout the literature that account for underdispersion could be integrated into the presented model by adjusting the data model.
By placing analog forecasting within a hierarchical Bayesian paradigm, there are a multitude of ways in which the methodology could be extended. It should be noted that differences in the forecasts between the HBA1 and HBA2 model can be attributed to a difference in the selection of analogs. This suggests that allowing the model to simultaneously consider multiple types of basis functions is an obvious extension of the model. Through the use of a mixture model, one could potentially jointly model two or more types of basis functions. Such an approach may be useful for forecasting seasonal or yearly settling patterns that are influenced both linearly and nonlinearly by some high-dimensional variable.
Count data in ecology are ubiquitous and the model we developed is an ideal alternative to currently available quantitative methods. Ecologists routinely collect count data through visual surveys, such as the waterfowl dataset used herein, or through use of other remote technologies. For example, rapid advancement of radio-tracking technology (e.g, Kays et al., 2015) and remote-sensed cameras (e.g, He et al., 2016) have transformed the way ecologists collect count data. These widely used technologies have also changed the type of data obtained both in terms of amount and structure of resulting data. In particular, these technologies result in large data structures with spatial and temporal dependencies and our model provides an appropriate way to address these complexities while quantifying uncertainty in a rigorous manner. Often, these count data are used by ecologists to assess settling patterns, habitat relationships, or impacts of weather conditions and predict future states. For example, migration routes of terrestrial mammals are imperiled (e.g, Berger and Cain, 2014) and there is much effort to identify and predict use of important migration corridors. However, timing and use of migration corridors is affected by weather and other factors such as human disturbance. Our model provides an alternative to model and project use of these important areas while revealing factors affecting their use. Such results would have important policy decisions in wildlife management. Thus, we envision numerous applications of this model and its extensions.
Acknowledgments
This work was supported by the School of Natural Resources at the University of Missouri, the Missouri Department of Conservation, and was partially supported by the U.S. National Science Foundation (NSF) and the U.S. Census Bureau under NSF grant SES?1132031, funded through the NSF-Census Research Network (NCRN) program.
Data accessibility
The raw indicated pair counts (for the waterfowl settling pattern data) are publicly available through the FWS Division of Migratory Management (https://migbirdapps.fws.gov/). The monthly Sea Surface Temperature (SST) data are publicly available from the National Oceanic and Atmospheric Administration (NOAA) Extended Reconstruction Sea Surface Temperature (ERSST) data (http://www.esrl.noaa.gov/psd/).
Appendix A: Markov chain Monte Carlo Algorithm
The following details the MCMC algorithm used to implement the HBA model outlined above. Let represent the current iteration.
Sample using componentwise Metropolis-Hastings updates.
Let \widetilde{\mbox{\boldmath\Theta\unboldmath}}^{(\ell-1)}=\{m^{(\ell-1)},q^{(\ell-1)},\theta_{1}^{(\ell-1)},\sigma_{\eta}^{2(\ell-1)}\}. Generate a proposal value from (where is a tuning parameter) and calculate the following vectors:
\mbox{\boldmath\beta\unboldmath}_{t}^{0}=(\beta_{1,t}^{(\ell)},\dots,\beta_{j-1,t}^{(\ell)},\beta_{j,t}^{(\ell-1)},\beta_{j+1,t}^{(\ell-1)},\dots,\beta_{n_{\beta},t}^{(\ell-1)})^{\prime}
\mbox{\boldmath\beta\unboldmath}_{t}^{*}=(\beta_{1,t}^{(\ell)},\dots,\beta_{j-1,t}^{(\ell)},\beta_{j,t}^{*},\beta_{j+1,t}^{(\ell-1)},\dots,\beta_{n_{\beta},t}^{(\ell-1)})^{\prime}
\mbox{\boldmath\beta\unboldmath}_{j}^{0}=(\beta_{j,1}^{(\ell)},\dots,\beta_{j,t-1}^{(\ell)},\beta_{j,t}^{(\ell-1)},\beta_{j,t+1}^{(\ell-1)},\dots,\beta_{j,T}^{(\ell-1)})^{\prime}
\mbox{\boldmath\beta\unboldmath}_{j}^{*}=(\beta_{j,1}^{(\ell)},\dots,\beta_{j,t-1}^{(\ell)},\beta_{j,t}^{*},\beta_{j,t+1}^{(\ell-1)},\dots,\beta_{j,T}^{(\ell-1)})^{\prime}.
Next, calculate the following Metropolis-Hastings ratio:
R(\beta_{j,t}^{*},\beta_{j,t}^{(\ell-1)})=\frac{\big{[}\mbox{\boldmathY\unboldmath}_{t}\mid\mbox{\boldmath\beta\unboldmath}_{t}^{*},\mbox{\boldmath\Psi\unboldmath}\big{]}\ {\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{j,t}^{*}\mid\mbox{\boldmath\beta\unboldmath}_{j,-t}^{*},\widetilde{\mbox{\boldmath\Theta\unboldmath}}^{(\ell-1)}\big{]}}{\big{[}\mbox{\boldmathY\unboldmath}_{t}\mid\mbox{\boldmath\beta\unboldmath}_{t}^{0},\mbox{\boldmath\Psi\unboldmath}\big{]}\ {\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{j,t}^{0}\mid\mbox{\boldmath\beta\unboldmath}_{j,-t}^{0},\widetilde{\mbox{\boldmath\Theta\unboldmath}}^{(\ell-1)}\big{]}}.
Set with probability ; otherwise . Repeat for and . This derivation assumes each are sampled in the same order each iteration, in practice the order could be random and change each iteration. 2. 2.
Sample by performing inverse transform sampling with the discrete grid, . Evaluate the following probability:
p_{i}={\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{t}^{(\ell)}\mid\mbox{\boldmath\beta\unboldmath}_{-t}^{(\ell)},m_{i}^{*},q^{(\ell-1)},\theta_{1}^{(\ell-1)},\sigma_{\eta}^{2(\ell-1)}\big{]},
for each . Calculate the C.D.F. by normalizing each (i.e., ) and use inverse transform sampling to sample . 3. 3.
Sample by performing inverse transform sampling with the discrete grid, . Evaluate the following probability:
p_{k}={\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{t}^{(\ell)}\mid\mbox{\boldmath\beta\unboldmath}_{-t}^{(\ell)},m^{(\ell)},q_{k}^{*},\theta_{1}^{(\ell-1)},\sigma_{\eta}^{2(\ell-1)}\big{]},
for each . Calculate the C.D.F. by normalizing each and use inverse transform sampling to sample . 4. 4.
Sample with a Metropolis-Hastings step. Generate a proposal value from and calculate the following Metropolis-Hastings ratio:
R(\theta^{*}_{1},\theta_{1}^{(\ell-1)})=\frac{{\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{t}^{(\ell)}\mid\mbox{\boldmath\beta\unboldmath}_{-t}^{(\ell)},m^{(\ell)},q_{k}^{(\ell)},\theta_{1}^{*},\sigma_{\eta}^{2(\ell-1)}\big{]}\ \big{[}\theta_{1}^{*}\mid a_{1},b_{1}\big{]}}{{\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{t}^{(\ell)}\mid\mbox{\boldmath\beta\unboldmath}_{-t}^{(\ell)},m^{(\ell)},q_{k}^{(\ell)},\theta_{1}^{(\ell-1)},\sigma_{\eta}^{2(\ell-1)}\big{]}\ \big{[}\theta_{1}^{(\ell-1)}\mid a_{1},b_{1}\big{]}}.
Set with probability ; otherwise . 5. 5.
Sample with a Metropolis-Hastings step. Generate a proposal value from and calculate the following Metropolis-Hastings ratio:
R(\sigma_{\eta}^{2*},\sigma_{\eta}^{2(\ell-1)})=\frac{{\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{t}^{(\ell)}\mid\mbox{\boldmath\beta\unboldmath}_{-t}^{(\ell)},m^{(\ell)},q_{k}^{(\ell)},\theta_{1}^{(\ell)},\sigma_{\eta}^{2*}\big{]}\ \big{[}\sigma_{\eta}^{2*}\mid a_{2},b_{2}\big{]}}{{\prod\limits_{t=1}^{T}}\ \big{[}\mbox{\boldmath\beta\unboldmath}_{t}^{(\ell)}\mid\mbox{\boldmath\beta\unboldmath}_{-t}^{(\ell)},m^{(\ell)},q_{k}^{(\ell)},\theta_{1}^{(\ell)},\sigma_{\eta}^{2(\ell-1)}\big{]}\ \big{[}\sigma_{\eta}^{2(\ell-1)}\mid a_{2},b_{2}\big{]}}.
Set with probability ; otherwise .
Appendix B: Procrustes Distance
Suppose we have a target object matrix and a comparison object matrix . It is assumed that and have the same dimension. To compare the two objects the comparison object is superimposed onto the target object through scaling, rotation, and translation. This transformation is carried out by using the scaling parameter and the rotation matrix . The Procrustes distance between and is defined as:
,
where the “F” subscript denotes the Frobenius matrix norm. To calculate we need to center and by their respective column means to create and . Next, we calculate the singular value decomposition of and let . The positive scaling parameter is set such that .
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Arora et al. (2013) Arora, S., Little, M., and Mc Sharry, P. (2013), “Nonlinear and nonparametric modeling approaches for probabilistic forecasting of the US gross national product,” Studies in Nonlinear Dynamics and Econometrics , 17, 395–420.
- 2Badea (2008) Badea, L. (2008), “Extracting gene expression profiles common to colon and pancreatice adenocarinoma using simultaneous nonnegative matrix factorization,” Pacific Symposium of Biocomputing , 267–278.
- 3Becker (2015) Becker, P. H. (2015), “In search of the gap: temporal and spatial dynamics of settling in natal common tern recruits,” Behavioral Ecology and Sociobiology , 69, 1415–1427.
- 4Belkin and Niyogi (2001) Belkin, M. and Niyogi, P. (2001), “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” MIPS , 14.
- 5Berger and Cain (2014) Berger, J. and Cain, S. L. (2014), “Moving beyond science to protect a mammalian migration corridor,” Conservation biology , 28, 1142–1150.
- 6Berry et al. (2007) Berry, M., Browne, M., Langville, A., and R.J., P. V. P. (2007), “Algorithms and applications for approximate nonnegative matrix factorization,” Compuational Statistics and Data Analysis , 52, 155–173.
- 7Boutsidis and Gallopoulos (2008) Boutsidis, C. and Gallopoulos, E. (2008), “SVD based initialization: A head start for nonnegative matrix factorization,” Pattern Recognition , 41, 1350–1362.
- 8Cangelosi and Hooten (2009) Cangelosi, A. and Hooten, M. (2009), “Models for bounded systems with continuous dynamics,” Biometrics , 65, 850–856.
