Continuous Inference for Aggregated Point Process Data
Benjamin M. Taylor, Ricardo Andrade-Pacheco, Hugh J. W. Sturrock

TL;DR
This paper proposes continuous spatiotemporal models for aggregated count data, improving inference accuracy over traditional discrete models, with applications in epidemiology and political boundary analysis.
Contribution
It introduces novel methods for continuous inference on aggregated counts, accommodating changing and uncertain aggregation units over time.
Findings
Enhanced spatial prediction of malaria incidence in Namibia.
Improved forecasting of voting patterns under boundary changes.
Demonstrated advantages over traditional discrete models.
Abstract
This article introduces new methods for inference with count data registered on a set of aggregation units. Such data are omnipresent in epidemiology due to confidentiality issues: it is much more common to know the county in which an individual resides, say, than know their exact location in space. Inference for aggregated data has traditionally made use of models for discrete spatial variation, for example conditional autoregressive models (CAR). We argue that such discrete models can be improved from both a scientific and inferential perspective by using spatiotemporally continuous models to directly model the aggregated counts. We introduce methods for delivering (limiting) continuous inference with spatitemporal aggregated count data in which the aggregation units might change over time and are subject to uncertainty. We illustrate our methods using two examples: from epidemiology,…
| Parameter | 0m | 150m | 300m | 450m | 600m |
|---|---|---|---|---|---|
| -17.4(-18.6, -16.18) | -17.92(-19.22, -16.6) | -18.64(-19.91, -17.31) | -19.22(-20.64, -17.76) | -18.99(-20.44, -17.57) | |
| 5.03(2.59, 7.3) | 4.08(1.4, 6.89) | 3(0.19, 5.83) | 2.95(0.17, 6.17) | 1.51(-1.53, 4.82) | |
| -0.15(-0.19, -0.1) | -0.14(-0.19, -0.08) | -0.11(-0.17, -0.06) | -0.09(-0.15, -0.03) | -0.11(-0.18, -0.05) | |
| 0(-0.05, 0.04) | 0.01(-0.03, 0.05) | 0.02(-0.02, 0.07) | 0.01(-0.04, 0.06) | 0.03(-0.02, 0.08) | |
| 0.07(-0.18, 0.3) | 0.06(-0.22, 0.32) | 0.06(-0.21, 0.31) | 0.08(-0.19, 0.37) | -0.03(-0.32, 0.24) | |
| 7.35(7.18, 7.53) | 7.34(7.18, 7.51) | 7.33(7.16, 7.5) | 7.32(7.14, 7.49) | 7.33(7.15, 7.51) |
| 50% | 2.5% | 97.5% | |
|---|---|---|---|
| Intercept | 0.165 | 0.116 | 0.253 |
| EVI | 2.16 | 1.37 | 3.6 |
| LST | 1.61 | 1.13 | 2.3 |
| NDWI | 0.972 | 0.57 | 1.56 |
| 2.62 | 2.11 | 3.2 | |
| (km) | 1.82 | 1.03 | 3.02 |
| 5.64 | 5.37 | 4.24 |
| coefCON | coefLAB | coefLIB | |
|---|---|---|---|
| Intercept | 3.09 (2.86,3.37) | 1.98 (1.82,2.19) | 8.92 (6.56,1.2) |
| Income | 0.763 (0.657,0.899) | 0.955 (0.869,1.03) | 0.545 (0.327,0.792) |
| Health | 0.838 (0.736,0.918) | 1.2 (1.03,1.32) | 1.13 (0.838,1.63) |
| Crime | 0.791 (0.725,0.871) | 1.18 (1.07,1.32) | 1.44 (1.01,1.81) |
| Environment | 0.99 (0.904,1.07) | 1.04 (0.987,1.09) | 0.835 (0.676,1.01) |
| Prop. Green | 0.933 (0.902,0.968) | 0.936 (0.91,0.969) | 0.976 (0.904,1.04) |
| Prop. UKIP | 0.934 (0.88,0.996) | 0.941 (0.88,0.991) | 0.886 (0.745,1.04) |
| 0.446 (0.391,0.518) | 0.427 (0.391,0.485) | 1.2 (1.09,1.33) | |
| 4279 (3140,5778) | 5287 (3966,7323) | 7391 (6059,9274) | |
| 4.99 (1.9,1.89) | 3.65 (2.07,2.82) | 3.17 (1.62,1.48) |
| 2010 | 2015 | 2020 | |
|---|---|---|---|
| CON | 32(0.27) | 33(0.28) | 28(0.27) |
| LAB | 78(0.66) | 82(0.69) | 74(0.73) |
| LIB | 8(0.07) | 3(0.03) |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPoint processes and geometric inequalities · Spatial and Panel Data Analysis · Statistical Methods and Bayesian Inference
Continuous Inference for Aggregated Point Process Data
Benjamin M. Taylor (Lancaster University), Ricardo Andrade-Pacheco
Hugh J. W. Sturrock, (University of California, San Francisco)
Abstract
This article introduces new methods for inference with count data registered on a set of aggregation units. Such data are omnipresent in epidemiology due to confidentiality issues: it is much more common to know the county in which an individual resides, say, than know their exact location in space. Inference for aggregated data has traditionally made use of models for discrete spatial variation, for example conditional autoregressive models (CAR). We argue that such discrete models can be improved from both a scientific and inferential perspective by using spatiotemporally continuous models to directly model the aggregated counts. We introduce methods for delivering (limiting) continuous inference with spatitemporal aggregated count data in which the aggregation units might change over time and are subject to uncertainty. We illustrate our methods using two examples: from epidemiology, spatial prediction malaria incidence in Namibia; and from politics, forecasting voting under the proposed changes to parlimentary boundaries in the United Kingdom.
1 Introduction
In this article, we will use the phrase ‘aggregated point process data’ (or ‘aggregated data’, for short) to refer to discretely observed data which in reality most likely arose from an underlying spatially- or spatiotemporally-continuous (point) process. A common example of aggregated data, at least in the field of epidemiology, are disease counts observed over a set of regions in space, such as the number of cancer cases in each of the counties of a US state. In this article, we will use the term ‘agregated models’ to refer to statistical models for discretely observed data that are derived from an underlying spatially continuous model; we will see that aggregated models require some sub-region level information which in practice is often available.
Discrete models for aggregated data abound (Besag, 1974; Besag and Kooperberg, 1995; Rue et al., 2009; Lee, 2016). One possible explanation for this is because point-level information is often not available for economic or confidentiality reasons; another possible explanation is convenience, for example if the desired data are collected routinely alongside other information (Beale et al., 2008; Diggle et al., 2010). To be absolutely clear about our distinction between discrete and aggregated data, we note that it is possible to fit a discrete model and obtain spatially- or spatiotemporally-continuous inference via spatial prediction. For example, we can obtain continuous inference (within regions) from the following model:
[TABLE]
Here are the observed events in region (say ), are a set of region-specific covariates and is the value of a continuous spatial process (e.g. a spatial Gaussian process) at some point within region , such as a population-weighted centroid. The Bayesian paradigm advocates inference via the posterior density, , where are parameters describing the properties of the process , e.g. marginal variance and spatial correlation. From this model we can predict the process anywhere on the plane, so we can obtain a prediction of risk for any point in space at which appropriate values of can be found. We regard such models as being spatially discrete because they do not attempt to model the process at the sub-aggregate level.
A more common approach for making inferences for aggregated Poisson counts is to use a spatially discrete model (e.g. a conditional or simultaneous autoregressive or a Besag-York-Mollie model). While the choice of a discrete model with CAR structure may be advantageous in terms of software availability and to some extent, computational efficiency, this type of modelling approach does have some drawbacks:
- (i)
the notion of ‘neighbours’, upon which the dependence, or lack of dependence between regions in a CAR model is defined is somewhat contrived: neither the size nor shape of regions, nor their internal composition (in terms of population characteristics and location) is taken into consideration in the modelling. A second point is: how should we define neighbours when there are overlaps between regions, or uncertainty in the precise boundary? 2. (ii)
for regions that vary widely in shape and size, the covariance structures implied by CAR/SAR models can have counter-intuitive and unappealing properties, see Wall (2004); 3. (iii)
the modelling assumption of jumps in rates between neighbouring regions is unnatural – in most cases we would expect risk to vary continuously in space; and 4. (iv)
one has to be particularly careful in interpreting aggregated models in order to avoid ecological bias (Wakefield and Lyons, 2010; Wakefield et al., 2011). For example: individuals living in a region with a large frailty term will not all have the same estimated excess risk attributed to the region.
For certain types of data, aggregation need not be as limiting a factor as it may first seem. The kind of data for which the method proposed in the present article can be applied are those in which the locations of the individual cases (were they known) could be considered as a realisation of a spatiotemoporal Poisson point process Møller and Waagepetersen (2004); Baddeley et al. (2015). In the present article, we extend data-augmentation methods developed for aggregated spatial log-Gaussian Cox processes (Li et al., 2012; Taylor et al., 2015) in order to deliver continuous inference for aggregated spatiotemporal log-Gaussian Cox processes with overlapping and uncertain boundaries. We do this by incorporating into our model for the aggregated counts additional population and covariate data at the sub-regional level. Such population data are often available e.g. from the Gridded Population of the World CIESIN (2014).
If we knew the exact location of cases at the sub-aggregation level, then we would be able to fit a spatially continuous point process model e.g. Taylor et al. (2015). Our proposed method employs a data augmentation step for generating putative case locations at the sub-aggregation level using multinomial sampling (making use of additional sub-aggregation-level information) whence the parameters of the continuous model can be updated conditional on the new putative locations. Iterating these steps yields a Gibbs sampler that generates both model parameters for the continuous model as well as realisations of the cases: which can be used to transfer inference from one discretisation to another, among other things. The main novelty in the present article is to extend the research of (Li et al., 2012; Taylor et al., 2015) to include the case of spatiotemoporal aggregated data where regional boundaries change over time and may be uncertain. We show how this can be achieved with a simple modfication of the multinomial sampling step. Our modelling approach offers an elegant solution to the issues detailed in (i)–(iv) above.
The methodology discussed in the present article is related to the concept of downscaling (Berrocal et al., 2010) which, rather than being concerned with the modelling of event counts, is concerned with combining (continuous) information measured at multiple scales, such as that measured by (point-referenced) monitoring locations in continuous space and outputs from numerical models of pollution, say (Fuentes and Raftery, 2005; Matisziw et al., 2008).
Regarding the issue of ecological bias (mentioned above in respect to the disadvantages of discrete models), this occurs when we try to interpret parameter effects at one hierarchy as though they belonged to a different hierarchy, for example we try to interpret aggregate effects as individual-level effects. The aggregate modelling framework we propose in the present article does not completely avoid ecological bias, rather, it enables us to incorporate into our model information at the sub-aggregate level, and thereby come closer to the level of interpretation that might achieved under a truly continuous model.
This article is organised as follows. In Section 2, we introduce notation and discuss the simplest case of aggregated spatial data, where there is no overlap between regions and no uncertainty regarding boundaries. We then extend these methods to overlapping (Section 3), uncertain (Section 4) and spatiotemporal datasets (Section 5). In Section 6, we assess the effect of the amount of overlap on inferences in a spatial dataset for which we know the true location of the cases. Next, in Sections 7 and 8, we apply our methods to real datasets: respectively modelling spatiotemporal incidence of malaria in Namibia and modelling voting patterns in the last two UK general elections with the aim of predicting what will happen under the proposed boundary changes due to take effect in the 2020 general election. Our article closes with a discussion in Section 9.
2 Aggregated Spatial Models with Trivial or Empty Intersection of Regions
As stated above, we assume the existence of some underlying continuous point process that is responsible for the true, but unobserved pattern of points on the observation window of interest. More specifically, in this article we will assume a Cox model for the intensity function, , of this process, writing the log-intensity for respectively spatial and spatiotemporal processes as:
[TABLE]
Where (or ) is a known multiplicative offset for the intensity function, (or ) is a vector of covariates measured at each location in space (space-time), are unknown parameter effects (to be estimated) and is a spatial/spatiotemporal process. In practice it is often convenient to assume is a Gaussian process (yielding the log-Gaussian Cox process (LGCP)) – and we will do just that in the present article, but the reader should be aware that the models and methods we describe are extensible to a non-Gaussian too. We will assume that the properties of the process are parameterised by a vector . We will regard this continuous model as our ‘true’ model for the aggregated case counts.
As an example of the techniques we use in this article, we begin with the spatial case, by assuming the discete regions, , have at most trivial (line or point) or empty intersection. Inference in this case has previously been described in Taylor et al. (2015); Diggle et al. (2013); Li et al. (2012). In this case, our continuous process is not directly observed, that is we do not observe the exact locations of our points, rather we observe the total number of cases, , in region , .
In order to perform inference for this model, we aim to produce samples from
[TABLE]
To do this, we use the technique of data augmentation (van Dyk and Meng, 2001). Using a fine grid on which to represent our continuous spatial model (with a piecewise-constant representation of the process on the grid), we introduce parameters representing the number of cases falling into each grid cell, which is unknown. Instead of sampling from (2) directly, we sample from
[TABLE]
noting that (2) is a marginal of this density. Sampling from 3 can be achieved using a Gibbs scheme, alternately drawing from:
[TABLE]
If we know , then is implied, so . We can use any sampling method we like for the continuous version of the model to draw , and from this density. The density is a multinomial density, and so is straightforward to sample from. In practice, it is not necessary to strictly alternate the sampling between these two densities and, depending on the efficiency of the former sampler, it may pay to take several steps in order to increase independence between the successive values of , and used to perform the multinomial sampling.
One of the subtleties of this method is the need to compute all intersections between the regions and the computational grid on which inference for the continuous surface takes place. This allows us to correctly assign cases to grid cells in the multinomial step of the algorithm: as well as identifying which regions intersect with at particular grid cell, it is also necessary to calculate the area of intersection, the probability a case from region belongs to cell , , is
[TABLE]
where is the covariate and value for cell and and are respectively the current value of the parameters and from the MCMC chain used to sample those parameters, the latter being the spatial effect for cell . Note we use the same discretisation for the covariates as we do for the spatial process.
In general these computations incur operations, where is the number of grid cells and is the number of regions. Thankfully it is only necessary to perform these computations once: for repeated analyses on the same grid and regions the information can be reused.
In Taylor et al. (2015), the authors compare continuous and aggregated inference from a dataset concerning cases of primary biliary cirrhosis in the Newcastle-Upon-Tyne area in the UK. They show that both models produce empirically equivalent inference, albeit the prediction surfaces, as expected, are smoother for the aggregated model.
3 Aggregated Spatial Models with Non-Trivial Intersection of Regions
For datasets where the set of regions are overlapping non-trivially (that is such that ) the sampling procedure described in Section 2 needs to be modified. These modifications concern the multinomial step of the Gibbs update.
In epidemiology, the most obvious case of overlapping regions occur when data are collected at the health facility level. In the past this has been treated at worst as a point and at best as set of discrete non-overlapping regions. In reality, there will be considerable overlap in regions, arising from health care preferences, availability or whimsically. This issue is still present for aggregated facility level data, i.e. aggregating to districts: facilities that lie near the border of a district will be aggregated to the region they lie in, when in fact cases could have come from across the border. Another example is survey data: cross sectional surveys have traditionally been treated as points (e.g. school surveys for worms/malaria) but in truth, these surveys come might from a school catchment, for example, which often overlap.
Suppose we observe events falling into region i.e. . The multinomial sampling step described in Section 2 (specifically Equation 4) assigns each of these events independently to cells intersecting region with probability proportional to . When regions overlap, this would lead to the incorrect number of events being assigned to cells. Consider Figure 1 (a), this shows two overlapping regions within each of which the intensity of the process is uniform, when the regions are overlapped, the intensity in is greater than in either or . Another way of looking at this is that if the underlying was constant over space for each of the small grid cells in Figure 1 (a), then the multinomial step described in the section above would not correctly allocate events to grid cells, the result would be inhomogeneous, rather than homogeneous.
Therefore, if we wish to achieve a desired intensity in the overlap region, then we need to correspondingly reduce for cells that intersect with any intersection between two or more regions, which allows the counts in cells to be contributed from any region involved in the intersection. Figure 1 (b) shows the types of of intersections we need to account for when adjusting these probabilities. In this figure, three regions overlap a grid cell, but of course in a real-life dataset there may be many more overlapping regions and these may intersect non-trivially, for example may not be a connected set.
In Figure 1 (b) there are different intersections to take acount of (not including the area of that does not belong to any ) and in general, if regions overlap a cell, there will be intersections to account for. A further complication is that if (on the unobserved continuum) an event falls in the intersection of regions i.e. for some , we have , then may not be assigned to the aggregate total for each of with equal probability i.e. the condition
[TABLE]
may not be true. In practice, one might assume that each region has an associated measure of ‘sampling effort’, so that for example the allocation of points falling in an intersection can be assigned to region region with probability . Two further alternatives are as follows: firstly, we could also allow the sampling effort for region to vary between computational cells that intersect that region, leading to cell-specific allocation probabilities ; secondly, we could allow sampling effort to vary between all multi-way intersections of regions (see below), leading to an allocation probability of for each intersection. However, in real life applications we suspect these last two quantities, and in particular the latter, may be difficult to ascertain.
In the case of two regions and intersecting a grid cell , the correct modification to Equation 4 for region would look like:
[TABLE]
where is as defined in Equation 4. In general for a grid cell in which regions potentially intersect, the scaling factor for (i.e. the term ) would become rather complex; fortunately this term can be seen as an expectation, which can be evaluated approximately using Monte Carlo methods. Let denote the partition of cell into disjoint subsets i.e. each element of is one of the possible multi-way intersections of the regions overlapping cell discussed above and thus .
Examining the structure of , it can be seen that for a general region intersecting cell we will have:
[TABLE]
since for . We can write this as an expectation taken with respect to a uniformly ditributed random variable on :
[TABLE]
Another way to view is as \mathbb{P}(\text{xA_{i}C_{j}}).
If sampling effort is constant across space, then
[TABLE]
and we can estimate by drawing iid and using
[TABLE]
This method is implemented as an extension to the R package lgcp: the main assumption is that the are known; if this is not the case, a pragmatic (but informative) assumption might be whence events are allocated at random i.e. without preference for a particular .
4 Aggregated Spatial Models with Uncertainty in Regional Boundaries
Our discussion in Sections 2 and 3 has thus far focussed on the case that each region is a known fixed area in space. However, in practical problems, there can be uncertainty in catchment boundaries. For example, while health facility catchment area boundaries can be defined by governments or via models, individuals may not seek treatment at their nearest facility (Noor et al., 2006; Guagliardo, 2004; Tanser et al., 2001; Akin and Hutchinson, 1999). Equally, health facility catchments may span more than one district. Aggregated case numbers may therefore include cases who acquired their infection in a neighboring district. Note that as in Section 3, these districts may be overlapping.
To account for uncertainty in areas we replace these fixed regions by stochasic regions the physical size, shape and location of which is controlled by a random variable . There are many choices for the statistical distribution of , , and in practice, the way we choose it this will be context dependent (on the spatial sampling strategy we are trying to describe). It could be that we wish to impose different parametrisations on each region, and we may even wish to treat some regions as fixed. Here are two examples
could be a contraction or expansion by a factor of some fixed polygon towards or from the centroid (or some other appropriate point). in this case would be a positive-valued continuous random variable. 2. 2.
could be selected at random from a set of fixed polygons (or a combination of points and polygons) with weight proportional to . In other words would be a discrete random variable describing the mixture index and taking values in with being selected with probability .
In a similar way to how we modified Equation 4 to account for overlapping regions, we can use the same technique in the case of uncertain boundaries. Each realisation of parameters drawn from the joint distribution of , , (we permit correlation in the s) gives rise to a different partition of space into regions and a different set of intersections for each cell, . For this choice of parameters , Equation 5 holds with a minor extension of notation as detailed in the present section, but in order to calculate an we now need to take account of the uncertainty in . Let denote the th sample of realisations . we propose using the following
[TABLE]
Note that we are assuming the hyperparameters of the joint density of are known. While of course it would, in theory, be possible to learn these parameters by adding an additional hierarchy into our model, doing so would affect our ability to identify correlation in the Gaussian field, an already very challenging inferential problem Zhang (2004). We are also assuming that is independent from , , .
5 Aggregated Spatio-Temporal Models
Aggregated data are often captured at the regional level over a period of time and when this is the case, it is usually of scientific interest to model the process in both space and time. A common practical issue with aggregated data collected over a period of time is that it is often the case that regional boundaries change, or that regions merge, or split apart. Over time, we could also encounter any combination of the types of overlapping discussed in Sections 2–4.
Extension of the methods above to the spatiotemporal case is in fact quite straightforward in principle, although with the addition of the temporal dimension there will obviously be a greater computational burden. We require a continuous model for the spatiotemporal process,
[TABLE]
where is the number of and a method to deiver inference for that process on a fine grid. Such methods are discussed in Brix and Diggle (2001), Diggle et al. (2005) and Taylor et al. (2015), for example. These authors employ a Metropolis-adjusted Langevin algorithm, but more sophisticated sampling schemes, such as the Riemann manifold Hamiltonian Monte Carlo sampling scheme discussed in Girolami and Calderhead (2011) can also be extended to deliver inference for the Cox process. The method introduced by Brix and Diggle (2001) advocates analysing a series of spatiotemporal data by splitting them into chunks; the time dimension, as well as the space dimension are discretised. The main idea is to exploit the fact that for most inferential problems we find that not all data in the history of the process will affect the intensity at time : we might only need data between time and to deliver inference for time . We can therefore provide inference for a series of times by combining inferences from .
We will assume that inference proceeds on a fine spatiotemporal grid, in which case the data augmentation scheme described in Section2 applies once more: we alternately update , and conditional on followed by conditional on , and . A number of options are available in terms of the continuous model, for cells and on the computational grid (where represents space and represents time), two examples could include: a separable spatiotemporal covariance function for :
[TABLE]
or a non-seperable function, which allows for interaction between space and time:
[TABLE]
here assuming a is second-order stationary.
6 How Does The Amount of Overlap Affect Inference?
In this section, we use a point process dataset of 415 geo-referenced cases of definite or probable primary biliary cirrhosis (PBC) alive between 1987 and 1994 to illustrate the effect of increasing the amount of overlap between regions on inference. These point-referenced data were originally analysed in Prince et al. (2001) and later Taylor et al. (2015) compared inferences obtained from an aggregated version of the data to those obtained assuming an effectively continuous risk surface.
The left plot in Figure 2 shows the locations of the 415 cases and the right hand plot shows the total number of cases in each of the aggregation units. We used a subset of the indices of deprivation originally measured at the Lower Super Output Area (LSOA) level as covariates in our Poisson regression model: Income, Barriers, Environment and also included an intercept term. We only used a subset of the IMD domains because there was some collinearity between the domains measured in this region: where two variables were correlated, we removed the domain least likely to explain the incidence. We rasterised the covariates onto the grid shown in the middle and right-hand plots in Figure 3. We included a population offset from data taken at the LSOA level and rasterised as per the covariates. We created five datasets based on the point-locations and the aggregation units illustrated in the left column of Figure 3.
The five different aggregation units were created by enlarging each of the individual regions using buffers of size 0 (giving the data illustrated in the righ hand plot of Figure 2), 150, 300, 450 and 600 metres and cropping the resulting regions so the boundary was identical to that in Figure 2. Each point that fell into a non-trivial intersection of regions was randomly allocated to one of the regions. We used an exponential model for the stationary spatial covariance function, so that two points a distance apart would have the following covariance: .
Full details of the MCMC scheme employed in the analysis are given in Taylor et al. (2015). We ran the MCMC chains for each dataset for 500,000 iterations with a 10,000 iteration burnin and retaining every 490th sample. For the covariates, we used zero-mean independent Gaussian priors with standard deviation 1000, the log standard deviation of the marginal variance of the latent field, had a Gaussian prior with mean 0 and standard deviation 0.3 and the prior for the spatial dependence parameter, had mean and standard deviation ; the latter was chosen so as to avoid numerical singularity in the discrete Fourier transform employed to handle matrix operations (see Taylor et al. (2015) for further detail) and gives a range of up to approximately 1/5 of the height of the observation window.
The parameter estimates from each of the datasets is given in Table 1. Examining this table, it can be seen that the effect of income deprivation on incidence appears to be attenuated in magnitude and level of significance by increasing the amount of overlap with an effect size ranging between 5.03(2.59, 7.3) when there is no overlap to 1.51(-1.53, 4.82) when the overlap is 600m. The effect of changing the overlap on the barriers domain of the IMD was less pronounced with effect sizes being broadly similar, ranging for example from -0.15(-0.19, -0.1) in the no overlap dataset to -0.11(-0.18, -0.05) for the 600m overlap dataset. There was little difference in the poterior estimates of the enviroment deprivation covariate and also in the estimates of and . Though examining a plot of the prior and posterior for , it appears that the data provide little information on this parameter.
Figure 3 shows how the estimated mean and exceedance probabilities change under these difference levels of aggregation. These plots show similar pattern in the estimated mean and exceedances across the different scenarios, with visual differences being most likely attributable to the specific allocation of points to the aggregation units in each of the scenarios.
7 Application: Analysis of Health Facility Data in Namibia with Overlapping Catchment Areas
In this section, we analyse a spatiotemporal dataset from northern Namibia. These data consist of monthly counts of malaria cases over a 2 year period reporting to one of 16 clinics/hospitals in the study area, shown in Figure 4. Malaria incidence in this area is at the pre-elimination level, with approximately 5.8 cases reported per month.
Figure 5 shows the total number of cases of malaria reported to each of the health facilities over each month of the study period. We fitted a spatiotemoporal log-Gaussian Cox process to the monthly case counts. The grid-level version of this model took the form:
[TABLE]
In the above model, is the total number of events in the cell indexed of the computational grid at time index , is the rate of the Poisson process, is the cell area, is a known offset, is a vector of measured covariates (see below) and is the value of the latent Gaussian process at the centroid of computational grid cell .
We assumed a separable covariance function for the dependence structure of , induced through a transformation of a sequence of serially-correlated random variables, where is the maximum number of time points and is a column vector. We used the transformation , where is the marginal standard deviation of and is the spatial dependence parameter, for which, we assumed an exponential correlation function. The matrix was the matrix ‘square root’ of the covariance matrix of the spatial process conditional on and evaluated at the centroids of the spatial grid, with cells arranged in lexicographic order as in Møller et al. (1998).
The motivation for working with a transformation for is because we can put a simple prior on :
[TABLE]
where and is the time between the th and the th time index. We completed our model by assuming Gaussian priors for the log of , and :
[TABLE]
and independent priors for each component of . These are relatively diffuse priors, with the exception of the prior for , which apriori sets the range of spatial dependence to be up to around 1/5 of the width of the observation window: this is required to avoid numerical sigularities in matrix computations, see Taylor et al. (2015). Note that the units of are in latitude/longitude in this example.
As in Møller et al. (1998) and Taylor et al. (2015), we used the fast Fourier transform (FFT) to handle matrix computations (which gives a massive saving in terms of speed and also disk usage) and an adaptive Metropolis Langevin Algorithm (MALA) to draw from . While these choices above reflect our personal preference for model and MCMC algorithm, partly due to our significant previous experience with MALA; the reader should be reminded that essentially any MCMC algorithm can be used to sample from this density. Our choice has the following advantages:
- •
Ignoring the convergence rate of the MCMC chain, which is partially data-dependent (in this example we achieve convergence and good mixing in 500,000 iterations), the method is in time, and in space, where is the number of spatial computational cells.
- •
Due to not having to store full covariance matrices, the method is in terms of storing .
One disadvantage is the wrap-around effects from the FFT which we can avoid, as in this example, by extending the grid by two in each direction. We used a computational grid, yielding output on a grid.
We used the following covariates: monthly enhanced vegetation index (EVI), monthly normalized water difference index NDWI, and monthly land surface temperature (LST), each of which was available on a raster for each month from the MODIS satellite. Although we had access to it, we did not include elevation as a covariate in our model because it did not vary much over our study region. THe EVI, NDWI and LST variables were obtained via Google Earth Engine. We lagged each of the included covariates by one month for our analysis. Malaria case data occurring between May 2010 to May 2012 were collected retrospectively from patient registers from the 16 health facilities in Engela district, Namibia. Cases were defined as those individuals with a positive malaria diagnosis by rapid diagnostic test (RDT). 2010 population density data for the district were obtained from WorldPop (www.worldpop.org.uk).
The probability of attendance at each of the 16 facilities was calculated using a gravity Huff model Huff (1964). This means that the probabilities are assumed to be directly proportional to the relative ‘utility’ that each facility represents to a patient. In our case, we defined the travel time to the health facility as the only driver of the utility function. Then the probability of patient attending facility can be expressed as
[TABLE]
where is the travel time from the household of patient to facility , and is a sensitivity parameter of the model. We created the 16 catchment boundaries by using a cutoff of 20% on the probabilities .
As in the cirrhosis analysis above, we ran the MCMC chain for 500,000 iterations with a 10,000 iteration burnin and retaining every 490th sample for analysis.
Figures 6 and 7 shows the probability that the relative risk exceeds 1.5; the former for months 1–12 and the latter for months 13–24. Month 25. Table 2 shows the fixed effects from the model, which show a significantly increase in risk of malaria with increasing EVI and LST; NDWI was not significant in this analysis. These results also show that there is little dependence between time points for these months.
8 Application: Evaluating the Effect of Proposed Election Boundary Changes Around the Northern Powerhouse
In September 2016, the BBC reported that the House of Commons is considering reducing the number of UK Members from 650 to 600 (BBC, 2016a, b). It is natural to seek to understand the potential impact these proposed (major) changes to the UK political constitution might have on voting patterns. In this section, we analyse data from the 2010 and 2015 general elections and look to evaluate the impact of the proposed boundary changes in the North West in an area around the City of Manchester. In this section, we use aggregated point processes to model and forecast the number of votes for the Conservative, Labour and Liberal Democrat parties.
The (arbirtary) area we chose to analyse extends from the west coast as far east as York; and with York also as the most northerly city, it extends as far south as Uttoxeter. This area includes the cities of Blackpool, Chester, Doncaster, Liverpool, Leeds, Manchester, Mansfield, Preston, Sheffield and York. We did not include the 2005 election results in our analysis because we were unable to obtain a reliable shapefile containing the results, though had we been able to, this would also have illustrated how our method works with changing boundaries over time. The results of the 2010 and 2015 general elections are shown in Figure 8.
Figure 9 shows the proposed changes to the boundaries in this area. Probably the most interesting feature in the left hand plot concerns the proposed changes to the three Liberal Democrat seats, including the seat of the former party leader, Nick Clegg. Under the present electoral boundary proposal, each of these three seats is proposed to be split into at least two further regions that at present have a majority Labour presence.
In terms of data available at the sub-aggregate (i.e. sub-electoral ward) level, we obtained 2010 population and deprivation data (the Index of Multiple Deprivation, IMD) at the Lower Super Output Level (LSOA) level from respectively the Office for National Statistics and the Department for Communities and Local Government. We included in our model the subset of IMD domains: income, health, crime and environment. Of the other three domains, barriers, education and employment, the latter two were highly correlated with income (0.88 and 0.92 respectively) and the barriers domain is difficult to interpert in the current context (it encompasses indicators for overcrowding, homelessness, access to owner-occupation (housing), road distance to GP surgery, road distance to store/supermarket, road distance to primary school and road distance to post office) (Department of Communities and Local Government, 2011). In addition to the deprivation measured, we included as indicators for the prevalence of certain societal attitudes in each electoral ward the proportion of people voting for the green party and for UKIP in the general election of 2015. Each of the IMD and societal covariates was rasterised to the computational grid and having done this, we standardised each of the covariates to allow us to compare which factors were the most important by examining the regression coefficients.
We used the same grid-level spatiotemporal model as per the malaria example above, and again used Gaussian priors for the log of , and :
[TABLE]
and independent priors for each component of . We used a non-extended grid and ran each sampler for 1,000,000 iterations, discarding the first 50,000 as burnin and thereafter retaining every 950th for analysis.
In this example, we analysed the number of votes divided by 100 for each of the Conservative, Labour and Liberal democrat parties using three separate aggregated point process models. We divided the number of votes by 100 because mixing was slow for the full number of votes (in 2015, there were in total 1,611,910 Conservative, 2,380,969 Labour and 339,985 Liberal Democrat votes).
Table 3 shows the paramater estimates from our model for each of the parties; the difference between an estimated effect size for covariate , and 1 i.e. is an indicator of the size of the effect the s can be compared both within and between models. The estimated effects, are essentially relative risks (RR): values above 1 indicate that greater deprivation in a particular domain is associated with more votes and values below 1 indicate that lesser deprivation in a particular domain is associated with fewer votes.
With these interpretations in mind, and in reverse order of effect size: a greater number of votes cast for the Conservative party was related to lower income deprivation RR 0.763 (0.657,0.899), lower crime deprivation RR 0.791 (0.725,0.871), lower health deprivation RR 0.838 (0.736,0.918) but not related to environment deprivation 0.99 (0.904,1.07); a greater number of votes cast for the Labour party was associated with greater health deprivation RR 1.2 (1.03,1.32) and greater crime deprivation RR 1.18 (1.07,1.32) and there was inconclusive evidence for there being greater numbers of votes in areas that were less income deprived RR 0.955 (0.869,1.03) and in areas with greater environment deprivation RR 1.04 (0.987,1.09); a greater number of votes cast for the Liberal Democrat Party was associated with less income deprivation RR 0.545 (0.327,0.792), but greater crime deprivation RR 1.44 (1.01,1.81) and there was inconclusive evidence for there being greater numbers of votes in areas of lower environment deprivation RR 0.835 (0.676,1.01) and higher health deprivation RR 1.13 (0.838,1.63).
Clearly, we expect there to be proportionally fewer votes for each of the three main parties in the presence of greater numbers of voters for other parties and this is reflected in the coefficients for proportion Green and proportion UKIP being below 1 across all parties. Of the remaining coefficients the value of is higher for the Liberal Democrats, which indicates voter support is spatially more sporadic.
Figure 10 and Table 4 show the forecast majority, based on the largest number of predicted votes. These predictions were computed in a two stage process. Firstly, for each retained MCMC sample, we can sample from the model given those parameters which yields a predicted number of votes for each party in each computational grid cell. Then we aggregate the predicted cell counts in each new electoral ward by summing the cell-level counts, weighted by the proportion of each cell contained in the electoral ward. For each party and region we therefore obtain 1000 realisations of the predictive distribution for the number of votes and declare the victor to be the party with the largest predicted median number of votes. Table 4 gives the number and proportion of seats for each party in 2010, 2015 and predicted for 2020 under the boundary changes. The results show that Labour, at least in this area of the country, is set to benefit from the proposed changes at the expense of the Liberal Democrats.
One thing not clear from Figure 10 is the uncertainty surrounding these predictions. Therefore, we also computed the probability of a victory for each of the parties as the empirical proportion of times each party took the majority share under each of the 1000 simulated aggregated vote counts; these are shown in Figure 11.
The purpose of this analysis was to illustrate how our methods can be used to deliver inference on a dataset in which predicting the counts on a new partition is important. A more rigorous analysis of the data might seek to model competition for votes between the parties and in particular, some care should be taken in interpreting the predicted probabilities in Figure 11, as the main assumption is independence in the predicted number of votes. One way of extending this model would be to use a multivariate spatiotemporal LGCP to jointly model the counts for all parties simultaneously.
9 Discussion
In this article, we have introduced a method for delivering continuous inference from aggregated data in which the aggregation units potentially overlap, are uncertain and change over time. We have investigated how inferences are affected by differing amounts of overlap and have illustrated our methods on two real-world examples. Our proposed method could find wide application in the modelling of aggregated count data, since these types of dataset are very common. It is also commonplace for administrative boundaries to change over time, and though we have not analysed such a dataset in the present article, by utilising a common inferential grid at each time point our method offers an elegant solution to this issue. The main strength of our proposed approach is that we have attacked the problem of inference for aggregated data from the modelling perspective: we advocate modelling the stochastic process that is hypothesised to have generated the data, rather than by modelling the data as it happens to have been stored.
We have used the term ‘continuous’ to describe the type of inference we get from our proposed model and method of analysis, even though in truth we deliver inference on merely a different discretisation of space. We believe the use of the word ‘continuous’ can be justified on the basis that we are at liberty, computationally permitting, to make our inferential grid as fine as we like, and in the limit, we approach continuity. The same cannot of course be said of the covariates. In most cases, covariates can only be measured on a finite division of space (e.g. a raster image, or a multipolygon), thus our ability to make truly continuous inferences is in some sense dictated by the level of discretisation of the covariates. An interesting area for future research would be to introduce additional model hierarchies to handle mapping of the covariates in a smooth manner onto the computational grid. A second reason for describing our methods as ‘continuous’ is that we have employed non-Markovian models for spatial correlation in the field ; the extension of our methods to Markov models or low-rank models of spatial/spatiotemporal correlation is trivial.
One thing we have not fully explored in the current article is the extent to which uncertainty in stochastic boundary definition affects our ability to deliver inference for key model parameters. As mentioned in Section 4, it is clear that in this situation, our ability to estimate other model parameters will be compromised. We conjecture that there is no general solution to this issue, and much of our ability to estimate parameters will depend on the available data and suitability of the model.
Another simple extension of our proposed modelling framework is where we have a mix of point-referenced and aggregated data. Recall that in the Gibbs scheme, we alternately draw from: and , denoting by the cell counts implied by the point-referenced data, we instead draw from and . The first of these densities simplifies to and sampling proceeds as before (the cell counts just being ) and the second density also retains its form as a multinomial as long as there is no spatial bias in the sampling of (Diggle et al., 2010).
The main disadvantage of our method is currently computational cost. However, many aspects of our proposed algorithm can be performed in parallel, including the fast Fourier transform and obviously matrix multiplications. We have already developed CUDA GPU code to implement inference for the point-referenced spatial log-Gaussian Cox process with some success: the parallel implemetation being just over 10–30 times faster than our R implementation (depending on the size of the inferential grid and GPU availability). We are currently developing CUDA code and an R interface to deliver inference for the models described in the present article.
Acknowledgements
The authors would like to thank Davis Mumbengegwi and Joyce Auala from The University of Namibia for use of the malaria case data in Section 7 and Emanuele Giorgi for providing feedback on a draft of the manuscript.
In our R implementation, the packages sp and rgeos have been particularly important for handling spatial geometries and computing intersections between polygons (Pebesma and Bivand, 2005; Bivand and Rundel, 2013, 2016).
The IMD and LSOA data in this article are subject to the Open Government License: http://www.nationalarchives.gov.uk/doc/open-government-licence/.
Contains National Statistics data ©Crown copyright and database right 2001, 2004, 2007
Contains Ordnance Survey data ©Crown copyright and database right 2001, 2004, 2007
Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Akin and Hutchinson (1999) Akin, J. and P. Hutchinson (1999). Health-care facility choice and the phenomenon of bypassing. Health Policy and Planning 14 (2), 135–151.
- 2Baddeley et al. (2015) Baddeley, A., E. Rubak, and R. Turner (2015). Spatial Point Patterns: Methodology and Applications with R . Chapman & Hall/CRC Interdisciplinary Statistics.
- 3BBC (2016 a) BBC (2016 a). Boundary changes: Why UK’s political map is being re-drawn. http://www.bbc.co.uk/news/uk-politics-37337175 .
- 4BBC (2016 b) BBC (2016 b). Boundary reform: How is your area affected? http://www.bbc.co.uk/news/uk-politics-32695546 .
- 5Beale et al. (2008) Beale, L., J. Abellan, S. Hodgson, and L. Jarup (2008). Methodologic issues and approaches to spatial epidemiology. Environmental Health Perspectives 116 (8), 1105–1110.
- 6Berrocal et al. (2010) Berrocal, V. J., A. E. Gelfand, and D. M. Holland (2010, 12). A bivariate space–time downscaler under space and time misalignment. The Annals of Applied Statistics 4 (4), 1942–1975.
- 7Besag (1974) Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological) 36 (2), 192–236.
- 8Besag and Kooperberg (1995) Besag, J. and C. Kooperberg (1995). On conditional and intrinsic autoregression. Biometrika 82 (4), 733–746.
