A functional-model-adjusted spatial scan statistic
Michael Genin, Mohamed-Salem Ahmed

TL;DR
This paper presents a novel spatial scan statistic that adjusts for longitudinal confounders using functional models, improving cluster detection accuracy in spatial epidemiology.
Contribution
It introduces a functional-model-adjusted spatial scan statistic based on generalized functional linear models, applicable to various probability models, enhancing covariate adjustment in cluster detection.
Findings
Method is equivalent to conventional spatial scan with covariate adjustment in Poisson models.
Simulation shows improved accuracy over existing methods.
Applied to mortality data with unemployment rate as confounder.

| Model | Cluster | # départements | Relative risk | LLR | P-value |
|---|---|---|---|---|---|
| Model 1 | 1 | 4 | 1.28 | 4648.24 | 0.001 |
| | 2 | 7 | 0.79 | 3225.33 | 0.001 |
| | 3 | 9 | 0.86 | 2939.85 | 0.001 |
| | 4 | 4 | 1.28 | 1131.82 | 0.001 |
| | 5 | 3 | 1.18 | 856.63 | 0.001 |
| | 6 | 2 | 0.80 | 827.15 | 0.001 |
| Model 2 | 1 | 12 | 1.19 | 1531.91 | 0.001 |
| | 2 | 8 | 0.86 | 1458.09 | 0.001 |
| | 3 | 3 | 0.85 | 1120.88 | 0.001 |
| | 4 | 3 | 1.20 | 1091.54 | 0.001 |
| | 5 | 2 | 0.77 | 1090.26 | 0.001 |
| | 6 | 3 | 1.08 | 405.53 | 0.001 |
| Model 3 | 1 | 6 | 0.86 | 916.17 | 0.001 |
| | 2 | 3 | 1.17 | 795.61 | 0.001 |
| | 3 | 4 | 1.19 | 511.30 | 0.001 |
| Model 4 | 1 | 3 | 1.24 | 1455.90 | 0.001 |
| | 2 | 2 | 0.74 | 1398.57 | 0.001 |
| | 3 | 5 | 1.21 | 917.15 | 0.001 |

| Relative risk of the true cluster | Cluster | Univariate: power | Univariate: TP | Univariate: FP | Multivariate: power | Multivariate: TP | Multivariate: FP | Functional: power | Functional: TP | Functional: FP |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | True | 0.012 | 0.014 | 0.100 | 0.075 | 0.169 | 0.104 | 0.083 | 0.183 | 0.109 |
| | Fake | 0.914 | 0.853 | 0.022 | 0.039 | 0.106 | 0.110 | 0.041 | 0.108 | 0.116 |
| 1.2 | True | 0.034 | 0.037 | 0.103 | 0.143 | 0.274 | 0.100 | 0.142 | 0.277 | 0.099 |
| | Fake | 0.893 | 0.857 | 0.026 | 0.037 | 0.090 | 0.117 | 0.041 | 0.097 | 0.116 |
| 1.4 | True | 0.093 | 0.084 | 0.105 | 0.344 | 0.494 | 0.084 | 0.347 | 0.488 | 0.081 |
| | Fake | 0.862 | 0.813 | 0.037 | 0.044 | 0.074 | 0.123 | 0.039 | 0.069 | 0.120 |
| 1.6 | True | 0.248 | 0.224 | 0.102 | 0.656 | 0.709 | 0.052 | 0.662 | 0.712 | 0.055 |
| | Fake | 0.760 | 0.725 | 0.055 | 0.034 | 0.043 | 0.114 | 0.036 | 0.046 | 0.117 |
| 1.8 | True | 0.453 | 0.424 | 0.084 | 0.869 | 0.841 | 0.029 | 0.890 | 0.853 | 0.026 |
| | Fake | 0.588 | 0.553 | 0.072 | 0.021 | 0.023 | 0.105 | 0.015 | 0.018 | 0.104 |
| 2.0 | True | 0.676 | 0.642 | 0.054 | 0.948 | 0.898 | 0.018 | 0.977 | 0.929 | 0.013 |
| | Fake | 0.366 | 0.345 | 0.081 | 0.008 | 0.011 | 0.100 | 0.006 | 0.007 | 0.099 |
Abstract
This paper introduces a new spatial scan statistic designed to adjust cluster detection for longitudinal confounding factors indexed in space. The functional-model-adjusted statistic was developed using generalized functional linear models in which longitudinal confounding factors were considered to be functional covariates. A general framework was developed for application to various probability models. Application to a Poisson model showed that the new method is equivalent to a conventional spatial scan statistic that adjusts the underlying population for covariates. In a simulation study with univariate and multivariate models, we found that our new method adjusts the cluster detection procedure more accurately than other methods. Use of the new spatial scan statistic was illustrated by analysing data on premature mortality in France over the period from 1998 to 2013, with the quarterly unemployment rate as a longitudinal confounding factor.
Keywords: cluster detection, confounding factor, functional data analysis, longitudinal data, generalized functional linear model.
1 Introduction
In many fields of science, cluster detection methods are useful tools for objectively identifying aggregations of events in time and/or space and for determining their statistical significance. In the field of epidemiology, researchers often seek to detect spatial clusters in which the risk of disease is significantly higher or lower than in the rest of the geographical area studied. For diseases of unknown etiology, information on the presence and nature of clusters provides clues to the disease mechanism (especially in terms of environmental factors), and can facilitate the design of subsequent individual-level observational studies.
Over the last few decades, several cluster detection methods have been developed. In particular, spatial scan statistics (originally proposed by Kulldorff, on the basis of Bernoulli and Poisson models (Kulldorff, 1997, 1999)) are powerful methods for detecting spatial clusters with a variable scanning window size and in the absence of pre-selection bias, and then testing the clusters’ statistical significance. Following on from Kulldorff’s initial work, several researchers have adapted spatial scan statistics to other spatial data distributions, such as ordinal (Jung et al., 2007), normal (Kulldorff et al., 2009), exponential (Huang et al., 2007) and Weibull (Bhatt & Tiwari, 2014) models. Spatial scan statistics have been extended to the multivariate framework by Kulldorff et al. (2007), Neill (2012), and, most recently, Cucala et al. (2017, 2018).
One of the main problems in cluster detection is the need to adjust for covariates. If a covariate is a confounding factor associated with the event of interest, and is not homogeneously distributed over a geographical area, a cluster analysis can generate clusters in which the covariate (and not the event of interest) predominates. For example, clusters of cardiovascular disease must be adjusted for social deprivation, which is a strong confounding factor (Rothman et al., 2008). In the absence of adjustment, the analysis may highlight very deprived areas that have a higher number of disease cases but are not epidemiologically relevant because of confounding bias. In the literature, several covariate adjustment techniques have been applied to spatial scan statistics. For the Poisson model, Kulldorff et al. (1997) originally suggested the use of (i) indirect standardization methods to adjust for qualitative covariates, and (ii) regression methods to adjust for quantitative covariates and to estimate the expected number of cases per spatial unit. For the Bernoulli model, Kulldorff et al. (2007) suggested using several datasets for each stratum of a qualitative covariate. Klassen et al. (2005) applied multilevel regression methods to adjust for quantitative covariates. More recently, Jung (2009) used generalized multivariate linear models (GMLMs) to build spatial scan statistics that incorporated covariates. The latter approach is particularly valuable because it merges spatial scan statistics developed for different probability models into a single framework.
However, this approach has limitations when dealing with longitudinal covariates. In a purely spatial analysis, there are two possible scenarios for longitudinal data: (i) the outcome variable and the covariates are observed on the same time scale (e.g. one observation per year for both) over a long period of time, or (ii) the outcome variable and the covariates are observed on different time scales (e.g. one observation per year for the outcome, and one observation per month for covariates). In Jung’s approach, a simplistic way of managing longitudinal covariates in both scenarios is to summarize the data by averaging them (or determining the median) over the entire time period. However, this may lead to significant information loss and a decrease in the quality of covariate adjustment. Alternatively, the value of the confounding factor at each measurement time can be included in the model as a separate covariate, as long as the measurement times are the same for all the spatial units (in order to limit the number of missing values). However, this approach may create a high-dimensional vector of coefficients and introduce multicollinearity (James, 2002).
In the present work, we developed a spatial scan statistic based on functional data analysis (FDA) (Ramsay & Silverman, 2005). Firstly, our approach allows longitudinal data to be considered as the realization of a random function over an interval containing discrete time points. It should be noted that the random function can be observed at different, unequally spaced time points for each location. Secondly, our approach replaces the high-dimensional vector of coefficients by a parameter function to be estimated. These two characteristics make it possible to overcome both of the above-mentioned problems, i.e. the need for identical measurement times and high dimensionality.
The present article is organized as follows. Section 2 describes the methodological aspects of the functional-model-adjusted spatial scan statistic (FMASSS). In Section 3, the FMASSS is applied to a Poisson model and is shown to be equivalent to a conventional spatial scan statistic in which the underlying population is adjusted for covariates. Section 4 presents both the design and the results of a simulation study. Section 5 describes the application of the FMASSS to epidemiologic data and the detection of clusters of high and low premature mortality in France. Lastly, the results are discussed in Section 6.
2 Functional-model-adjusted spatial scan statistic
Let us consider that at each location (one of different spatial locations included in ), we observe an outcome variable and two types of covariate: is a random vector and is the realization of a real-valued stochastic process at time points (i.e. longitudinal data). Hereafter, all observations are considered to be independent; this is a classical assumption in scan statistics. A spatial scan statistic usually denotes the maximum concentration observed among a collection of potential clusters denoted by . It is used as a test statistic for areas in which the concentration might be abnormally high or abnormally low (Cressie, 1977). Kulldorff (1997) introduced a spatial scan statistic based on a generalized likelihood ratio; this enables the comparison of concentrations in potential clusters of different sizes, and takes account of heterogeneity in the underlying population. Without loss of generality and in line with Kulldorff’s work (Kulldorff, 1997), we shall focus on variable-size, circular clusters. Hence, the set of potential clusters is built so that (i) each potential cluster is centered at a particular location, and (ii) the radius is limited so that the corresponding cluster cannot cover more than 50% of the studied region. It should be noted that many other configurations (such as elliptical clusters (Kulldorff et al., 2006) and graph-based clusters (Cucala et al., 2013)) have been suggested.
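The construction of variable-size circular candidate clusters can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `circular_candidates` is hypothetical, and the 50% cap is assumed here to apply to the number of locations (it could equally be applied to the population or to the area).

```python
import math

def circular_candidates(coords, max_fraction=0.5):
    """Enumerate circular candidate clusters as sets of location indices.

    coords: list of (x, y) coordinates, one per location.
    max_fraction: largest share of locations a window may contain
    (assumption: the 50% cap is counted in locations).
    """
    n = len(coords)
    max_size = max(1, int(max_fraction * n))
    candidates = set()
    for centre in coords:
        # Locations ordered by distance from the centre (nearest first).
        order = sorted(range(n), key=lambda j: math.dist(centre, coords[j]))
        # Growing the radius adds the next nearest location each time.
        # Ties in distance are not grouped in this simplified sketch.
        for size in range(1, max_size + 1):
            candidates.add(frozenset(order[:size]))
    return candidates

# Four locations on a line: windows hold at most 2 of the 4 locations.
windows = circular_candidates([(0, 0), (1, 0), (3, 0), (10, 0)])
```

Storing each window as a `frozenset` removes duplicates that arise when two centres generate the same set of locations.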
Conventionally, the spatial scan statistic can be defined as the potential cluster that maximizes a log-likelihood ratio (LLR) over namely the most likely cluster (MLC). This LLR is based on a null hypothesis (the absence of a cluster) and an alternative hypothesis (the presence of a cluster). If confounding covariates ( and ) are present, the MLC can be revealed by these factors alone. Thus, the spatial scan statistic has to be adjusted with respect to these covariates. In Jung’s GMLM approach (Jung, 2009), and will be integrated as separate covariates. However, as mentioned in the Introduction, this approach can be limited by information loss and high dimensionality. Hence, we developed an FMASSS that considers as realizations of a random function , where is an interval containing the discrete time points. The random function is approximated from the longitudinal observations . More generally, a basis of functions is considered with , and the random function is assumed to belong to the space generated by this basis
[TABLE]
where the matrix basis coefficients with elements can be estimated using either an interpolation method (if the measurements are observed without error, i.e. ), or an ordinary (or penalized) least-squares method (if the measurements are observed with some error, i.e. ). The choice of the basis of functions depends on the shape of the longitudinal data. For instance, a B-spline basis is the most suitable choice for non-periodic functional data, a Fourier basis can be useful for periodic functional data, while a wavelet basis can be appropriate for functional data with discontinuities or changes in behavior (see Ramsay & Silverman (2005) for more details).
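As an illustration of the least-squares route, the sketch below estimates the basis coefficients of one location's curve from noisy, unequally spaced observations. It uses a Fourier basis only because it is short to write with NumPy; the function names, the number of harmonics and the period are assumptions made for the example, not choices taken from the paper.

```python
import numpy as np

def fourier_design(t, n_harmonics, period):
    """Evaluate a Fourier basis (constant, then sin/cos pairs) at times t."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * np.pi * k / period
        cols.append(np.sin(w * t))
        cols.append(np.cos(w * t))
    return np.column_stack(cols)

def fit_basis_coefficients(t, x, n_harmonics=2, period=1.0):
    """Ordinary least-squares basis coefficients for one location's curve."""
    design = fourier_design(t, n_harmonics, period)
    coef, *_ = np.linalg.lstsq(design, np.asarray(x, dtype=float), rcond=None)
    return coef

# Each location may have its own observation times: only its own
# design matrix changes, the basis stays the same.
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 1.0, size=40))
x_obs = 1.0 + 2.0 * np.sin(2 * np.pi * t_obs) + rng.normal(scale=0.05, size=40)
coef = fit_basis_coefficients(t_obs, x_obs)
```

Stacking the coefficient vectors of all locations row by row gives the coefficient matrix referred to in the text. A penalized fit would add a roughness penalty to the least-squares criterion.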
Once the random function has been built for each location $s_i$, one can use the generalized functional linear model (Müller & Stadtmüller, 2005) to adjust the spatial scan statistic with respect to the covariate and the random function . To this end, let and assume that the conditional mean of the outcome variable , (with respect to the covariate information and the potential cluster) is defined by the following revised generalized functional linear model:
[TABLE]
where is a binary covariate equal to 1 if the location belongs to and equal to 0 otherwise, and where is a known increasing link function. The parameters of interest are the intercept , which refers to the intensity of the cluster, the coefficients associated with the vector of covariates , and the parameter function , which is a smoothing function that can be considered as a generalization of a slope function. The parameters and are fixed inside and outside the potential cluster, which means that the distributions of the covariates and are invariant with respect to the clustering hypotheses. In other words, the conditional mean of inside is fully characterized by its intensity . It should be noted that can be interpreted as the covariate-adjusted relative risk for individuals within the potential cluster , relative to the risk for those outside it. The clustering hypotheses can therefore be expressed as follows:
[TABLE]
Given that is an increasing function, means that the mean of inside is higher (or lower) than the mean of outside .
As mentioned above, the spatial scan statistic is based on the likelihood ratio between these two hypotheses. Thus, in order to provide a general framework that can handle various models (Bernoulli, normal, Poisson, etc.), one needs to assume that the outcome variable has a known, parametrized, conditional log-likelihood function:
[TABLE]
where is a positive function defining the variance of .
Below, we describe the estimation procedure under each hypothesis and then introduce the FMASSS.
Estimation under the null hypothesis. Under the null hypothesis, model (2) is reduced to a generalized functional linear model (GFLM):
[TABLE]
We used the popular estimation procedure developed by Müller & Stadtmüller (2005). It is based on a truncation strategy in which the random function and the parameter function are projected into a space of functions generated by a basis of functions with an arbitrary dimension. Let be the eigenbasis associated with the functional principal component analysis (PCA) of the functional data . For a fixed , the parameter function is approximated by its projection in the space of functions generated by the first eigenfunctions:
[TABLE]
Using this approach, Müller & Stadtmüller (2005) suggested that the conditional mean (4) could be approximated by its truncated version :
[TABLE]
where and is the coefficient vector of the random function in the eigenbasis, which is given by:
[TABLE]
Using (5), we defined the following truncated log-likelihood function under
[TABLE]
It should be noted that (6) is a log-likelihood function associated with a GMLM whose covariates are and , where , and are the maximum likelihood estimators (MLEs) of and respectively. Consequently, the MLE of the parameter function is given by:
[TABLE]
The quality of the estimation depends principally on , i.e. the number of eigenfunctions used in the truncation strategy. This crucial parameter can be consistently chosen by inspecting the Akaike information criterion (AIC) related to (6), as proved by Müller & Stadtmüller (2005). Note that we used a pre-selected based on the cumulative inertia. Indeed, we focused on the selection of a (using the AIC) with a cumulative inertia value below a given threshold (95% in the present case) (Ahmed et al., 2018).
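The two-stage choice of the truncation parameter can be sketched as follows. This is one possible reading of the procedure, stated as an assumption: the candidate values run from 1 up to the first value whose cumulative inertia reaches the threshold, and the AIC then picks among them. The function names and the `loglik_by_k` input (maximized truncated log-likelihoods, one per candidate) are hypothetical.

```python
from itertools import accumulate

def candidate_truncations(eigenvalues, threshold=0.95):
    """Candidate truncation values 1..K_max, where K_max is the first value
    whose cumulative inertia reaches the threshold (assumed reading)."""
    total = sum(eigenvalues)
    for k, partial in enumerate(accumulate(eigenvalues), start=1):
        if partial / total >= threshold:
            return list(range(1, k + 1))
    return list(range(1, len(eigenvalues) + 1))

def select_by_aic(loglik_by_k, n_other_params=1):
    """Pick the truncation value minimising AIC = 2 * n_params - 2 * loglik.

    n_other_params counts the intercept and the non-functional covariates.
    """
    aic = {k: 2 * (k + n_other_params) - 2 * ll for k, ll in loglik_by_k.items()}
    best = min(aic, key=aic.get)
    return best, aic

# Eigenvalues from a functional PCA and illustrative log-likelihoods.
ks = candidate_truncations([6.0, 3.0, 0.6, 0.4])
best_k, aic = select_by_aic({1: -120.0, 2: -110.0, 3: -109.5})
```

In the example, the third component improves the log-likelihood by only 0.5, which does not offset the AIC penalty of 2 for the extra coefficient.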
Estimation under the alternative hypothesis. Since the parameters and must be independent of the potential cluster, their estimates under will be fixed in the alternative hypothesis . This means that under , covariate effects are invariant inside and outside the potential cluster. Hence, one only needs to estimate the parameters and for each . This can be achieved by maximizing the following log-likelihood function with respect to the two scalars:
[TABLE]
with
[TABLE]
Let us consider and , denoting the MLEs of and , respectively. It should be noted that the covariate information is added as an offset, which illustrates the above-mentioned assumption concerning the independence of the potential cluster vs. the covariates.
Functional-model-adjusted spatial scan statistic. Using the MLEs determined under the two hypotheses, the LLR can be defined as follows:
[TABLE]
The MLC is then defined as the potential cluster that maximizes this ratio:
[TABLE]
Hence, the FMASSS is defined as the LLR associated with the MLC:
[TABLE]
Since the distribution of under does not have a closed form, the significance of the MLC is evaluated by Monte-Carlo simulation. Each simulation combines the real data (associated with the covariates) with a random dataset generated for the outcome variable. The latter is simulated using a conditional distribution under (via and ). Let denote the observations of the FMASSS on the simulated datasets. According to Dwass (1957), the p-value of the FMASSS observed in the real data is defined by , where is the rank of in the -sample .
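A rank-based Monte-Carlo p-value of this kind can be computed in a few lines. The sketch below uses the usual convention in which the observed statistic is ranked, largest first, within the sample formed by the observed value and the simulated values; the function name is hypothetical.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank-based Monte-Carlo p-value: rank / (M + 1).

    observed: test statistic computed on the real data.
    simulated: M test statistics computed on datasets generated under H0.
    """
    # Rank of the observed value (1 = largest) among the M + 1 statistics.
    rank = 1 + sum(1 for s in simulated if s >= observed)
    return rank / (len(simulated) + 1)

# With M = 999 simulations, the smallest attainable p-value is 1/1000.
p = monte_carlo_p_value(10.0, [i / 1000 for i in range(999)])
```

Because the p-value is a multiple of 1/(M + 1), 999 simulations cannot report a value smaller than 0.001.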
The FMASSS is built in three steps:
1. Construction of functional data, and dimension reduction
   - Construct the functional data by using a suitable basis of functions .
   - Apply a functional PCA to the constructed functions . This is equivalent to a multiple PCA on the matrix where is the matrix with elements (Escabias et al., 2004)
     [TABLE]
     Thus, the eigenfunctions are defined by where are the elements of the matrix , where is the eigenvector matrix associated with a multiple PCA of the matrix . Moreover, the coefficients of the functional data in the eigenbasis are given by the matrix .
   - Choose the optimal truncation parameter , i.e. one that minimizes the AIC associated with models with the log-likelihood function defined in (6).
2. Computation of the observed FMASSS
   - Use to estimate , and under by using the log-likelihood function defined in (6).
   - For each potential cluster , find and , that maximize (7) by adding the and as offsets, then calculate the associated . Moreover, identify the MLC and its FMASSS , over the set of potential clusters.
3. Monte-Carlo simulation
   - Apply the Monte-Carlo hypothesis testing procedure described above.
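The dimension-reduction part of the first step can be sketched with NumPy. The names are assumptions made for this example: `A` is the matrix of basis coefficients (one row per location) and `W` is the matrix of inner products between basis functions, taken to be positive definite. The curves are centred before the PCA, which is a common convention and may differ from the authors' exact choice.

```python
import numpy as np

def fpca_from_basis(coefs, gram):
    """Functional PCA of curves given by their basis coefficients.

    coefs: (n, K) basis coefficients, one row per location (A).
    gram:  (K, K) inner products of the basis functions (W).
    Returns eigenvalues, eigenfunction coefficients (K, K), scores (n, K).
    """
    A = np.asarray(coefs, dtype=float)
    W = np.asarray(gram, dtype=float)
    # Symmetric square root of W and its inverse via an eigendecomposition.
    w_val, w_vec = np.linalg.eigh(W)
    W_half = (w_vec * np.sqrt(w_val)) @ w_vec.T
    W_half_inv = (w_vec / np.sqrt(w_val)) @ w_vec.T
    # Ordinary PCA of the centred matrix A W^{1/2}.
    M = (A - A.mean(axis=0)) @ W_half
    cov = M.T @ M / A.shape[0]
    eigval, V = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]      # largest variance first
    eigval, V = eigval[order], V[:, order]
    eigfun_coefs = W_half_inv @ V         # eigenfunctions in the original basis
    scores = M @ V                        # coefficients of each curve in the eigenbasis
    return eigval, eigfun_coefs, scores
```

The first columns of `scores` (up to the chosen truncation) would then enter the null model as ordinary covariates, next to the non-functional covariates.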
3 Application to a Poisson model
This section describes the estimation procedure when the data on the outcome variable have a Poisson distribution. Let be the measurement of the underlying at-risk population associated with the th location . The Poisson model is characterized by the following link function and the conditional log-likelihood (3):
[TABLE]
with
[TABLE]
It should be noted that multiplication by makes it possible to take account of the underlying at-risk population as an adjustment covariate. Consequently, is taken as an offset in the model. Let , and be the MLEs under the null hypothesis. It can be shown that the MLE is expressed in the following manner (for details, see the Appendix):
[TABLE]
Note that the can be viewed as the incidence rate under in the adjusted underlying at-risk population rather than in the initial underlying at-risk population .
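The null-hypothesis fit of a Poisson model with the at-risk population as an offset can be sketched with a plain Newton-Raphson iteration. This is an illustrative implementation rather than the authors' code: the function name is hypothetical, and the design matrix `X` is assumed to hold a column of ones first, followed by the covariates (and, in the functional setting, the retained functional PCA scores).

```python
import numpy as np

def fit_poisson_offset(y, X, log_offset, n_iter=50, tol=1e-10):
    """Maximum-likelihood Poisson regression with a log link and a fixed offset.

    y: (n,) counts; X: (n, p) design matrix whose first column is ones;
    log_offset: (n,) log of the at-risk population.
    Returns the coefficients and the fitted means under H0.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    log_offset = np.asarray(log_offset, dtype=float)
    beta = np.zeros(X.shape[1])
    # Start the intercept at the log crude rate to keep Newton steps stable.
    beta[0] = np.log(y.sum() / np.exp(log_offset).sum())
    for _ in range(n_iter):
        mu = np.exp(log_offset + X @ beta)
        score = X.T @ (y - mu)                 # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])         # Fisher information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(log_offset + X @ beta)
    return beta, mu
```

Because the model contains an intercept, the fitted means sum to the total number of observed cases at convergence. The fitted means play the role of covariate-adjusted expected counts in the scan step.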
As detailed in Section 2, the estimation procedure under consists in maximizing the log-likelihood (7), which is expressed as follows:
[TABLE]
taking its maximum at:
[TABLE]
It should be noted that is the relative risk associated with the potential cluster after adjusting for the underlying at-risk population .
Next, is given by:
[TABLE]
where
[TABLE]
It should be noted that (12) is equivalent to the LLR proposed by Kulldorff (1997) for a Poisson model, except that the adjusted underlying at-risk population is taken into account (rather than ). In other words, adjustment for covariates is equivalent to considering a Poisson model with an underlying at-risk population adjusted under the null hypothesis.
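The equivalence with Kulldorff's Poisson statistic suggests a compact computation: once covariate-adjusted expected counts are available, the log-likelihood ratio of a candidate cluster takes the standard Poisson scan form, with the adjusted expected counts in place of the raw population. The sketch below assumes that form; the function name and argument names are hypothetical.

```python
import math

def poisson_llr(y, expected, in_cluster):
    """Kulldorff-type Poisson log-likelihood ratio for one candidate cluster.

    y: observed counts per location.
    expected: covariate-adjusted expected counts under H0 (rescaled here so
    that they sum to the total number of cases).
    in_cluster: booleans, True for locations inside the candidate cluster.
    Returns (llr, relative_risk); high-risk and low-risk clusters both score.
    """
    total = float(sum(y))
    scale = total / sum(expected)
    o_in = float(sum(v for v, w in zip(y, in_cluster) if w))
    e_in = scale * sum(v for v, w in zip(expected, in_cluster) if w)
    o_out, e_out = total - o_in, total - e_in

    def term(o, e):
        # Convention 0 * log(0) = 0 when no case is observed.
        return o * math.log(o / e) if o > 0 else 0.0

    llr = term(o_in, e_in) + term(o_out, e_out)
    rr = (o_in / e_in) / (o_out / e_out) if o_out > 0 else math.inf
    return llr, rr

# One location with twice its expected count, three with two thirds of it.
llr, rr = poisson_llr([30, 10, 10, 10], [15, 15, 15, 15], [True, False, False, False])
```

In the example, the observed-to-expected ratio is 2 inside the window and 2/3 outside, so the estimated relative risk is 3.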
4 Simulation study
We simulated a cluster detection procedure in order to compare the quality of adjustment for a longitudinal confounding factor in three spatial scan statistic models: a univariate model, a multivariate model, and a functional model.
4.1 Design of the simulation
Artificial datasets were generated according to Poisson models by using the geographic locations of the French administrative areas (départements, as shown in Figure 6 in the Supplementary Material) and population data from the French national census database (Institut National de la Statistique et des Etudes Economiques, INSEE). Each location was defined as the département’s administrative center. Two types of non-overlapping cluster (each containing 8 départements) were defined and simulated for each artificial dataset. The first was entirely characterized by the cluster intensity , namely the true cluster (the areas in green in Figure 6), and the second was characterized solely by the effect of the functional covariate, namely the fake cluster (the areas in red in Figure 6).
Generation of the artificial datasets. The random functions were simulated as the realization of the following process in the interval :
[TABLE]
where , is uniform, and are uncorrelated, normally distributed random variables. A total of 94 curves were simulated with respect to the random function (Figure 1, left panel) and used to generate data from the following Poisson model:
[TABLE]
where corresponds to the at-risk population in the th département and for départements located in the true cluster. Firstly, an intercept was chosen to ensure a disease incidence of approximately in the absence of a cluster and the absence of a confounding covariate. Secondly, the confounding functional covariate was introduced into the model using in such a way that the mean value of the outcome was twice as high inside the fake cluster as outside. Thirdly, different values of the true cluster intensity were considered and expressed in terms of the relative risk: .
Comparison of three models. To illustrate the performance of the functional approach to adjustment, we compared three models. We considered that for each location, the functional covariate was only observed at 70 time points equally spaced throughout the interval . Below, the term “longitudinal data” refers to the realization of the functional covariate at these 70 time points.
In the univariate model, the outcome variable was adjusted by a single covariate (the average of the longitudinal data). In the multivariate model, the outcome variable was adjusted by 70 random covariates with the values of the 70 time points by using Jung’s method (Jung, 2009). In order to deal with the strong collinearity between these 70 covariates, a multiple PCA was applied by using the AIC-based selection method described in Section 2. Lastly, in the functional model, the outcome variable was adjusted by using the smoothed curves as a functional covariate. The latter was constructed from the longitudinal data by using a cubic B-spline basis of functions, as defined by 13 equally spaced knots in the interval (the right panel in Figure 1).
For each value of the cluster intensity, artificial datasets were simulated. The three models were compared with regard to three distinct criteria: the power to detect a significant cluster (true or fake), the true-positive (TP) rate, and the false-positive (FP) rate. The power of each model was defined as the proportion of datasets highlighting a significant cluster (a true or fake cluster), with a type I error of and Monte-Carlo simulations. The TP and FP rates were calculated according to the method of Cucala et al. (2018).
4.2 Results of the simulation study
The results of the simulation study are shown in Figure 2 (see Table 2 in the supplementary material for more details). The adjustment based on a univariate model (with the average of the longitudinal data as a covariate) failed to detect the true cluster as the MLC. This was particularly the case for cluster intensity values that were low or moderate, relative to the intensity of the fake cluster. The univariate model detected the fake cluster as the MLC, as illustrated by the curves for the power and the TP and FP rates in Figure 2. The adjustments based on the functional and multivariate models did not differ significantly with regard to the power or the TP and FP rates for detecting the true cluster. The functional model performed slightly better for high cluster intensities (). As expected, the power of both models increased with the cluster intensity. It can be seen that both the multivariate model and the functional model seldom detected the fake cluster as the MLC.
5 Application to epidemiologic data
5.1 Premature mortality and related confounding factors
We considered data provided by the INSEE on premature mortality in France between 1998 and 2013. Premature mortality was defined as death before the age of 65. For each of the 94 French départements (administrative areas) and for the period between 1998 and 2013, the mean premature mortality rate was defined as the number of persons who died before the age of 65, divided by the mean number of persons aged under 65. Hereafter, the outcome variable refers to the number of premature deaths per département between 1998 and 2013. The spatial distribution of premature mortality in France is shown in Figure 7 (supplementary materials).
It is known that premature mortality affects men more than women, and is correlated with socio-economic status: the most deprived are more likely to die young (Stringhini et al., 2017). Thus, it is important to adjust the spatial cluster detection analyses for the confounding factors of gender and socio-economic status. To this end, we considered the mean proportion of men aged under 65 over the period from 1998 to 2013 for each département (as provided by the INSEE database). We chose the mean proportion because it did not greatly vary over the 16-year period (see Figure 8 in the supplementary materials). We considered the unemployment rate (in %) for each quarter of the period from 1998 to 2013 as a proxy for socio-economic status, leading to 64 values per département. Figure 3 shows both the spatial distribution of the mean unemployment rate over the entire period and the change over time in the unemployment rate for each of the départements. The mean unemployment rate is spatially heterogeneous. Furthermore, the unemployment rate varied markedly between 1998 and 2013, and thus must be considered as a longitudinal confounding factor.
5.2 Spatial cluster detection
In order to detect spatial clusters of premature mortality, four different Poisson models were considered. Each model was adjusted for gender by introducing the mean proportion of men by département over the period from 1998 to 2013 as a covariate. The four models are described below:
1. Model 1 (the non-adjusted model): no adjustment of the outcome variable for the unemployment rate.
2. Model 2 (the univariate model): adjustment of the outcome variable for the unemployment rate, using the mean rate over the period from 1998 to 2013 by département as a single covariate.
3. Model 3 (the multivariate model): adjustment of the outcome variable for the unemployment rate by considering each of the quarterly values by département for the period from 1998 to 2013 as a covariate. Thus, 64 covariates related to the unemployment rate were introduced into the model.
4. Model 4 (the functional model): adjustment of the outcome variable for the unemployment rate using smoothed rate curves as a functional covariate. The curves were built from the data using a cubic B-spline basis defined by 15 knots in the interval . The least-squares method was used to compute the corresponding coefficients for each random curve.
Each model was used to detect spatial clusters with a high risk of premature mortality (i.e. with a relative risk (RR) > 1) or with a low risk of premature mortality (RR < 1). The MLC was considered, together with secondary clusters that had a high FMASSS value and did not cover the MLC (Kulldorff, 1997). The statistical significance of the detected spatial clusters was evaluated by performing 999 Monte-Carlo simulations, with a type I error of 0.05.
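The selection of secondary clusters can be sketched as a greedy pass over the candidate clusters in decreasing order of their statistic. The rule used here, which keeps a cluster only if it shares no location with any cluster already retained, is a common convention assumed for the example; it is stricter than merely avoiding the MLC. The function name is hypothetical.

```python
def select_non_overlapping(scored_clusters):
    """Keep clusters in decreasing LLR order, skipping any cluster that
    shares a location with a cluster retained earlier.

    scored_clusters: iterable of (llr, set_of_location_ids) pairs.
    The first retained cluster is the most likely cluster (MLC).
    """
    retained, used = [], set()
    ranked = sorted(scored_clusters, key=lambda c: c[0], reverse=True)
    for llr, members in ranked:
        members = set(members)
        if used.isdisjoint(members):
            retained.append((llr, members))
            used |= members
    return retained

clusters = [(8.0, {2, 3}), (10.0, {1, 2}), (3.0, {4, 5}), (5.0, {4})]
kept = select_non_overlapping(clusters)
```

Each retained cluster would then be tested against the Monte-Carlo distribution of the maximum statistic.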
5.3 Results
The statistically significant spatial clusters detected by the non-adjusted and univariate models (models 1 and 2) are presented in Figure 4, and those identified by the multivariate and the functional models (models 3 and 4) are displayed in Figure 5. Detailed information on the spatial clusters is presented in Table 1.
Model 1 identified 6 significant spatial clusters of premature mortality: 3 low-risk clusters (RR: 0.79 to 0.86) and 3 high-risk clusters (RR: 1.18 to 1.28) (top panel in Figure 4). The MLC (Cluster 1, RR=1.28) was located in northern France, and was characterized by a high unemployment rate. Similarly, the first secondary cluster (Cluster 2, RR=0.79) was located in eastern France and was characterized by a low unemployment rate.
Model 2 also identified 6 significant spatial clusters of premature mortality: 3 low-risk clusters (RR: 0.77 to 0.86) and 3 high-risk clusters (RR: 1.08 to 1.20) (bottom panel in Figure 4). Like model 1, model 2 detected the cluster with a high unemployment rate in northern France (Cluster 6, RR: 1.08) and the cluster with a low unemployment rate in eastern France (Cluster 2, RR: 0.86), which emphasizes the poor quality of adjustment when using solely the mean unemployment rate over the study period.
Model 3 detected 3 statistically significant spatial clusters of premature mortality: a low-risk cluster (RR: 0.86) and 2 high-risk clusters (RR: 1.17 and 1.19, respectively) (top panel in Figure 5). It should be noted that the clusters characterized by a high or low unemployment rate (in northern and eastern France, respectively) detected by models 1 and 2 were not detected by model 3. The MLC in model 3 (Cluster 1; RR: 0.86) highlighted significant heterogeneity in the unemployment rates because it included a département with a high unemployment rate and a département with a low unemployment rate.
Model 4 highlighted 3 statistically significant spatial clusters of premature mortality: 1 low-risk cluster (RR: 0.74) and 2 high-risk clusters (RR: 1.21 and 1.24, respectively) (bottom panel in Figure 5). These three clusters are characterized by unemployment rate curves close to the average curve (in green). This result shows that the cluster detection was well adjusted for the unemployment rate.
6 Discussion
Here, we developed an FMASSS in order to adjust cluster detection for longitudinal confounding factors in a purely spatial analysis. In other words, we addressed the issue of adjusting a spatial scan statistic for repeatedly measured covariates whose values vary over time. The FMASSS was derived by modeling the longitudinal confounding factor as a random function. The corresponding basis of functions depends principally on the nature of the longitudinal data. One advantage of using a random function is its consideration of the entire set of longitudinal data, rather than a rough approximation by a statistical indicator such as the mean (which is often the case in spatial epidemiological studies). Furthermore, this functional approach makes it possible to overcome (i) the missing data problem related to the difference in measurement times between spatial units, and (ii) the high dimensionality inherently associated with multivariate approaches when longitudinal data are measured at many time points. Our approach was built into a general framework for use with various parametric models (Bernoulli, Gaussian, and Poisson models, etc.). For a Poisson model, it has been shown that the FMASSS is equivalent to Kulldorff’s classical spatial scan statistic in an adjusted population (Kulldorff, 1997).
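The Poisson equivalence can be illustrated with a minimal sketch of a Kulldorff-type log-likelihood ratio evaluated on covariate-adjusted expected counts. The function name, the two-sided form of the statistic and the toy numbers are assumptions made for illustration; in an actual analysis, the adjusted expected counts would come from the null model fitted with the functional covariate.

```python
import numpy as np

def poisson_scan_llr(cases, expected, in_zone):
    """Two-sided Poisson log-likelihood ratio for one candidate zone.

    cases:    observed counts per spatial unit
    expected: covariate-adjusted expected counts (any positive scale)
    in_zone:  boolean mask, True for units inside the candidate zone
    """
    cases = np.asarray(cases, dtype=float)
    expected = np.asarray(expected, dtype=float)
    in_zone = np.asarray(in_zone, dtype=bool)

    total = cases.sum()
    # Rescale so that the expected counts sum to the observed total.
    expected = expected * total / expected.sum()

    o_in, e_in = cases[in_zone].sum(), expected[in_zone].sum()
    o_out, e_out = total - o_in, total - e_in

    def term(o, e):
        # Convention: 0 * log(0) = 0.
        return 0.0 if o == 0 else o * np.log(o / e)

    return term(o_in, e_in) + term(o_out, e_out)

cases = [30, 10, 10, 10]
zone = [True, False, False, False]

# Unadjusted: equal expected counts, so the first unit looks like a cluster.
llr_unadjusted = poisson_scan_llr(cases, [1, 1, 1, 1], zone)
# Adjusted: the confounder explains the excess, and the evidence vanishes.
llr_adjusted = poisson_scan_llr(cases, [3, 1, 1, 1], zone)
```

In this toy example, the apparent excess in the first unit disappears once the expected counts reflect the confounder, which is the behaviour one hopes for from a well-adjusted scan statistic.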
We next simulated and compared different ways of adjusting the spatial scan statistic for longitudinal confounders. The univariate model did not adjust the data well, and detected a spurious cluster when the cluster intensity was weak or moderate. In contrast, the multivariate and functional models were both able to detect the true cluster with high power. The functional model was slightly better than the multivariate model for high cluster intensities. It should be noted that this general power equivalence for the two latter models is partly due to the design of the simulation study. In fact, the simulation represented an ideal situation because the measurement times for the longitudinal data were the same in all the spatial units; hence, there were no missing data in the multivariate model.
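The point about measurement times can be sketched as follows: when each spatial unit's curve is projected onto a common basis, every unit yields a coefficient vector of the same length, even if the units were observed at different times. The Fourier basis, the number of basis functions, the rescaling of time to [0, 1] and the function names are assumptions for illustration only; the appropriate basis depends on the nature of the longitudinal data.

```python
import numpy as np

def fourier_basis(t, n_basis):
    """Constant term plus sine/cosine pairs on [0, 1]; shape (len(t), n_basis)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < n_basis:
        cols.append(np.sin(2 * np.pi * k * t))
        if len(cols) < n_basis:
            cols.append(np.cos(2 * np.pi * k * t))
        k += 1
    return np.column_stack(cols)

def basis_coefficients(times, values, n_basis=5):
    """Least-squares coefficients of one unit's curve in the common basis."""
    design = fourier_basis(times, n_basis)
    coef, *_ = np.linalg.lstsq(design, np.asarray(values, dtype=float), rcond=None)
    return coef

def curve(t):
    # Hypothetical confounder trajectory shared by two spatial units.
    return 2.0 + np.sin(2 * np.pi * t)

rng = np.random.default_rng(1)
t_regular = np.linspace(0.0, 1.0, 64)             # e.g. 64 regular time points
t_irregular = np.sort(rng.uniform(0.0, 1.0, 40))  # a unit observed at other times

coef_a = basis_coefficients(t_regular, curve(t_regular))
coef_b = basis_coefficients(t_irregular, curve(t_irregular))
```

Both units end up with the same five coefficients, although a multivariate model using the raw measurements would see 64 values for one unit and 40 differently timed values for the other.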
These models were applied to the detection of spatial clusters of premature mortality in France over the period from 1998 to 2013. The proportion of men by département and the unemployment rates for each quarter of the study period (64 values per département) were considered as confounding variables. The clusters considered to be significant in the univariate model (based on the mean unemployment rate over the entire study period) were characterized by unemployment rates that were far from the mean. This finding highlighted the univariate model’s poor ability to adjust for a longitudinal confounding factor summarized as the mean. In the multivariate model, the MLC also included départements with unemployment rates that were far from the mean value, which again showed that the adjustment was not optimal. In contrast, the spatial clusters of premature mortality detected by the functional model had unemployment rates that were very close to the mean, which testifies to high-quality adjustment for the longitudinal confounding factor. In the present application, it would have been interesting to adjust for environmental factors that are usually measured daily or weekly. The new method presented here is very well suited to this type of longitudinal data.
It should be borne in mind that the new FMASSS deals with round-shaped clusters only (the simplest case), whereas clusters may be elongated in some situations, such as the aggregation of cases of water-borne disease along a river. However, the FMASSS can easily be extended to other spatial cluster shapes, such as elliptical clusters (Kulldorff et al., 2006), graph-based clusters (Cucala et al., 2013) or (for a spatiotemporal framework) cylindrical clusters (Kulldorff et al., 2005).
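For readers unfamiliar with round-shaped scanning windows, the sketch below enumerates circular candidate zones by growing a disc around each unit's centroid. The function name, the cap on zone size and the handling of equidistant units (ties are broken by index rather than included together) are simplifying assumptions, not a description of the implementation used in this work.

```python
import numpy as np

def circular_zones(coords, max_size):
    """Enumerate circular candidate zones centred on each spatial unit.

    coords:   (n, 2) array of centroid coordinates
    max_size: maximum number of units allowed in a zone
    Returns a set of frozensets of unit indices (duplicates removed).
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between centroids.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    zones = set()
    for i in range(len(coords)):
        # Units ordered by distance from centre i (the centre itself is first).
        order = np.argsort(dist[i], kind="stable")
        for k in range(1, max_size + 1):
            zones.add(frozenset(order[:k].tolist()))
    return zones

# Four hypothetical centroids on a line, zones of at most two units.
zones = circular_zones([[0, 0], [1, 0], [3, 0], [7, 0]], max_size=2)
```

An elliptical or graph-based variant would change only how the candidate zones are enumerated; the statistic evaluated on each zone stays the same.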
Lastly, the FMASSS can be extended to spatiotemporal frameworks in which the outcome measure and longitudinal confounders are measured on different time scales (e.g. an outcome measured annually and a longitudinal confounding factor measured monthly). In this context, the longitudinal data can be represented by a random function for each of the outcome time units.
Appendix A An explicit intercept estimator in the Poisson model under the null hypothesis
Under the null hypothesis, the truncated log-likelihood function (6) associated with the Poisson model (11) is given by:
[TABLE]
and has the following first partial derivative with respect to the intercept:
[TABLE]
It should be borne in mind that the MLEs of the intercept and of the other model coefficients have to satisfy the first-order condition:
[TABLE]
Therefore, the intercept estimator has an explicit expression in terms of the other coefficient estimators:
[TABLE]
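For orientation, the argument of this appendix can be sketched for a generic Poisson log-linear model with an offset. The notation is an assumption and not necessarily that of equations (6) and (11): $Y_i$ is the count in spatial unit $i$, $n_i$ the population at risk, $\alpha$ the intercept, and $\eta_i$ the combined contribution of all the covariate terms (including the truncated functional term), so that $Y_i \sim \mathrm{Poisson}(n_i e^{\alpha + \eta_i})$.

```latex
\begin{align*}
\ell(\alpha, \eta)
  &= \sum_{i=1}^{n} \left[ Y_i \,(\log n_i + \alpha + \eta_i)
     - n_i \, e^{\alpha + \eta_i} - \log (Y_i!) \right], \\
\frac{\partial \ell}{\partial \alpha}
  &= \sum_{i=1}^{n} Y_i - e^{\alpha} \sum_{i=1}^{n} n_i \, e^{\eta_i}, \\
\hat{\alpha}
  &= \log \left( \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} n_i \, e^{\hat{\eta}_i}} \right).
\end{align*}
```

Under these assumptions, setting the derivative to zero shows that the fitted expected counts $n_i e^{\hat{\alpha} + \hat{\eta}_i}$ sum to the total number of observed cases, which is the property underlying the interpretation in terms of an adjusted population.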
Appendix B Supplementary material
References
- Ahmed, M., Attouch, M., & Dabo-Niang, S. (2018). Binary functional linear models under choice-based sampling. Econometrics and Statistics, 7, 134–152.
- Bhatt, V., & Tiwari, N. (2014). A spatial scan statistic for survival data based on Weibull distribution. Statistics in Medicine, 33, 1867–1876.
- Cressie, N. (1977). On some properties of the scan statistic on the circle and the line. Journal of Applied Probability, 14, 272–283. URL: http://www.jstor.org/stable/3212998
- Cucala, L., Demattei, C., Lopes, P., & Ribeiro, A. (2013). A spatial scan statistic for case event data based on connected components. Computational Statistics, 28, 357–369.
- Cucala, L., Genin, M., Lanier, C., & Occelli, F. (2017). A multivariate Gaussian scan statistic for spatial data. Spatial Statistics, 21, 66–74.
- Cucala, L., Genin, M., Occelli, F., & Soula, J. (2018). A multivariate nonparametric scan statistic for spatial data. Spatial Statistics, 29, 1–14.
- Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics, pp. 181–187.
- Escabias, M., Aguilera, A. M., & Valderrama, M. J. (2004). Principal component estimation of functional logistic regression: discussion of two different approaches. Journal of Nonparametric Statistics, 16, 365–384.
