Climate-driven statistical models as effective predictors of local dengue incidence in Costa Rica: A Generalized Additive Model and Random Forest approach
Paola Vásquez, Antonio Loría, Fabio Sanchez, Luis A. Barboza

TL;DR
This study develops climate-based statistical models using GAM and Random Forest to predict dengue incidence in Costa Rica's diverse micro-climates, aiding targeted public health interventions.
Contribution
It introduces a combined GAM and Random Forest modeling approach for predicting dengue risk based on climate data in Costa Rica.
Findings
Models successfully predicted dengue risk in different municipalities.
Climate variables significantly influence dengue incidence.
The approach offers a tool for proactive public health planning.
Abstract
Climate has been an important factor in shaping the distribution and incidence of dengue cases in tropical and subtropical countries. In Costa Rica, a tropical country with distinctive micro-climates, dengue has been endemic since its introduction in 1993, inflicting substantial economic, social, and public health repercussions. Using the number of reported dengue cases and climate data from 2007-2017, we fitted prediction models using a Generalized Additive Model (GAM) and a Random Forest (RF) approach, which allowed us to retrospectively predict the relative risk of dengue in five climatologically diverse municipalities around the country.

Table 1: Lag, in weeks, at which each covariate attains its largest cross-correlation with the observed dengue cases.

| | Santa Cruz | Liberia | Limón | Buenos Aires | Alajuela |
|---|---|---|---|---|---|
| Humidity | 5 | 6 | 7 | 7 | 10 |
| log(precipitation) | 7 | 7 | 17 | 3 | 10 |
| Mean Temp. | 29 | 27 | 4 | 19 | 25 |
| SSTA | 27 | 28 | 14 | 27 | 22 |

Table 2: Normalized Root-Mean-Square Error (NRMSE) for each study area and method.

| | GAM | RF |
|---|---|---|
| Alajuela | 0.53 | 0.54 |
| Buenos Aires | 1.74 | 2.03 |
| Liberia | 2.43 | 2.36 |
| Limón | 1.21 | 1.24 |
| Santa Cruz | 1.13 | 1.14 |

Paola Vásquez (corresponding author: [email protected]). Present address: Escuela de Salud Pública, Universidad de Costa Rica, San Pedro de Montes de Oca, San José, Costa Rica, 11501.
Antonio Loría. Centro de Investigación en Matemática Pura y Aplicada (CIMPA), Escuela de Matemática, Universidad de Costa Rica, San Pedro de Montes de Oca, San José, Costa Rica, 11501. Email: [email protected]
Fabio Sanchez. Centro de Investigación en Matemática Pura y Aplicada (CIMPA), Escuela de Matemática, Universidad de Costa Rica, San Pedro de Montes de Oca, San José, Costa Rica, 11501. Email: [email protected]
Luis A. Barboza. Centro de Investigación en Matemática Pura y Aplicada (CIMPA), Escuela de Matemática, Universidad de Costa Rica, San Pedro de Montes de Oca, San José, Costa Rica, 11501. Email: [email protected]
1 Introduction
Dengue fever is a mosquito-borne viral infection of global significance. Currently, more than 120 tropical and subtropical countries in Africa, the Americas, and the Asia Pacific region report endemic circulation of the dengue viruses (DENV) and their main mosquito vectors, Aedes aegypti and Aedes albopictus [6, 5]. There they cause seasonal epidemics that disrupt the health and well-being of the population and inflict a substantial socioeconomic burden on households, health-care systems, and governments [24, 13].
In Costa Rica, as in most of the Americas, the reintroduction and dissemination of Aedes aegypti took place during the 1970s [19, 52]. However, it was not until September 1993 that the first dengue cases were reported on the Pacific coast [43], when autochthonous transmission of DENV-1 was confirmed [47]. Since then, three of the four serotypes of the virus (DENV-1, DENV-2, DENV-3) have circulated in the national territory, with peaks of transmission that exhibit both seasonal and inter-annual variability [43]. Over 370,000 suspected and confirmed cases have been reported by the Ministry of Health [43], of which more than 45,000 have required hospital care [10].
DENV infections represent a high burden to the country, where, as in most endemic regions, traditional control measures have proven ineffective at sustaining long-term reductions in cases [21]. Surveillance, prevention, and control of dengue therefore remain a public health challenge that requires specific and cost-effective strategies [65]. In this effort, and as part of a worldwide strategy for reducing dengue incidence, the World Health Organization (WHO) highlights the importance of determining sensitive indicators of dengue outbreaks as early warning signals [65], among which climate and weather variables have been shown to play an essential role [31, 18, 20]. Specifically, variables such as temperature, precipitation, humidity and the El Niño Southern Oscillation (ENSO) have been closely correlated with the occurrence of dengue cases and the seasonality of dengue epidemics [16, 12, 69].
Changes in these climate conditions influence the ecology of DENV by modulating vector mosquito population dynamics, viral replication, and transmission, as well as human behavior [7, 48]. It has been observed that transmission of DENV occurs between 18 °C and 34 °C, with maximal transmission in the range of 26 °C to 29 °C [46]. At higher temperatures, the duration of the mosquito life cycle decreases [72, 60], biting activity increases [62, 55, 20] and the extrinsic incubation period becomes shorter [15, 70], prolonging the infective days of the mosquito [20]. Precipitation provides habitat for the aquatic stages of the life cycle and influences vector distribution [48]. Moreover, heavy rainfall events can decrease mosquito abundance by flushing larvae from containers [36, 4], and drought events can increase the number of household water-storage containers [61]. Humidity also affects the biology of the mosquito, as low levels of humidity have been associated with lower levels of oviposition [17] and a decreased survival rate [14]. Other studies have also associated ENSO with dengue occurrence, as El Niño and La Niña events are associated with an increased probability of droughts in some areas and excess rainfall in other regions [23, 58, 22, 25].
The influence that these variables have on dengue transmission, and their potential use in the decision-making process, have prompted the use of numerous statistical models [56, 39, 40], which have shown promising results for the development and implementation of predictive models. Among them, Generalized Additive Models (GAM) and the Random Forest method (RF) have previously proven to be valuable tools for time series prediction analysis [71, 12, 30]. However, results vary among studies, as the complex role of local immunity patterns, public health interventions, population structure, and mobility means that the relationship between dengue incidence and climate variables often differs across locations [49].
Using the weekly dengue data and climate information provided by the Ministry of Health and the National Meteorological Institute, we analyzed the influence of temperature, precipitation, relative humidity and ENSO on the incidence of dengue infections in five climatologically diverse municipalities of Costa Rica from 2007-2017. With a GAM and RF approach, we used the weekly climate and dengue case information from 2007-2016 as a training set. This later allowed us, by using the observed climatological conditions, to predict the dynamics of dengue cases in 2017, the year used as a testing period.
The article is organized as follows: In Section 2, we provide details on the data and statistical methodology applied to estimate parameters, as well as the description of the model used. In Section 3, we provide the results obtained with the statistical analysis and, in Section 4, we discuss and give our conclusions.
2 Materials and Methods
2.1 Study areas
Costa Rica is a tropical country located in the Central American isthmus, between Nicaragua (north), Panamá (southeast), the Caribbean Sea (east) and the Pacific Ocean (west), administratively divided into seven provinces and 82 municipalities. With 51,100 square kilometers of land surface, the geographical location of the mountainous system, together with the trade winds, provides numerous and varied micro-climates, dividing the country into seven climatic regions: Central Valley, North Pacific, Central Pacific, South Pacific, North Caribbean, South Caribbean and North Zone, each one further divided into sub-regions [41, 35]. These multiple micro-climates have played an essential role in shaping the demographic and economic activities of the different regions, providing each one with unique characteristics [45].
Given the climatological diversity, this study was conducted in five municipalities: Santa Cruz and Liberia in the North Pacific, Buenos Aires in the South Pacific, Alajuela in the central part of the country and Limón on the Caribbean coast. Each one has a different micro-climate and endemic circulation of DENV (see Figure 1).
Liberia and Santa Cruz are located in the North Pacific climatic region, one of the driest and warmest of the country [35]. During El Niño years, both Liberia and Santa Cruz are prone to very extensive dry seasons and droughts, with high economic repercussions for the province [33]. After the re-emergence of the Ae. aegypti mosquito in Costa Rica in the 1970s, Liberia was one of the first localities where the vector was identified [47]. It was also the second municipality to report dengue infections in 1993 and the first to have a case of severe dengue in 1995 [47]. From 2007-2017, Liberia reported a total of 6,685 suspected and confirmed dengue cases, while Santa Cruz had a total of 10,527 dengue cases [43]. Peaks of dengue transmission usually start at the end of May, coinciding with the beginning of the rainy season.
Buenos Aires is located in the Province of Puntarenas in the South Pacific climatic region. The climate in this municipality is characterized as rainy with monsoon influence [35]. Despite having adequate conditions for dengue transmission, dengue virus did not reach the region until 2005 [43]. From 2007-2017 a total of 4,405 cases were reported by the Ministry of Health [43], with peaks of transmission that vary widely. On the Caribbean coast, Limón has a decrease in precipitation during the months of March, September and October [42]. A total of 7,738 cases were reported during the study period [43].
Alajuela is the most urban of the study areas. As part of the Central climatic region, this municipality is characterized by a mountainous tropical climate. The Pacific influence makes Alajuela a dry region and one of the municipalities of the province where it rains the least [34]. During the study period a total of 15,158 dengue cases were reported in Alajuela.
2.2 Data
We use two different information sources as main components in the modeling process: observed number of weekly dengue cases and climatological data.
2.2.1 Dengue Data
Data on weekly clinically suspected and confirmed dengue cases from Santa Cruz, Liberia, Limón, Alajuela and Buenos Aires, covering the period from 2007-2017, were provided by the Ministry of Health of Costa Rica. In the country, dengue is a mandatory notifiable disease, and both confirmed and probable cases are notified to the Health Surveillance Department of the Ministry of Health [44]. Confirmatory diagnosis is made for patients who live in areas where previous cases and/or confirmed circulation of the dengue virus have not been reported [44]. Figure 2 shows the number of reported dengue cases in La Niña (blue stripe) and El Niño (red stripe) phases from 2007-2017, as well as the relative humidity during that period.
2.2.2 Climate data
Local meteorological data from January 2007 to December 2017 were provided by the National Meteorological Institute (IMN) of Costa Rica. A total of five weather stations located in the study areas were active during the eleven-year period: Santa Cruz (40 m a.s.l.), Aeropuerto Liberia Oeste (89 m a.s.l.), Aeropuerto Juan Santamaría in Alajuela (913 m a.s.l.), Aeropuerto Limón (5 m a.s.l.) and Pindeco in Buenos Aires (397 m a.s.l.). These weather stations registered daily information on:
- Minimum, mean and maximum temperature: as one of the most important abiotic environmental factors affecting the biology of mosquitoes [2], the air temperature is defined as "the temperature indicated by a thermometer exposed to the air in a place sheltered from direct solar radiation" [67], measured in °C. We denote the mean temperature as $T$, and we used only this variable because of the high correlation observed among the minimum, mean and maximum temperatures across all the study areas.
- Precipitation ($P$): defined as the amount of water that has fallen at a given point over a specified period, expressed in millimeters (mm) [3].
- Relative humidity ($H$): expressed as a percentage (%), it is the ratio of the actual water vapor pressure to the saturation vapor pressure with respect to water at the same temperature and pressure [67].
- Weekly ENSO sea surface temperature ($S$): data were obtained from the Climate Prediction Center (CPC) of the NOAA. After the sea surface temperature was recognized as a key variable in ENSO [54], four regions across the Pacific equatorial belt were defined for measurements (Niño 1+2, Niño 3, Niño 3.4 and Niño 4) [50]. We included the SSTA (sea surface temperature anomalies) in the Niño 3.4 region.
Given that all the weather stations had missing observations, we filled the gaps using the method described by Alfaro and Soley (2009) [1] and its corresponding implementation in Scilab v.5.5.2, a software package initially developed by the Institut National de Recherche en Informatique et en Automatique (INRIA). The data were then aggregated to weekly values to match the temporal resolution of the dengue case data provided by the Ministry of Health. Precipitation was log-transformed to reduce the effect of outlying values, and a constant was added before the transformation so that it is defined for weeks with zero precipitation.
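The weekly aggregation and log-transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the text does not say how daily values were combined into weeks or which constant was added, so summing daily precipitation and using a constant of 1 are assumptions, and the function names are hypothetical.

```python
import math


def weekly_totals(daily_values, days_per_week=7):
    """Sum consecutive blocks of daily values; an incomplete final block is dropped."""
    n_weeks = len(daily_values) // days_per_week
    return [
        sum(daily_values[w * days_per_week:(w + 1) * days_per_week])
        for w in range(n_weeks)
    ]


def log_precipitation(weekly_mm, constant=1.0):
    """Log-transform weekly precipitation; the added constant (assumed to be 1)
    keeps the transform defined for weeks with zero rainfall."""
    return [math.log(p + constant) for p in weekly_mm]


# Hypothetical daily precipitation in mm (not study data): one wet week, one dry week.
daily = [0.0, 12.5, 3.0, 0.0, 40.0, 0.0, 1.5] + [0.0] * 7
weekly = weekly_totals(daily)
print(weekly, log_precipitation(weekly))
```

With a constant of 1, a completely dry week maps to exactly zero on the log scale, which is one common reason for that choice.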
2.3 Model Structure and Methods
The dependent variable used throughout the article is the relative risk of the $i$-th area with respect to the country:
$$RR_{i,t} = \frac{\text{Cases}_{i,t} / \text{Population}_{i,t}}{\text{Cases}_{CR,t} / \text{Population}_{CR,t}},$$
where every quantity is computed at week $t$, the subscript $CR$ refers to the whole country, and $RR_{i,t}$ is understood as a measure of relative incidence for the $i$-th study area. In evaluating the effects of climate variables on the incidence of vector-borne diseases such as dengue, predictive models such as Generalized Additive Models and Random Forests have been widely used [71, 12, 38, 9]. In what follows we briefly describe both methods and how the lag information was chosen.
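As a small worked illustration of the relative risk, the sketch below assumes it is the ratio of the local weekly incidence (cases divided by population) to the national weekly incidence. The function name and all numbers are hypothetical and are not taken from the study data.

```python
def relative_risk(local_cases, local_pop, national_cases, national_pop):
    """Ratio of local incidence to national incidence for a single week."""
    if local_pop <= 0 or national_pop <= 0:
        raise ValueError("populations must be positive")
    if national_cases == 0:
        # The ratio is undefined when no cases were reported nationally.
        return float("nan")
    local_incidence = local_cases / local_pop
    national_incidence = national_cases / national_pop
    return local_incidence / national_incidence


# Hypothetical week: 30 cases among 60,000 residents, against
# 500 cases among 5,000,000 people nationwide.
rr = relative_risk(local_cases=30, local_pop=60_000,
                   national_cases=500, national_pop=5_000_000)
print(round(rr, 2))
```

A value above 1 means the municipality had a higher incidence than the country as a whole in that week, which is what makes the measure comparable across areas of different population size.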
2.3.1 Choice of covariate lags
The overall model fit can be improved by adding lagged versions of the covariates. In this way the models can include further information from the past behavior of the variables. Following the ideas of [30] and [12], for each covariate we determined the largest cross-correlation between the observed cases and that covariate and extracted the corresponding lag. The maximum allowed lag was 30 weeks. The results are shown in Table 1 and are used as input for the models in the next sections.
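The lag-selection idea can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: it computes a Pearson correlation on the overlapping part of the two series for each lag (which differs slightly from the classical cross-correlation estimator) and picks the lag with the largest absolute value, whereas the text does not state whether signed or absolute correlations were compared. The function names and the synthetic series are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_lag(cases, covariate, max_lag=30):
    """Return (lag, r) maximizing |corr(cases[t], covariate[t - lag])|, 0 <= lag <= max_lag."""
    if len(cases) != len(covariate):
        raise ValueError("series must have the same length")
    best_lag_found, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        shifted_cases = cases[lag:]                    # cases at weeks t = lag, lag + 1, ...
        lagged_cov = covariate[:len(covariate) - lag]  # covariate at weeks t - lag
        if len(shifted_cases) < 3:
            break
        r = pearson(lagged_cov, shifted_cases)
        if abs(r) > abs(best_r):
            best_lag_found, best_r = lag, r
    return best_lag_found, best_r


# Synthetic illustration (not study data): cases echo the covariate 4 weeks later.
rng = random.Random(0)
covariate = [rng.gauss(0, 1) for _ in range(120)]
cases = [0.0] * 4 + covariate[:-4]
lag, r = best_lag(cases, covariate)
print(lag, round(r, 3))
```

In the synthetic example the selected lag is the 4-week shift built into the data, mirroring how each entry of Table 1 would be read.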
2.3.2 Generalized Additive Models
A generalized additive model (GAM) is a generalized linear model in which the linear predictor is defined as a sum of smooth functions of covariates [63]. Its main advantage is flexibility in specifying the relationship between a dependent variable and its covariates, in contrast to the classical approach of modeling that relationship through linear associations, which is not a good assumption in many applications. The general form of a GAM is:
$$g(\mu_i) = \beta_0 + \sum_{k=1}^{K} f_k(x_{ki}), \qquad (1)$$
where $y_1, \ldots, y_n$ is an independent sample of observations with respective means $\mu_1, \ldots, \mu_n$, distributed as a member of the exponential family [27], and $g$ is a link function. The $K$ covariates $x_{1i}, \ldots, x_{Ki}$ are evaluated on the smooth functions $f_1, \ldots, f_K$, and the terms in equation (1) can also contain interactions between covariates. The functions $f_k$ are chosen in most cases as penalized regression splines [63]. Penalized likelihood estimation is employed to fit the parameters in GAMs [51].
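For our purposes we defined the GAM for a single study unit as follows:
$$RR_t = \beta_0 + f_1(T_{t-l_T}) + f_2(L_{t-l_L}) + f_3(H_{t-l_H}) + f_4(S_{t-l_S}) + \epsilon_t, \qquad (2)$$
where we remove the subscript $i$ for convenience, $T$, $H$ and $S$ denote mean temperature, relative humidity and SSTA, the covariate $L_t = \log(P_t + c)$ is the log-transformed precipitation with added constant $c$, the lags $l_T, l_L, l_H, l_S$ are taken from Table 1, $\epsilon_t$ is a Gaussian error and the smooth functions $f_1, \ldots, f_4$ are penalized cubic regression splines. The estimation process of the GAM was performed with the R package mgcv [64].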
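For a Gaussian response with identity link, penalized likelihood estimation with cubic regression splines reduces to a penalized least-squares problem. As a sketch under that assumption, with notation introduced here for illustration only ($y_t$ for the response at week $t$, $x_{kt}$ for the $k$-th lagged covariate, $\lambda_k \ge 0$ for the smoothing parameters and $n$ for the number of training weeks), the fitted intercept and smooth functions minimize
$$\sum_{t=1}^{n} \Big( y_t - \beta_0 - \sum_{k=1}^{K} f_k(x_{kt}) \Big)^2 + \sum_{k=1}^{K} \lambda_k \int f_k''(x)^2 \, dx .$$
The first term rewards fidelity to the data and the second penalizes curvature: as $\lambda_k \to \infty$ the $k$-th smooth approaches a straight line, while $\lambda_k = 0$ leaves it unpenalized. In mgcv the smoothing parameters are typically selected automatically, for example by generalized cross-validation or REML.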
2.3.3 Random Forests
The essential idea of Random Forest is to construct an ensemble of trees, each grown on a bootstrap sample of the data, and to compute the predicted values as averages over the ensemble in order to reduce the prediction variance [8, 28]. This technique has several advantages over boosting methods: prediction accuracy is attained by sequentially selecting, within each tree, the covariates that maximize the efficiency of its splits, and very little parameter tuning is required [28].
For this application we used the same set of covariates and the same dependent variable as in equation (2). The training and prediction process was done with the R packages caret and randomForest [11, 37], with approximately 500 trees.
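The bootstrap-and-average idea behind Random Forest can be illustrated with a deliberately simplified pure-Python sketch. It is not the algorithm implemented in the randomForest package: each "tree" here is a single-split regression stump, the random feature subset is drawn once per tree rather than at every split, and all names, parameters and data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, feature_ids):
    """Find the single split (feature, threshold) that minimizes squared error."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            ml, mr = mean(left), mean(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # every candidate feature is constant in this sample
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, thr, left_mean, right_mean = stump
    return left_mean if row[j] <= thr else right_mean


def fit_forest(X, y, n_trees=50, n_features=2, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]         # sample rows with replacement
        feats = rng.sample(range(p), min(n_features, p))   # random subset of covariates
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all stumps in the ensemble."""
    return mean(predict_stump(s, row) for s in forest)


# Synthetic illustration (not study data): the response jumps when the
# first covariate crosses 10; the second covariate is uninformative.
X = [[i, (3 * i) % 5] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
forest = fit_forest(X, y, n_trees=50, n_features=2, seed=0)
print(predict_forest(forest, [2, 0]), predict_forest(forest, [17, 0]))
```

Averaging over many bootstrap fits is what reduces the variance of any single tree; a full implementation would grow deeper trees and resample the candidate covariates at each split.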
3 Results
Based on the number of reported dengue cases and weather information from 2007-2017, we fitted the prediction models described in Sections 2.3.2 and 2.3.3. We took the weekly information on both the dependent variable and the covariates over the period 2007-2016 as a training set for both methods, and the 52 weeks of 2017 as a testing period. Both methods were also fitted using the number of weekly observed cases as the dependent variable, but we show the models fitted with the relative risk because it makes comparison among study areas easier.
Figure 3 shows the results of the two statistical models used to predict the incidence of dengue in 2017. The dotted and solid lines correspond to the predicted relative risk of each study area over the testing period.
The predicted RR for Alajuela is notably accurate: it recovers the general decreasing trend in the observed series and also captures weeks where the incidence increases suddenly. It is also interesting to note the Limón and Buenos Aires areas, where there were some peaks of transmission during 2017 and the model successfully predicted the general behavior of those events to within one week. Santa Cruz and Liberia (both located on the Pacific coast) were the areas that proved most difficult to predict, but we were still able to identify precisely the weeks with increasing or decreasing incidence. These two study areas are marked by strong seasonal effects that can increase the serial variance within the testing period, and hence the prediction does not perform as well.
Table 2 contains the Normalized Root-Mean-Square Error (NRMSE) of each combination of method and study area.
The NRMSE is defined as follows:
$$\text{NRMSE} = \frac{1}{\overline{RR}} \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( \widehat{RR}_t - RR_t \right)^2},$$
where $\widehat{RR}_t$ is the predicted relative risk and $RR_t$ is the observed relative risk at week $t$, $n$ is the number of weeks in the testing period, and $\overline{RR}$ is the mean of the observed relative risk over the testing period. We used this measure to compare the attained dispersion of the prediction with respect to its mean behavior. Note that the best prediction in terms of this measure is attained in Alajuela followed by Buenos Aires and Santa Cruz, which is relatively consistent with the conclusions of Figure 3.
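The error measure can be computed directly from the predicted and observed series. The sketch below assumes the NRMSE is the root-mean-square error of the predictions divided by the mean of the observed relative risk over the testing period; the function name and the numbers are hypothetical.

```python
import math


def nrmse(predicted, observed):
    """Root-mean-square error divided by the mean of the observed values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    observed_mean = sum(observed) / len(observed)
    return math.sqrt(mse) / observed_mean


# Hypothetical relative risks for four weeks (not study data).
observed = [1.0, 3.0, 2.0, 2.0]
predicted = [2.0, 2.0, 2.0, 2.0]
print(nrmse(predicted, observed))
```

Because the error is scaled by the observed mean, a value near or above 1 indicates that the typical prediction error is as large as the average relative risk itself, which helps when reading Table 2 across areas with very different incidence levels.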
4 Discussion
With the recent emergence of chikungunya and Zika in the country, as well as the continued high incidence of dengue infections [43], the burden of Aedes-transmitted diseases has significantly increased. In a country where resources for vector control are limited, the urgency of implementing effective and affordable vector control mechanisms to complement existing ones [65] is at the forefront of public health policy in Costa Rica.
As the transmission dynamics of dengue infections are inextricably linked to the interplay of multiple meteorological conditions, recent significant advances in climate data availability, statistical modeling and information technology [66] have increasingly opened the possibility of using climate information as effective predictors of dengue incidence [26, 18, 39]. However, in Costa Rica, a country with tropical conditions optimal for mosquito survival, research on the extent of influence that different climate variables have on local dengue epidemiology, and on the possibility of using them as early warning signals, is still in its early stages. Although different studies have been conducted [59, 22, 53], the presence of multiple micro-climates separated by short distances makes it relevant to advocate for more localized analyses that take into account the specific and unique characteristics of each municipality.
In the current study, we collected weekly dengue incidence provided by the Ministry of Health, observed local temperature, precipitation and humidity from five weather stations provided by the National Meteorological Institute, and SSTA information from 2007-2017. These data allowed us to test the predictive capacity of the two selected models, Random Forest and Generalized Additive Models, as well as the level of climatological influence on the epidemiology of dengue infections in the selected municipalities.
Our analyses showed that, using the 2007-2016 period as a training set, both the Generalized Additive Model and the Random Forest performed well in predicting the temporal patterns of dengue incidence in 2017, the year used as a testing period. The results demonstrated that, even when the number of cases was low, as was the case in Buenos Aires, the model accurately predicted the onset of the outbreak. However, its predictive accuracy differed by region: for the localities on the North Pacific coast, Liberia and Santa Cruz, the model overpredicted the number of cases. Hence, further exploration is needed to identify whether the model in fact overpredicted the number of cases or whether there was under-reporting by health officials in those specific regions. Dengue is a disease with diverse and unspecific symptoms, and laboratory identification is crucial to monitor the behavior of the virus. In its 2017 annual report, the laboratory responsible for coordinating the virological surveillance of arboviruses at the national level highlighted the low number of samples sent for dengue confirmation by municipalities in the province of Guanacaste during that year [32]. Also, other factors intrinsic to the local epidemiological dynamics are likely to play a crucial and different role in certain years across the different locations. Variables such as socio-economic conditions, human mobility, population herd immunity to the different dengue serotypes, and the intensity of public health strategies were not included in the model, which limits the accuracy of prediction. Increased control activities during certain periods of the year, such as the beginning of the school year in Mexico [29], can significantly change the dynamics of dengue transmission.
The efficacy of the model also depends on the availability of accurate climate information over the training and testing periods. In its current form, the model uses observed climatological conditions as covariates, which makes prediction dependent on the availability of such information over the study areas. In addition, all of the weather stations presented missing information, so a statistical method was used to complete the series. The development of accurate climate forecasts represents a major challenge, particularly due to the short timescales of the forecasting methods used in the country. Further work is in progress to explore alternative sources of local meteorological information as predictors of DENV incidence.
Despite these limitations, results from this study suggest that large-scale climate and local weather factors can potentially be used as effective tools in the decision-making process of local public health authorities. The study also shows, as in previous studies [22, 57], the importance of statistical models as instruments for the rapid analysis of information generated by different local and national institutions, as they could enhance the management of early epidemic response and preventive measures in Costa Rica. However, the development of tailored climate products and services that can be fully mainstreamed into public health decision-making is a collaborative process that would require inter-institutional integration of expertise and data [68], involving the Ministry of Health, the National Meteorological Institute and the National Census Bureau, among others. Such collaboration could have a positive impact on the management not only of mosquito-borne diseases, but of all the other climate-sensitive diseases that affect the country.
Acknowledgements
We thank the Research Center in Pure and Applied Mathematics and the Mathematics Department at Universidad de Costa Rica for their support during the preparation of this manuscript. The authors gratefully acknowledge institutional support for project B8747 from a UCREA grant from the Vice Rectory for Research at Universidad de Costa Rica. We would like to thank the Ministry of Health and the National Meteorological Institute for providing the necessary dengue incidence data and climate information. We also thank Oscar Calvo-Solano for his help in completing the climate data. This article is part of a thesis project for the master's degree in Public Health at the University of Costa Rica.
References
- [1] E.J. Alfaro, F.J. Soley, Descripción de dos métodos de rellenado de datos ausentes en series de tiempo meteorológicas, Revista de Matemática: Teoría y Aplicaciones, 16, (2009), no. 1, DOI 10.15517/rmta.v16i1.1419.
- [2] B.W. Alto, D. Bettinardi, Temperature and dengue virus infection in mosquitoes: independent effects on the immature and adult stages, Am. J. Trop. Med. Hyg., 88, (2013), no. 3, 497–505, DOI 10.4269/ajtmh.12-0421.
- [3] American Meteorological Society, Precipitation. Glossary of Meteorology, (2012). Available from: https://www.ametsoc.org/index.cfm/ams/publications/glossary-of-meteorology/. Accessed Feb 19, 2019.
- [4] C.M. Benedum, O. Seidahmed, E. Eltahir, N. Markuzon, Statistical modeling of the effect of rainfall flushing on dengue transmission in Singapore, PLoS Negl Trop Dis, 12, (2018), no. 12, DOI 10.1371/journal.pntd.0006935.
- [5] S. Bhatt, P. Gething, O. Brady, J. Messina, A. Farlow, et al., The global distribution and burden of dengue, Nature, 496, (2013), no. 7446, 504–50, DOI 10.1038/nature12060.
- [6] O.J. Brady, P.W. Gething, S. Bhatt, J.P. Messina, J.S. Brownstein, et al., Refining the global spatial limits of dengue virus transmission by evidence-based consensus, PLoS Negl Trop Dis, 6, (2012), no. 8, DOI 10.1371/journal.pntd.0001760.
- [7] O.J. Brady, N. Golding, D. Pigott, M. Kraemer, J.P. Messina, et al., Global temperature constraints on Aedes aegypti and Ae. albopictus persistence and competence for dengue virus transmission, Parasit Vectors, 7, (2014), no. 1, DOI 10.1186/1756-3305-7-338.
- [8] L. Breiman, Random Forests, Machine Learning, 45, (2001), no. 1, 5–32, DOI 10.1023/A:1010933404324.
