The evolution of public health statistical modeling approaches and how to advance their incorporation into modern arboviral surveillance
Maggie McCarter, Stella C W Self, Alex Ewing, Mufaro Kanyangarara, Sarah M Gunter, Melissa S Nolan

TL;DR
This paper reviews the history of disease modeling and suggests ways to better integrate these models into public health efforts for arboviruses.
Contribution
The paper proposes strategies to incorporate modern statistical models into public health practice for arboviral surveillance.
Findings
Disease modeling has evolved but remains largely academic.
Arboviruses are often excluded from public health modeling.
Recommendations are provided for integrating models into practice.
Abstract
Statistical modeling of infectious disease transmission patterns has existed since the mid-1700s, evolving in its utility as the scientific and technological revolutions progressed. Despite the expansion of emerging mathematical and statistical methodologies over the past 250 yr, their usage has largely remained restricted to academic settings. This forum article will discuss the evolution of disease modeling techniques, the most common types of models in use today, and recommendations on how key archetypes can be incorporated into future public health practice. With the recent global impetus to predict and forecast novel pathogens, this article raises the question: Why are endemic arboviruses not included in public health modeling efforts, and how can medical entomologists promote their inclusion?
Graphical Abstract
Figure 1. National Institute of Allergy and Infectious Diseases of the National Institutes of Health
The SARS-CoV-2 pandemic highlighted the importance of mathematical models for public health professionals, as many sought to understand this novel pathogen and its likely trajectory. During the early and highly uncertain stages of the pandemic, governments and other stakeholders relied heavily on predictive modeling to identify at-risk areas and forecast infection rates to guide decision-making. While the pandemic resulted in abundant funding for SARS-CoV-2 modeling and led to the establishment of the US Centers for Disease Control and Prevention’s Center for Forecasting and Outbreak Analytics, there remains little funding for disease forecasting of important arboviruses, such as West Nile virus (WNV), dengue virus (DENV), and La Crosse encephalitis virus (LACV) (Redfield 2020, Centers for Disease Control and Prevention Justification of Estimates for Appropriation Committees Fiscal Year 2025, Centers for Disease Control and Prevention Justification of Appropriation Estimates for Appropriations Committees Fiscal Year 2026). These diseases face chronic under-attention and recurrent cyclical outbreaks. The recent mathematical modeling renaissance raises the question: Why is such an essential public health tool underutilized and underfunded for endemic diseases, such as WNV, DENV, and LACV? Further, what steps can be taken to increase the integration of these evidence-based prediction tools into routine public health planning practice? This forum article’s objective is to highlight the inequities around vector-borne disease modeling and prompt meaningful conversation to create policy-level changes for integration of forecast models into public health practice for proactive, evidence-based resource allocation.
This 3-part article will discuss (i) the origin and evolution of infectious disease modeling, (ii) the fundamentals of mathematical and statistical models and their specific applicability, and (iii) recommendations to promote the acceleration of modeling techniques for vector-borne diseases and adoption of arboviral predictive models among public health professionals.
History and Evolution of Infectious Disease Modeling
Infectious Disease Mathematical Modeling in the Pre-Computer Era
Although mathematics as a discipline originated in ancient times, the first paper that applied mathematical modeling to infectious disease was not published until 1766 (Bernoulli 1766). The recognition of pathogens and the molecular basis of infection were emerging concepts during the transition between the Renaissance and the Scientific Revolution, affording Bernoulli the unique opportunity to combine an established discipline (mathematics) with an emerging one (infectious disease) (Adams 2021). Smallpox ravaged Europe and North America during the 18th-century wars, and efforts were being made to control the disease. Swiss mathematician Daniel Bernoulli introduced a model to assess the effect of smallpox inoculation on smallpox mortality using a formula derived from children’s life expectancy at birth, lifetime infection risk, and smallpox virulence (Bernoulli 1766, Dietz and Heesterbeek 2002, Siettos and Russo 2013). He found that immunization against the disease increased the average life expectancy of children by 3 yr. His research, along with Edward Jenner’s research on vaccinations, led to the widespread acceptance of the smallpox vaccine (Jenner 1801). However, even though Bernoulli’s introduction of mathematical modeling to infectious disease proved extremely useful, these methods did not become popular until the early 1900s, when mathematical epidemiology was first formally introduced by Ronald Ross; his work was later extended by George Macdonald (R. Ross 1905, S. R. Ross 1911).
The “Ross-Macdonald” transmission dynamics mathematical model was the first recorded vector-borne disease transmission model (Smith et al. 2012). This model defined a reproduction number, R0, for malaria as a function of susceptible versus infected individuals and interactions between individuals and mosquitoes. The transmissibility of a pathogen, represented by R0 (originally designated Z0), still plays an important role in modern prediction modeling. An R0 >1 indicates increasing malaria transmission, whereas an R0 ≤1 indicates stable or declining malaria transmission (Baum et al. 2020, Jin et al. 2020). Pyotr En’ko built models involving dynamics between infectious and susceptible individuals; Reed and Frost further described such models. These researchers were the first to use Markov chain methods for infectious disease (En’ko 1989, Siettos and Russo 2013). The resulting surge in the role of mathematics in infectious disease was further catalyzed by the introduction of computers in the mid-1900s (En’ko 1989, Siettos and Russo 2013).
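As a hedged illustration of the threshold behavior described above, the sketch below evaluates one commonly cited form of the Ross-Macdonald reproduction number, R0 = m a^2 b c exp(-g n) / (r g). The parameterization, the function names, and every parameter value are illustrative assumptions and are not taken from this article.

```python
import math

def ross_macdonald_r0(m, a, b, c, g, n, r):
    """One common form of the Ross-Macdonald R0 (assumed parameterization).

    m: mosquitoes per human          a: bites per mosquito per day
    b: P(mosquito -> human) per bite c: P(human -> mosquito) per bite
    g: mosquito death rate per day   n: extrinsic incubation period (days)
    r: human recovery rate per day
    """
    return (m * a**2 * b * c * math.exp(-g * n)) / (r * g)

def classify_transmission(r0):
    """Apply the threshold rule: R0 > 1 increasing, otherwise stable or declining."""
    return "increasing" if r0 > 1 else "stable or declining"

# Hypothetical settings that differ only in mosquito density (m).
r0_high = ross_macdonald_r0(m=10.0, a=0.3, b=0.5, c=0.5, g=0.1, n=10, r=0.01)
r0_low = ross_macdonald_r0(m=0.1, a=0.3, b=0.5, c=0.5, g=0.1, n=10, r=0.01)

print(round(r0_high, 2), classify_transmission(r0_high))
print(round(r0_low, 2), classify_transmission(r0_low))
```

Because the biting rate enters as a squared term and mosquito mortality appears both in the denominator and in the exponent, interventions that reduce biting or shorten mosquito lifespan change R0 more than proportionally under this form.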
The Technological Revolution of Infectious Disease Modeling
The 1960s marked the proliferation of computer use in multiple disciplines. Historically, mathematical models, though advanced for their time, were limited in their scope. The widespread adoption of computer technology led to pronounced growth in infectious disease modeling, as computers allowed scientists and mathematicians to use advanced models that were otherwise nearly impossible to implement manually. This revolution led to the introduction of one of the more common means of fitting infectious disease models: the Markov chain Monte Carlo (MCMC) method. The Monte Carlo probabilistic sampling technique was first introduced during World War II, and the MCMC algorithm extended it in 1953; however, MCMC was not widely used by statisticians until 1990 (Gelfand et al. 1990, Robert and Casella 2011). Unlike the standard Monte Carlo algorithm, which randomly samples directly from the probability distribution in question, MCMC sampling is an iterative process, whereby each sample is drawn from a probability distribution that depends on the previous set of sampled values, creating a “chain” of sampled values. Under appropriate conditions, the values in this chain converge to a sample from the target probability distribution. This method is used to estimate model parameters and predict outcomes from existing data, which proved extremely useful in the modeling of diseases, such as HIV infections (Peterson et al. 1990, Lange et al. 1992, Wild et al. 1993). MCMC methods were increasingly used in infectious disease epidemiology to fit Bayesian regression models, as a supplement to traditional surveillance techniques (O’Neill et al. 2000).
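The "chain" idea can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC algorithms. The example below is a minimal sketch under stated assumptions: the weekly case counts are hypothetical, the Poisson likelihood and Gamma(2, 1) prior are chosen only because the exact posterior is known and can be used as a check, and all function names are illustrative.

```python
import math
import random

def log_posterior(lam, counts, shape=2.0, rate=1.0):
    """Unnormalized log posterior for a Poisson rate with a Gamma(shape, rate) prior."""
    if lam <= 0:
        return -math.inf
    log_prior = (shape - 1.0) * math.log(lam) - rate * lam
    log_lik = sum(y * math.log(lam) - lam for y in counts)
    return log_prior + log_lik

def metropolis(counts, n_iter=20000, step=0.5, start=1.0, seed=1):
    """Random-walk Metropolis: each proposal depends only on the current value."""
    rng = random.Random(seed)
    chain = [start]
    current_lp = log_posterior(start, counts)
    for _ in range(n_iter):
        proposal = chain[-1] + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, counts)
        # Accept with probability min(1, posterior ratio); otherwise stay put.
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            chain.append(proposal)
            current_lp = proposal_lp
        else:
            chain.append(chain[-1])
    return chain

counts = [3, 5, 4, 6, 2, 4]            # hypothetical weekly case counts
chain = metropolis(counts)
kept = chain[2000:]                     # discard early iterations as burn-in
mcmc_mean = sum(kept) / len(kept)
exact_mean = (2.0 + sum(counts)) / (1.0 + len(counts))  # conjugate Gamma posterior mean

print(round(mcmc_mean, 2), round(exact_mean, 2))
```

In realistic disease models the posterior has no closed form, which is why sampling from a chain is used in place of the exact calculation shown here as a check.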
Modern Accessibility of Infectious Disease Modeling Computational Techniques
The late 1990s and early 2000s marked the beginnings of applied practitioners’ awareness of accessible infectious disease modeling. More interpretable autoregressive integrated moving average (ARIMA) models gained popularity in infectious disease prediction among epidemiologists. In 2009, researchers proposed the integrated nested Laplace approximation (INLA) technique to further reduce computational expense in modeling (Rue et al. 2009). The INLA technique is a streamlined approach to approximating Bayesian models and can often be used instead of the standard MCMC approach. The INLA technique assumes the latent structure of a model follows a Gaussian Markov random field and assumes that observations are independent given latent effects. Rather than sampling from the joint posterior distribution of a model, as MCMC does, the INLA technique approximates the marginal posterior distributions of the model parameters, greatly reducing computational expense (Lindgren and Rue 2015).
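The core idea behind INLA, replacing sampling with a Gaussian (Laplace) approximation to a posterior, can be shown on a one-parameter toy problem. This is a minimal sketch of a single Laplace approximation only, not the nested scheme or the R-INLA software; the counts, the Gamma(2, 1) prior, and the function name are illustrative assumptions.

```python
import math

def laplace_approximation(counts, shape=2.0, rate=1.0):
    """Gaussian approximation to the posterior of a Poisson rate.

    With a Gamma(shape, rate) prior, the log posterior is
    k * log(lam) - t * lam + constant, where k and t are defined below.
    """
    k = shape + sum(counts) - 1.0      # coefficient on log(lam)
    t = rate + len(counts)             # coefficient on lam
    mode = k / t                       # where the first derivative is zero
    curvature = -k / mode**2           # second derivative at the mode
    sd = math.sqrt(-1.0 / curvature)   # Gaussian sd from the curvature
    return mode, sd

counts = [3, 5, 4, 6, 2, 4]            # hypothetical weekly case counts
approx_mode, approx_sd = laplace_approximation(counts)

# Exact Gamma posterior moments, available here only because the toy model is conjugate.
exact_mean = (2.0 + sum(counts)) / (1.0 + len(counts))
exact_sd = math.sqrt(2.0 + sum(counts)) / (1.0 + len(counts))

print(round(approx_mode, 3), round(approx_sd, 3))
print(round(exact_mean, 3), round(exact_sd, 3))
```

No random draws are needed, which is the source of the computational savings; the price is that the result is an approximation whose quality depends on how close to Gaussian the posterior is.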
Although models like ARIMA and fitting techniques such as INLA were being increasingly utilized in infectious disease modeling, they were primarily carried out and interpreted by highly trained statisticians. Because of this, many predictive models were often seen as theoretical rather than practical public health tools. The emergence of the SARS-CoV-2 pandemic brought renewed attention to the importance of infectious disease modeling in public health practice. Government and private institutions became newly dependent on these models to forecast the spread of the virus (Ray et al. 2020, Nikolopoulos et al. 2021, Nixon et al. 2022). However, as implementing and interpreting such models historically required training in advanced mathematics or statistics, those in public health without such backgrounds were often poorly equipped to make use of these models. Subsequently, advanced mathematicians/statisticians who previously focused primarily on the theoretical development of these models sought out contextual knowledge of virus spread and population interactions to make these predictive models useful for interpretation and application to public health practice (Ding et al. 2021). There is, therefore, a need for the merging of specialties in the broader health arena; public health scientists would benefit from a better understanding of the process and interpretations of infectious disease modeling, whereas mathematicians/statisticians would benefit from a deeper knowledge of the ecology and transmission of pathogens.
Modeling Fundamentals and Their Specific Disease System Applicabilities
To more broadly utilize infectious disease modeling in public health, it is imperative that public health professionals understand forecasting model basics, how they are applied, and how to interpret them (Lutz et al. 2019). Fundamental to this understanding is a recognition of the differences between the types of models themselves and the different model fitting procedures.
Model Types
There are several families or classes of statistical models, which may fall into either the frequentist or the Bayesian paradigm. Many infectious disease forecasting models are regression models with distributional assumptions about the observed data. Models forecasting disease counts across time often use a Poisson or negative binomial distribution (Haight 1967). Conversely, models forecasting continuous variables such as mosquito population density frequently use linear regression or quantile regression procedures. Regression forecasting models are useful in assessing changes in model parameters and how such changes might affect future forecasts. Because the estimates of the model parameters carry uncertainty, appropriate measures of error should be reported alongside them (Davydenko and Fildes 2013). Because spatial autocorrelation is problematic in disease forecasting, especially for arboviral diseases, spatio-temporal regression models have been developed to account for it (Legendre 1993). These models incorporate both temporal and spatial dependence, allowing for area-specific disease forecasting. Multiple methods of estimating regression forecasting parameters exist; maximum likelihood estimation is the most common for frequentist models, and MCMC, INLA, and maximum a posteriori estimation are popular for Bayesian models.
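To make the count-regression idea concrete, the sketch below fits a Poisson regression with a log link by maximum likelihood using Newton's method and then extrapolates one week ahead. It is a minimal illustration under stated assumptions: the weekly counts are hypothetical, a single covariate (week number) stands in for the climatic or demographic covariates a real model would use, and no uncertainty or spatial structure is modeled.

```python
import math

def fit_poisson_regression(x, y, n_iter=25):
    """Fit log(mu) = b0 + b1 * x by maximum likelihood (Newton's method)."""
    b0 = math.log(sum(y) / len(y))     # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system by Cramer's rule.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

weeks = [0, 1, 2, 3, 4, 5]             # hypothetical surveillance weeks
cases = [1, 2, 4, 6, 12, 18]           # hypothetical weekly case counts

b0, b1 = fit_poisson_regression(weeks, cases)
fitted = [math.exp(b0 + b1 * w) for w in weeks]
forecast_week6 = math.exp(b0 + b1 * 6)

print(round(b0, 3), round(b1, 3), round(forecast_week6, 1))
```

A negative binomial model would follow the same pattern with an added dispersion parameter, which is often preferred when counts vary more than a Poisson distribution allows.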
Spatiotemporal Models
Within the context of arboviruses, spatio-temporal regression models are heavily utilized, as researchers often seek not only to forecast these vector-borne diseases but also to understand the spatial, environmental, and population features associated with increased disease rates in both humans and vectors. WNV and DENV are both modeled heavily by scientists exploring the relationships between disease incidence and various population, climatic, and environmental factors (Chancey et al. 2015, Marini et al. 2016, Butterworth et al. 2017, Murdock et al. 2017, Ryan et al. 2019, Maljkovic Berry et al. 2020, Keyel et al. 2021, Marinho et al. 2022, Holcomb et al. 2023, Harish et al. 2024). These relationships are not necessarily causal; nonetheless, quantifying how changes in the socio-demographic and environmental context of a disease correlate with changes in disease incidence can aid in prediction. Researchers have used Bayesian Poisson regression models, often estimated by the popular MCMC approach, to assess factors associated with disease incidence in multiple countries and found that factors such as rainfall, temperature, and sociodemographic markers all predicted dengue incidence (Phanitchat et al. 2019, Akter et al. 2021, Solís-Navarro et al. 2022). Other studies utilized newer model fitting techniques such as INLA to assess factors associated with WNV incidence, with findings similar to those of studies using MCMC fitting methods (Myer et al. 2017, Myer and Johnston 2019). Bayesian logistic regression models are frequently used to assess the presence/absence of DENV vectors, as well as WNV vectors, in certain areas (Hettiarachchige et al. 2018, Temple et al. 2022). Regression forecasting models are not currently used to predict LACV incidence, though recent attention given to infectious disease forecasting could prompt such research in the future (McCarter et al. 2025).
Compartmental Models
As opposed to regression models that assess model parameters and their influence on disease forecasts with distributional assumptions, compartmental models assess interactions between people, vectors, and disease dynamics, with no underlying distributional assumptions. These models are often used to predict disease spread, the number of infected individuals, and how long an epidemic is forecast to last (Hethcote 1989). In compartmental models, individuals in a population are assigned to various “compartments”; the most basic 3 are “susceptible,” “infectious,” and “recovered.” This model, known as the SIR model, is the simplest and most foundational compartmental model (Harko et al. 2014, Kröger and Schlickeiser 2020). However, because diseases differ in their population and transmission dynamics, other compartments have been added to account for these differences. Models such as the Susceptible Exposed Infectious Recovered (SEIR) model account for latency periods where an individual is infected but not yet infectious. Compartmental models can be used to assess the impact of hypothetical changes in transmission dynamics, such as reducing the number of susceptible individuals through vaccination or reducing the amount of contact between infectious and susceptible individuals.
Compartmental models can be paired with distributional assumptions to create either a Bayesian or frequentist statistical model or can be strictly mathematical with no uncertainty quantification. These models quantify the interactions and dynamics of each of these compartments using a system of differential (or difference) equations, which is solved to forecast disease incidence and spread. These dynamics are generally assumed to be causal; for example, an increase in the number of infected individuals will cause a subsequent increase in the number of recovered individuals. Compartmental models are heavily utilized in vector-borne disease forecasting. SIR and SEIR models have historically proven useful in predicting DENV infection rates (Newton and Reiter 1992, Esteva and Vargas 1998). Mathematical epidemiologists developed multiple models to assess WNV transmission between avian hosts, mosquito vectors, and humans (DeFelice et al. 2017, Angelou et al. 2021). A compartmental model predicting LACV in the Appalachian region of the United States assessed the dynamics in systems with Aedes triseriatus and Aedes albopictus and found that LACV transmission occurred in most models with tree-hole mosquitoes as the primary vector (Bewick et al. 2016).
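A strictly mathematical SIR model, with no uncertainty quantification, can be sketched by stepping the three difference equations forward in time. The transmission rate, recovery rate, population size, and vaccination coverage below are illustrative assumptions; a vector-borne model would add mosquito (and possibly reservoir host) compartments, and an SEIR model would add an exposed compartment between S and I.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Forward-Euler solution of dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I,
    dR/dt = gamma*I. Returns a list of (time, S, I, R) tuples."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history

# Hypothetical outbreak: 1 infectious person in a population of 1,000.
baseline = simulate_sir(beta=0.4, gamma=0.1, s0=999, i0=1, r0=0, days=160)

# Hypothetical intervention: 80% of susceptibles start in the removed
# compartment (for example, through vaccination).
vaccinated = simulate_sir(beta=0.4, gamma=0.1, s0=199.8, i0=1, r0=799.2, days=160)

peak_baseline = max(row[2] for row in baseline)
peak_vaccinated = max(row[2] for row in vaccinated)
print(round(peak_baseline, 1), round(peak_vaccinated, 1))
```

Comparing the two runs is the kind of "what if" analysis the text describes: the equations stay the same and only the starting compartments change.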
Agent-Based Models
Agent-based models are similar to compartmental models, but rather than dividing the population into compartments, they model each individual in the population separately (Technical Explainer: Infectious Disease Transmission Models 2025). These individuals (referred to as “agents”) are assigned a disease state (eg susceptible, infectious, or recovered) and other characteristics that may be relevant for transmission dynamics (eg age and occupation). The agents interact according to a set of prespecified rules, which govern the degree of interaction between individuals with different characteristics, the chances of infection spreading during interactions, and the duration of infection. When modeling arboviral diseases, the agents are generally a mixture of vectors, reservoir host animals, and humans. Agent-based models tend to be computationally expensive and require extensive data about the population of interest. Nevertheless, agent-based models have been successfully used to model DENV and, to a lesser extent, WNV (Jacintho et al. 2010, Nasrinpour et al. 2019, Perkins et al. 2019).
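The rule-based structure of an agent-based model can be sketched in a few lines. The example below is a deliberately simplified, human-only illustration with random mixing; an arboviral model would also include vector and reservoir host agents. All rules and parameter values (contacts per day, transmission probability, infectious period) are illustrative assumptions.

```python
import random

def run_agent_based_model(n_agents=200, n_initial=3, contacts_per_day=4,
                          p_transmit=0.1, infectious_days=5, n_days=60, seed=42):
    """Track each agent's disease state individually; return daily (S, I, R) counts."""
    rng = random.Random(seed)
    state = ["S"] * n_agents           # every agent starts susceptible
    days_left = [0] * n_agents         # remaining infectious days per agent
    for idx in rng.sample(range(n_agents), n_initial):
        state[idx] = "I"
        days_left[idx] = infectious_days

    history = []
    for _day in range(n_days):
        infectious = [a for a in range(n_agents) if state[a] == "I"]
        newly_infected = set()
        # Rule 1: each infectious agent meets a few randomly chosen agents.
        for _agent in infectious:
            for _contact in range(contacts_per_day):
                other = rng.randrange(n_agents)
                if state[other] == "S" and rng.random() < p_transmit:
                    newly_infected.add(other)
        # Rule 2: infectious agents recover after a fixed number of days.
        for agent in infectious:
            days_left[agent] -= 1
            if days_left[agent] == 0:
                state[agent] = "R"
        # Rule 3: agents infected today become infectious tomorrow.
        for agent in newly_infected:
            state[agent] = "I"
            days_left[agent] = infectious_days
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history

abm_history = run_agent_based_model()
print(abm_history[-1])
```

Because outcomes depend on random draws, such models are usually run many times with different seeds and summarized across runs, which contributes to their computational cost.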
Machine-Learning Models
Machine-learning models are increasingly being utilized as high-performance computers are developed and as surveillance datasets increase in size. Machine-learning models are a very broad class of statistical archetypes in which model parameters are learned (“trained”) from past data and can be updated as new data are encountered (Mitchell and Mitchell 1997). While other models, such as regression models, make inferences on relationships between disease incidence and other cofactors, machine-learning techniques typically focus solely on the most accurate predictions possible with less emphasis on model interpretability or explainability. Like regression models, machine-learning models do not require causal relationships between the covariates and the outcome (disease incidence) but exploit the phenomenological relationships between these factors to predict disease incidence. These predictions are made by using a subset of data to “train” the model and then using the remaining data to evaluate the performance of the trained model. Many researchers recently utilized machine learning to accurately predict WNV and DENV outbreaks across the globe (Farooq et al. 2022, Nguyen et al. 2022, Roster et al. 2022, Tonks et al. 2022). Such models can be factor-based, often including environmental and climatic factors in their forecasting. However, the relationship between these factors and the disease in question is often not elucidated by a machine-learning model, and it may be difficult to ascertain how changes in these factors will influence disease incidence. Additionally, these methods are often extremely computationally expensive and require large datasets to be accurate; therefore, they are often reserved for contexts in which access to high-performance computers is feasible.
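The train-then-evaluate workflow can be sketched with a very small machine-learning method, k-nearest-neighbor regression. Everything below is an assumption made for illustration: the data are synthetic (generated from an arbitrary rule plus noise, not real surveillance data), the two covariates are stand-ins for climatic factors, and k-nearest neighbors is chosen for brevity, not because the studies cited above used it.

```python
import random

def make_synthetic_weeks(n=120, seed=7):
    """Generate synthetic (temperature, rainfall, cases) rows for illustration only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        temp = rng.uniform(15.0, 35.0)
        rain = rng.uniform(0.0, 100.0)
        cases = max(0.0, 0.8 * temp + 0.1 * rain + rng.gauss(0.0, 2.0))
        rows.append((temp, rain, cases))
    return rows

def knn_predict(train, temp, rain, k=5):
    """Average the outcomes of the k training rows closest in covariate space."""
    def distance(row):
        # Rainfall is divided by 5 so both covariates span a similar range.
        return (row[0] - temp) ** 2 + ((row[1] - rain) / 5.0) ** 2
    nearest = sorted(train, key=distance)[:k]
    return sum(row[2] for row in nearest) / k

def mean_absolute_error(train, test, k=5):
    """Score predictions only on rows the model never saw during training."""
    errors = [abs(knn_predict(train, t, r, k) - c) for t, r, c in test]
    return sum(errors) / len(errors)

data = make_synthetic_weeks()
random.Random(0).shuffle(data)
split = int(0.8 * len(data))            # 80% to train, 20% held out
train_rows, test_rows = data[:split], data[split:]

mae = mean_absolute_error(train_rows, test_rows)
print(len(train_rows), len(test_rows), round(mae, 2))
```

The model returns predictions without any coefficient describing how temperature or rainfall relates to cases, which illustrates the interpretability trade-off noted above.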
Frequentist Versus Bayesian Modeling
Most infectious disease models can be categorized as either mathematical models or statistical models, with statistical models providing both estimates and uncertainty quantification (eg confidence intervals) and the mathematical models providing estimates only. Statistical models may be further classified as either frequentist or Bayesian. Frequentist statistics consider model parameters as fixed but unknown constants which are estimated from observed data. Uncertainty is typically quantified via confidence or prediction intervals derived from the sampling distribution of the model parameters; upon repeated sampling, the estimated interval captures the parameter of interest a certain pre-specified percentage of the time (often 95% is used). Within the frequentist analysis framework, researchers can perform hypothesis tests using P-values to quantify the probability of the observed event, or a “more extreme” event, occurring if the null hypothesis were true (Cox 2006). The frequentist framework is more familiar and widely used in various disciplines, as it requires fewer assumptions regarding the model parameters than its Bayesian counterpart.
Bayesian statistics derives its name from Bayes’ theorem, in which the probability of an event is directly related to previous knowledge of the parameters related to that event (Joyce 2003). For example, in forecasting WNV in the United States, models using Bayesian forecasting rely on prior established WNV knowledge such as previous infection rates to predict rates for the future (Myer and Johnston 2019). Thus, Bayesian statistics can be more useful for predicting or modeling endemic diseases than for novel or invasive vectors and their associated diseases. In the Bayesian framework, model parameters are random quantities with associated distributions; the prior distribution of the parameters reflects what is known about the parameters prior to observing any data. After observing data, Bayes’ theorem is used to update the prior distribution to obtain the posterior distribution. Rather than the confidence intervals produced from frequentist analyses, Bayesian inference often relies on credible intervals for model parameters. The typical interpretation of a Bayesian 95% credible interval is that given the observed data, there is a 95% probability that the parameter falls into the interval (Stone 2013, van de Schoot et al. 2021). Bayesian statistics is becoming more common in research, as it offers multiple advantages over frequentist frameworks. Bayesian analysis allows researchers to assess ranges of certainty of results within credible intervals rather than simply point estimates, as well as calculate probability distributions of said results (van de Schoot et al. 2021). Additionally, performing inference on complex models is often computationally easier and more reliable in the Bayesian paradigm.
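The prior-to-posterior update and the two kinds of interval can be compared on a toy example. The sketch below assumes hypothetical weekly counts, a Poisson likelihood, and a Gamma(2, 1) prior (chosen because the posterior is then also a Gamma distribution); the frequentist interval uses a simple normal (Wald) approximation. None of these choices come from the article.

```python
import math
import random

counts = [3, 5, 4, 6, 2, 4]             # hypothetical weekly case counts

# Bayesian: Gamma(shape, rate) prior updated by Poisson data (conjugate update).
prior_shape, prior_rate = 2.0, 1.0
post_shape = prior_shape + sum(counts)
post_rate = prior_rate + len(counts)
post_mean = post_shape / post_rate

# 95% credible interval from Monte Carlo draws of the posterior.
# random.gammavariate takes (shape, scale), so the scale is 1 / rate.
rng = random.Random(0)
draws = sorted(rng.gammavariate(post_shape, 1.0 / post_rate) for _ in range(20000))
cred_low = draws[int(0.025 * len(draws))]
cred_high = draws[int(0.975 * len(draws))]

# Frequentist: point estimate with a 95% Wald confidence interval.
mean_count = sum(counts) / len(counts)
std_error = math.sqrt(mean_count / len(counts))
wald_low = mean_count - 1.96 * std_error
wald_high = mean_count + 1.96 * std_error

print("credible interval:", round(cred_low, 2), round(cred_high, 2))
print("confidence interval:", round(wald_low, 2), round(wald_high, 2))
```

The two intervals can be numerically similar, but their interpretations differ as described above: the credible interval is a probability statement about the parameter given the data, whereas the confidence interval describes the long-run behavior of the procedure under repeated sampling.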
Expanding Arboviral Forecasting for Future Public Health Interventions
Currently, more than 500 arboviruses are recognized globally, 150 of which are known human pathogens, and others pose unknown human disease risks (Young 2018). Such arboviral diseases can cost governments and societies billions of dollars globally through decreased tourism, impaired workforce capacity, and aggregate resident healthcare and disability costs (Thompson et al. 2020). However, even with their high prevalence and clinical and economic impact, there are very few resources allocated to arboviral surveillance, forecasting, and prevention globally; existing initiatives receive little funding from governments and institutions. For example, since 2022, approximately 10% (range 6% to 11%) of the US Centers for Disease Control and Prevention’s Emerging and Zoonotic Infectious Diseases’ budget has been allocated for arboviral prevention, response, and research (Walensky 2022, 2024, Centers for Disease Control and Prevention Justification of Estimates for Appropriation Committees Fiscal Year 2025). For fiscal year 2026, the US President requested $87,817,000 for vector-borne diseases (Centers for Disease Control and Prevention Justification of Appropriation Estimates for Appropriations Committees Fiscal Year 2026). The extent to which this funding goes toward modeling is unknown; yet, the Vector-borne Disease Division does support the annual maintenance of ArboNet and TickNet, through which local jurisdictions report their human, veterinary, and vector cases of select pathogens, creating a centralized database for potential modeling efforts (Centers for Disease Control and Prevention Justification of Appropriation Estimates for Appropriations Committees Fiscal Year 2026). Similarly, from January 2020 to July 2025, the National Institutes of Health’s RePORTER database lists 102 grant-funded records totaling $9,911,490.
Another likely contributor to the underfunding of arboviral forecast modeling is its reliance on high-quality ecological studies and computational biology, which are often neglected because their niche focuses do not align with the missions of federal funding agencies. For example, the US National Institutes of Health prioritizes medical research, the National Science Foundation prioritizes non-health-related scientific discovery, and the Centers for Disease Control and Prevention prioritizes public health programs. However, arboviral prediction modeling is not purely medical (it requires vector ecology inputs), not purely entomological (humans serve as hosts), and not directly applied public health work (theoretical advancements are critical in early model development), leaving this work misaligned with the priorities of each of these national funding agencies. Promotion of this work could be facilitated by more funding for cross-agency infectious disease modeling specifically (eg the Ecology and Evolution of Infectious Diseases grant mechanism jointly sponsored by the US National Institutes of Health and National Science Foundation) or collaborative case competitions.
As research and public health funding wanes for arboviral prediction modeling, demand for a competent workforce lags, yielding an insidiously widening gap. The US Centers for Disease Control and Prevention recently initiated workforce development programs to mitigate medical entomology-related workforce shortages, including shortages of arboviral modelers; however, these programs are congressionally appropriated, leaving their funding subject to current political sentiment. Historically, funding for these programs, such as the Centers of Excellence in Vector-borne Diseases, Regional Training and Evaluation Centers, and Public Health Entomology for All, has had bipartisan support; however, shifting public health priorities often render the future of these workforce training programs uncertain (Senator King 2017, Senator Collins 2019, Dye-Braumuller et al. 2022).
The underfunding of such programs is also a narrative on health equity, as arboviruses disproportionately affect people living in poverty. Studies have shown populations with low socioeconomic status markers such as lower education, lower income, and poor housing are at increased arboviral infection risk (Power et al. 2022). Locations with rapidly increasing urbanization and human migration, typically associated with lower socioeconomic factors, are at the highest arboviral disease transmission risk (Tajudeen et al. 2021). These inequities in arbovirus infections highlight the need for better surveillance and forecasting methods, so stakeholders, governments, and institutions can recognize the need to allocate funds for at-risk, poverty-stricken areas. However, such forecasting methods are often expensive, requiring advanced computers and highly trained mathematicians or statisticians to run many models. Alternatives to such computationally expensive model approximations, such as the INLA method, implemented by the freely available, user-friendly R-INLA package, have great potential to make disease forecasting more accessible to areas where resources are limited (Lindgren and Rue 2015). However, the development of these methods alone cannot resolve the issue of under-surveillance and arbovirus forecasting in impoverished areas. Thus, there is a great need for better data availability in such areas to supplement these forecasting models to explore the relationships between arboviral infection and economic, population, and environmental factors. Increased knowledge of both the context of infectious disease and the application of mathematical and statistical models can bridge the gap between advanced statisticians and public health officials and thus build the foundation of resolving the problem of inequity in the exploration of arboviruses globally.
References
- Adams DP. 2021. Foundations of infectious disease: a public health perspective. Jones & Bartlett Publishers.
- Akter R, Hu W, Gatton M, et al. 2021. Climate variability, socio-ecological factors and dengue transmission in tropical Queensland, Australia: a Bayesian spatial analysis. Environ. Res. 195:110285. doi:10.1016/j.envres.2020.110285. PMID: 33027631.
- Angelou A, Kioutsioukis I, Stilianakis NI. 2021. A climate-dependent spatial epidemiological model for the transmission risk of West Nile virus at local scale. One Health 13:100330. doi:10.1016/j.onehlt.2021.100330. PMID: 34632040. PMCID: PMC8493582.
- Baum J, Pasvol G, Carter R. 2020. The R0 journey: from 1950s malaria to COVID-19. Nature 582:488–488. doi:10.1038/d41586-020-01882-9.
- Bernoulli D. 1766. Essai d’une nouvelle analyse de la mortalité causée par la petite vérole et des avantages de l’inoculation pour la prévenir. Mem. Math. & Phys. de l’Acad. Roy. Sci. Hist. de l’Acad. Paris 1.
- Bewick S, Agusto F, Calabrese JM, et al. 2016. Epidemiology of La Crosse Virus Emergence, Appalachia Region, United States. Emerg. Infect. Dis. 22:1921–1929. doi:10.3201/eid2211.160308. PMID: 27767009. PMCID: PMC5088026.
- Butterworth MK, Morin CW, Comrie AC. 2017. An analysis of the potential impact of climate change on dengue transmission in the Southeastern United States. Environ. Health Perspect. 125:579–585. doi:10.1289/ehp218. PMID: 27713106. PMCID: PMC5381975.
- Centers for Disease Control and Prevention Justification of Appropriation Estimates for Appropriations Committees Fiscal Year 2026. Atlanta, GA: Centers for Disease Control and Prevention.
