Time-series modeling of epidemics in complex populations: Detecting changes in incidence volatility over time
Rachael Aber, Yanming Di, Benjamin D. Dalziel

TL;DR
This paper introduces a statistical method to detect changes in the volatility of disease incidence over time, revealing synchronized shifts in transmission patterns during the COVID-19 pandemic.
Contribution
A novel statistical framework to quantify and detect rapid shifts in incidence dispersion in time series data.
Findings
Dispersion in case counts increased during major surges like the 2022 peak, not explained by overall incidence trends.
Highly overdispersed patterns became more frequent as the pandemic progressed, suggesting transmission heterogeneity or reporting changes.
Shifts in dispersion were synchronized across U.S. counties, indicating potential common drivers of transmission.
Abstract
Trends in infectious disease incidence provide important information about epidemic dynamics and prospects for control. Higher-frequency variation around incidence trends can shed light on the processes driving epidemics in complex populations, as transmission heterogeneity, shifting landscapes of susceptibility, and fluctuations in reporting can impact the volatility of observed case counts. However, measures of temporal volatility in incidence, and how volatility changes over time, are often overlooked in population-level analyses of incidence data, which typically focus on moving averages. Here we present a statistical framework to quantify temporal changes in incidence dispersion and to detect rapid shifts in the dispersion parameter, which may signal new epidemic phases. We apply the method to COVID-19 incidence data in 144 United States (US) counties from January 1st, 2020 to…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig 1
Fig 2
Fig 3- —http://dx.doi.org/10.13039/100008227Achievement Rewards for College Scientists Foundation
- —http://dx.doi.org/10.13039/100015485Association for Computing Machinery
- —http://dx.doi.org/10.13039/100000001National Science Foundation
- —http://dx.doi.org/10.13039/100000008David and Lucile Packard Foundation
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCOVID-19 epidemiological studies · Data-Driven Disease Surveillance · Influenza Virus Research Studies
Introduction
Time series of infectious disease incidence appear, to varying degrees, “noisy”, showing higher frequency fluctuations (e.g., day-to-day or week-to-week fluctuations) around trends at the broader temporal ranges typical for epidemic curves (e.g., months or years). Short-term fluctuations in incidence time series are caused in part by variable reporting, but may also reflect the population-level impacts of transmission heterogeneity and changes in the landscape of susceptibility [1–8]. Metrics of variability in incidence time series may therefore carry information regarding underlying drivers of transmission, and offer a relatively unexplored avenue for understanding epidemic dynamics.
Contact tracing data has revealed temporal changes in the variability of individual reproductive numbers, quantified by shifts in the dispersion parameter of the offspring distribution in branching process models [7,8]. Similar evidence has been recovered through statistical reconstruction of transmission networks, indicating temporal trends in the level of dispersion at different phases of an epidemic [3]. However, the scaling from individual-level transmission heterogeneity to population-level epidemic dynamics is not fully understood. In addition, traditional contact tracing is very resource-intensive, and although new approaches using digital technologies may improve its speed and scalability [9], it would be helpful to have complementary population-level analyses that can estimate heterogeneity using incidence data, which is more widely available. The importance of considering population-level variability and its relationship to individual-level variability is further highlighted by the finding that a combination of individual-based and population-based strategies were required for SARS-CoV-2 control during the early phases of the pandemic in China [6]. An important challenge therefore is to develop methods that can detect changes in population-level variability in incidence time series, and to interpret these changes in terms of underlying transmission processes.
Emerging statistical techniques are leveraging variability in epidemic time series to enhance understanding of disease dynamics at the population level. For example, a recently-developed method uses population-level incidence data to estimate the dispersion parameter of the offspring distribution, which quantifies heterogeneity in secondary cases generated by an infected individual [5]. It is also possible to estimate the dispersion parameter of the offspring distribution from the distribution of the final size of a series of localized outbreaks [10]. Clustering of cases has also been estimated directly from incidence data [11]. Another important application links variability in incidence to epidemic phases; for example, changes in the mean and interannual coefficient of variation of measles incidence have been used to identify a country’s position on the path to elimination, providing insights into vaccination strategies and epidemiological dynamics [12]. Analysis of the shape of epidemic curves for influenza in cities may identify contexts where incidence is focused more intensely (proportionally more infections in a smaller span of time) with implications for the sensitivity of cities to climate forcing and for surge capacity in the health system [4,13].
What drives incidence dispersion and how does it relate to the underlying branching process of transmission, and to observations of cases? Under a wide range of configurations for a branching process model of contagion spread, the number of infected individuals It at time t will have a negative binomial distribution [14,15], , where is the expectation for It and is the dispersion parameter. The variance is related to the mean and dispersion parameters by , so smaller values of the dispersion parameter correspond to increasing amounts of dispersion, which increase the amounts by which the variance in realized number infected It exceeds the expected value, . Conversely, the distribution of It tends to a Poisson distribution (where the variance equals mean) as becomes large. The negative binomial distribution may also accurately model a time series if there is a changing process mean within a time step: for example, if the mean of a Poisson distribution itself follows a gamma distribution, the resulting distribution is negative binomial. Negative binomial regression (in contrast to Poisson regression) can account for unobserved heterogeneity, time dependence in the rate of a process and contagion within a time step that all lead to overdispersion [16].
An interpretation of the dispersion parameter for a time series model of counts is that events are times as “crowded” in time relative to a Poisson process with the same mean [17] (see derivation in S1 Text). For example, corresponds to a situation where the average number of infections in the same time step as a randomly selected case will exceed the Poisson expectation by a factor of two. In a simple example relevant to surge capacity in healthcare systems, implies that a random infectious individual visiting the emergency department at a hospital would find it on average to be twice as crowded with other infectious individuals (infected by the same pathogen) as expected for a Poisson process with the same incidence rate.
In a sufficiently large host population, and when the infectious pathogen can be assumed to spread in nonoverlapping generations, the number of infections each generation is often modeled as
where time-varying reproductive number Rt gives the expected number of secondary infections acquired from an infected individual at time t, and the generation time is set to 1 without loss of generality [14,18]. Setting arises from the assumption that individuals who acquire the infection at time t form independent lineages with identically distributed local rate parameters. In applications, this model for becomes where Ct represents reported cases and ρ_t_ the reporting rate, which relates reported cases to the true number of infections as . However, this requires that susceptible depletion in one lineage does not affect another, that transmission rates are equal across lineages, and that reporting rates do not vary across lineages.
In practice, these assumptions will not often hold, and our aim in this paper is to develop, test and apply an alternative approach which produces data-driven estimates of , including identifying timepoints when is changing rapidly, which may help to reveal the impacts of heterogeneity in transmission, susceptibility, and reporting.
Methods
By definition incidence volatility is fast relative to broadscale epidemic dynamics. Consequently, in order to estimate incidence volatility we first modeled incidence at broad spatiotemporal scales using natural splines [19]. To allow for diverse shapes in the broadscale epidemic dynamics, spline modeling was conducted within a moving window such that for each half of the window
where is the mean of the negative binomial process at time t, N represents population size, hj(t) are basis functions, the degrees of freedom is equal to the number of knots k for the natural spline, J = k + d + 1, where d is the degree of the polynomial, and are fitted parameters. The window has half-width , centered at t, i.e., extending from to . The degrees of freedom (number of knots) to be used for the splines, and the width of the moving window will depend on the application. Explanation of the specific choices we used for our application to COVID-19 cases in US counties is provided below.
Modeling the underlying epidemic dynamics based on log-transformed incidence allows us to address the statistical effects of population size on the relationship between the mean and variance in count data, which would otherwise confound our analysis. Specifically, since population size influences the mean and variance of case count data, it impacts dispersion in different-sized populations that are otherwise identical. Accordingly, population size appears as an offset in our model of broad-scale incidence changes. That is,
The form of the probability mass function (PMF) for infections at time step t is:
where is estimated via the linear predictor outlined above.
We estimate from observed incidence data using an iteratively reweighted least-squares (IRLS) procedure for mean estimation, combined with the optimize function in R, which uses a combination of golden section search and successive parabolic interpolation, to compute . Specifically, within each time window, the spline model with an offset term was used to estimate a series of values for to via IRLS, as implemented in the NBPSeq R package [20]. A single value of was then calculated for the entire time window by maximizing the likelihood function, which is based on the negative binomial probability mass function defined above.
In addition to fitting the model at each time step, we developed a likelihood-ratio test (LRT) to test the hypothesis that has changed at each time step. This test involves fitting and comparing two models: a null model (no change) and a two-part model (with a change). For the null model, a single value was fitted for the entire time window. For the -change model, separate values were fitted for the left (from to t) and right (from t to ) halves of the time window.
Very large values correspond to processes that are operationally identical to a Poisson process. Accordingly, the test does not produce a p-value if any of the three estimates exceed a user-specified threshold. In the application below, we set this threshold at 10^3^, meaning that estimates with temporal crowding within of that expected for a Poisson process were considered effectively Poisson.
Similarly, values of very close to 0 focus all of the mass of the PMF on 0, representing a scenario where the probability of observing any infections approaches zero. As with the Poisson-like tolerance described in the previous paragraph, our algorithm does not produce a p-value if any of the three estimates are below a user-specified threshold. This threshold will depend on the presence of contiguous sections of the time series being analyzed during which no cases are observed. In the application below, we set this threshold to 10^−3^, because values below this level correspond to 0 frequencies that greatly exceed those in the data.
With both upper and lower thresholds—corresponding to Poisson-like and zero tolerances, respectively—maximum likelihood estimates (MLEs) of beyond these thresholds exhibited unbounded behavior. When exceeded the upper threshold, corresponding to processes operationally identical to a Poisson process, the MLE tended to grow arbitrarily large, with the likelihood function reaching its maximum at the upper boundary of the calculated domain. Conversely, when fell below the lower threshold, representing extreme overdispersion with probability mass concentrated near zero, the MLE approached zero, and the likelihood function peaked at the lower boundary of the domain. This behavior reflects the inability of the model to reliably estimate when it lies outside the specified thresholds (Fig 1C).
Detecting dispersion changes in case count time series.a: Weekly incidence of COVID-19 in the United States, with time measured in weeks since January 4, 2020, showing an example of a randomly-selected 16-week period used as an incidence trend in simulation-based validation of the LRT test (red). b: Cases in one county (Douglas County, Nebraska) over the sample time period with estimated incidence trend (red) and estimated dispersion values on either side of the midpoint. c: Estimated θ1 versus true θ1 in simulation studies combining a randomly-selected section of the national incidence curve with a random population size and set of dispersion values. Estimated values outside of tolerance plotted in purple (close to Poisson) and blue (close to collapsing to zero), and a line with an intercept of zero and a slope of one plotted in red. d: Statistical power of the LRT test with smooth function (red line) and a 99.7% confidence interval for predicted p (red shading).
Application to simulated data
We evaluated the robustness of our framework to a range of population sizes, magnitudes of dispersion changes, and shapes of underlying incidence trends by generating 2,000 simulated epidemic curves with known parameters. Epidemic trends were modeled as smoothed incidence series derived from 16-week sections randomly selected from US COVID-19 data (described below), scaled to reflect different population sizes ranging from 10^3^ to 10^7^. For each simulated trajectory, dispersion parameters ( and ) were assigned to the two halves of the selected 16-week window, and case counts were simulated using a negative binomial distribution, where the mean ( ) was based on the smoothed incidence trend across the 144 counties scaled by the population size. The values of and were drawn from a uniform distribution spanning 10^−2^ to 10^2^, with 10% of simulations set to have no change in dispersion ( ). Extremely large differences in dispersion (absolute log-ratio of to >3) were capped by setting .
Application to empirical data
We applied our framework to COVID-19 case data for the United States at the administrative level of counties, compiled by The New York Times, based on reports from state and local health agencies between January 1, 2020, and March 23, 2023 [21], and using county population sizes estimated for 2021 from the United States Census Bureau [22]. Cumulative cases for the largest three counties in each state (the 144 counties used in the analysis) were converted to weekly counts by keeping the last observation from each week and differencing to compute new cases. Occasionally, reported cumulative case counts were not monotonically increasing due to corrections posted by local agencies as they resolved incoming data. As a result, approximately 0.24% of estimated new case counts across all counties in the dataset were negative and these were set to zero. For each county, we analyzed overlapping 16-week windows, shifting one week (i.e., one timestep) at a time. Within each window, the framework estimated the dispersion parameter ( ) using a natural spline with three degrees of freedom for each half of the window to model the broad-scale trend in incidence. Outputs included estimated dispersion parameters ( , , for the left and right halves of each window, and for the entire window), likelihood ratio test statistics, p-values for changes in dispersion at the midpoint of the window, and flags for boundary conditions such as failure to reject Poisson-like dispersion or collapse to extreme overdispersion.
Selection of window width and spline degrees of freedom
Choices of window width and degrees of freedom for the natural splines were made by comparing the accuracy of estimates using simulated data over various values of both window width and degrees of freedom. Using 16-week windows and three degrees of freedom for the splines, our method did not systematically overestimate or underestimate true dispersion in this application to COVID-19 weekly case count data.
Results
Simulations indicate that the LRT framework accurately detects changes in dispersion, with p-values converging to 0.5 as the effect size approaches 0, reflecting the uniform distribution of p-values under the null hypothesis, and decreasing toward zero as the effect size increases (Fig 1D). The framework is also robust to the range of population sizes present in the empirical data—county populations ranged from approximately 48 thousand to 9.9 million, and we tested the framework on simulated populations between 10 thousand and 10 million. Across this range, the method produced accurate estimates for within (Fig 1C), encompassing all operationally relevant values for COVID-19 incidence data and many other infectious diseases. Lower values would concentrate the probability mass function (PMF) for cases almost entirely on 0, while higher values effectively correspond to a Poisson distribution.
Applying the method to COVID-19 cases in US counties enabled investigation of changes in dispersion (Fig 1B) around the overall epidemic trajectory (Fig 1A). Periods of increased case count variability (for example, around the start of 2023 in Fig 2A) corresponded with decreases in (see the corresponding time period in Fig 2B), indicating that dispersion was dynamic. Changes in dispersion exhibited both expected and unexpected patterns of variation relative to standard theory (S1 Fig). In some instances, varied inversely with incidence It, consistent with standard epidemic theory, while in other periods, deviations from this expectation occurred, potentially signaling shifts in underlying transmission dynamics.
Dispersion analysis of weekly COVID-19 case data for Jefferson County, Alabama.Results for all counties are shown in Fig 3. a: Weekly reported COVID-19 incidence. b: Estimated dispersion parameter (θ^t) over time. c: Comparison of estimated dispersion (gray) with predicted values from the standard model θt+1=Ct/ρt, where Ct is reported cases and ρt the reporting rate at time t. Predictions are shown for fixed ρt=0.1 (black) and ρt=0.9 (blue), chosen to encompass the range of θ expected under variable ρ. d: Likelihood ratio test (LRT) statistic over time. Statistically significant changes in dispersion (red) correspond to p-values below the Bonferroni-corrected 5% threshold of a chi-square distribution with one degree of freedom.
Notably, significant changes in the dispersion parameter were observed during major epidemic transitions. For example, during the beginning of 2022, and at the end of the time series, when the pandemic was transitioning toward endemicity as the landscape of susceptibility was evolving [23]. The landscape of susceptibility was evolving as a larger proportion of cases involved reinfections. These findings underscore the complex behavior of the dispersion parameter, which not only varied with changes in case count regimes but also revealed departures from the model expectations described by Eq (1), which are consistent with changes in the underlying drivers of transmission.
As mentioned in the introduction, reporting rate and case count can be used to arrive at used in the model of incidence in Eq 1. Fig 2C displays two time series with differing assumed reporting rates, alongside the time series estimated directly from the case counts using our method. When the reporting rate is assumed to be higher at a time point, a lower would be used in the Eq 1 model.
Dispersion increased markedly around the peak in incidence during the major 2022 wave, from late December 2021 to early February 2022 (Fig 3A and 3B display averages over counties and Fig 3C and 3D display results across counties). This is in strong contrast to standard epidemic theory which predicts that dispersion should decrease as incidence rises, and suggests that the assumptions underlying the model in Eq 1 do not hold during these time periods. One potential explanation for this is that transmission rates may not be independent across lineages. For example, one high-transmission lineage may spur another as the pathogen spreads through more complex landscapes of susceptibility and transmission risk during later pandemic waves. A high concentration of low p-values around peak incidence (Fig 3F) corroborates widespread changes in across counties, reinforcing the statistical significance of this pattern. While these p-values should be corrected for multiple testing if used for inference rather than visualization, the overall trend suggests a systematic departure from theoretical expectations. Highly overdispersed patterns were also observed more frequently later in the time series (Fig 3D), pointing to increasing heterogeneity in transmission, susceptibility, and reporting during the later phases of the pandemic. In both the 2022 wave and later in the pandemic, localized surges indicated by higher dispersion may have played a larger role in pandemic dynamics than expected, including potentially placing increased demand for surge capacity on hospitals and testing centers.
Incidence and dispersion between Jan 4, 2020 and March 18, 2023, in large counties in the US.a: Mean COVID-19 cases of the 144 US counties over time (total cases over the counties divided by total population over the counties multiplied by 1,000). b: Mean log10θt of the 144 US counties over time. NAs produced by the method (see text) were removed from the average. c: log10(casest) over time for each of the 144 counties, where county is the y-axis. d: log10θt over time for each of the 144 counties. e: Expected value of log10θt under the null model, assuming a reporting rate of 0.5 for each county. f: LRT p-values over time for each county.
Discussion
Our method forms part of a larger interest in investigating variability in infections as an important attribute of epidemic time series using novel metrics. For instance, burst-tree decomposition of time series has also facilitated computation of a burst-size distribution for a series given a specified time window [24], allowing comparison of variability within one location over time.
Spatial variation in superspreading potential has been investigated through risk maps of superspreading environments [25], and future work could investigate the correspondence between dispersion in case count time series, as quantified here, and indicators of a high risk of superspreading, with the potential to further elucidate drivers of transmission risk across scales, and more finely resolve landscapes of susceptibility. Additionally, as population-wide disease control efforts may be less effective than those which are focused to individuals in high-transmission contexts [1], identifying candidate time periods when transmission heterogeneity is high may catalyze the development of more effective control strategies, particularly those that connect vulnerable populations with resources at critical times.
The finding that dispersion increased rather than decreased during the 2022 surge challenges theoretical expectations and suggests that fundamental assumptions about the scaling of transmission dynamics may require reevaluation. One hypothesis is that transmission heterogeneity could play a role in driving large surges, amplifying incidence beyond what homogeneous models predict. In particular, our finding of departures from theoretical expectations of case count dispersion indicates violations of the assumptions underlying the commonly used time-series epidemic model shown in Eq 1: lineages are likely dependent—that is, one high-transmission lineage may spur another. It was found that stratification of contacts across multiple dimensions prevents underestimation of R0 [26] and evolving contact structures have also been discussed in the literature [6], both of which could result in the observed departures. Stratification of the transmission network and its evolution is synonymous with individual-level heterogeneity in transmission and its evolution, which scales up to affect population-level dynamics [1], so variability in epidemic trajectories at the population level may provide information about individual-level variability in the transmission process (transmission heterogeneity). More specifically, it was also recently demonstrated that increasing the number of strata over which populations are organized increases R0, or doesn’t affect R0 if the additional stratification is “random” (random mixing hypothesis) [26].
Future work could investigate how bursts of highly clustered transmission events could generate feedback that accelerates epidemic spread, which, if true, could refine predictive models of contagion dynamics in complex populations. For instance, contact-tracing effort could be directed towards the candidate time periods, both to perform confirmatory analyses of transmission heterogeneity, and to increase the availability of an informative public health resource around surges.
A primary limitation of the method presented here is that careful consideration must be given to the choice of parameter values based on the user’s specific application. Specifically, the choice of window half-width must be informed by simulation of data and estimation of known . Similarly, the choice of spline degrees of freedom must optimize the accuracy of estimates. In our case, we are additionally limited by course (weekly) data.
Our results highlight some of the limitations of theoretical assumptions about the role of incidence dispersion in epidemic dynamics, such as during peak incidence periods of the COVID-19 pandemic. Models that incorporate time-varying incidence dispersion can be used to quantify the role of transmission heterogeneity in epidemic dynamics in complex populations, and can help practitioners to identify candidate time periods and locations for confirmatory analysis of superspreading, as well as for public health intervention. Our findings suggest the importance of considering nonindependent transmission rates across lineages when modeling epidemics in complex populations.
Supporting information
S1 TextDerivation of the relationship between the dispersion parameter and the mean crowding parameter.(PDF)
S1 FigSpearman’s correlation between the dispersion parameter and COVID-19 cases.A 32-week sliding window is used to compute each correlation, and 1,000 bootstrap replicates in which county labels are permuted are used to compute the 95% confidence interval.(PDF)
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Lloyd-Smith JO, Schreiber SJ, Kopp PE, Getz WM. Superspreading and the effect of individual variation on disease emergence. Nature. 2005;438(7066):355–9. doi: 10.1038/nature 04153 16292310 PMC 7094981 · doi ↗ · pubmed ↗
- 2Lloyd-Smith JO. Maximum likelihood estimation of the negative binomial dispersion parameter for highly overdispersed data, with applications to infectious diseases. P Lo S One. 2007;2(2):e 180. doi: 10.1371/journal.pone.0000180 17299582 PMC 1791715 · doi ↗ · pubmed ↗
- 3Lau MSY, Dalziel BD, Funk S, Mc Clelland A, Tiffany A, Riley S, et al. Spatial and temporal dynamics of superspreading events in the 2014-2015 West Africa Ebola epidemic. Proc Natl Acad Sci U S A. 2017;114(9):2337–42. doi: 10.1073/pnas.1614595114 28193880 PMC 5338479 · doi ↗ · pubmed ↗
- 4Dalziel BD, Kissler S, Gog JR, Viboud C, Bjørnstad ON, Metcalf CJE, et al. Urbanization and humidity shape the intensity of influenza epidemics in U.S. cities. Science. 2018;362(6410):75–9. doi: 10.1126/science.aat 6030 30287659 PMC 6510303 · doi ↗ · pubmed ↗
- 5Kirkegaard JB, Sneppen K. Superspreading quantified from bursty epidemic trajectories. Sci Rep. 2021;11(1):24124. doi: 10.1038/s 41598-021-03126-w 34916534 PMC 8677763 · doi ↗ · pubmed ↗
- 6Sun K, Wang W, Gao L, Wang Y, Luo K, Ren L, et al. Transmission heterogeneities, kinetics, and controllability of SARS-Co V-2. Science. 2021;371(6526):eabe 2424. doi: 10.1126/science.abe 2424 33234698 PMC 7857413 · doi ↗ · pubmed ↗
- 7Guo Z, Zhao S, Lee SS, Hung CT, Wong NS, Chow TY, et al. A statistical framework for tracking the time-varying superspreading potential of COVID-19 epidemic. Epidemics. 2023;42:100670. doi: 10.1016/j.epidem.2023.100670 36709540 PMC 9872564 · doi ↗ · pubmed ↗
- 8Ko YK, Furuse Y, Otani K, Yamauchi M, Ninomiya K, Saito M, et al. Time-varying overdispersion of SARS-Co V-2 transmission during the periods when different variants of concern were circulating in Japan. Sci Rep. 2023;13(1):13230. doi: 10.1038/s 41598-023-38007-x 37580339 PMC 10425347 · doi ↗ · pubmed ↗
