Model-free prognostication of non-linear time series

Xiaoyong Wu; Shesh N. Rai; Georg F. Weber

PMC · DOI:10.1371/journal.pone.0341777·February 2, 2026

Model-free prognostication of non-linear time series

Xiaoyong Wu, Shesh N. Rai, Georg F. Weber

PDF

Open Access

TL;DR

This paper introduces a model-free method for forecasting non-linear time series, such as infectious disease spread, using machine learning and time-lagged analysis.

Contribution

The novel contribution is a model-independent approach for short-term forecasting of non-linear time series using machine learning and time-lagged features.

Findings

01

Machine learning using correlation coefficients provided excellent predictions of progression using 80% of the time series as training data.

02

Feature-space plots with dynamic calibration revealed sharp spikes in the maximum local Lyapunov exponent during infectious spread peaks.

03

Average mutual information over time lags and wave lengths anticipated peaks in new infections.

Abstract

The COVID-19 pandemic has highlighted the importance of studying the course of infectious progression. Similar needs exist for time series of other origins. While models are commonly devised and fitted to the observed data, we recently demonstrated the feasibility to directly evaluate the noisy non-linear time series that characterize the occurrence. However, for practical utility, analytics alone has limited value. The requirement of forecasting – at least in the short term – needs to be met. We initially utilized normalized new infections per day (7-day moving average for cases per million inhabitants) from Our World in Data. We then validated our method in unrelated non-linear time series of stock markets and blowfly populations. We studied a novel model-independent time series approach, time lagged analyses, and feature-space plots incorporating the time-lagged data. 1) Machine…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases2

COVID-19 infections

Figures7

Click any figure to enlarge with its caption.

Fig 1 — Machine learning for model-free prognostication.The time series of normalized new infections in 4 countries were utilized in a machine learning algorithm based on autocorrelation. The black and red color lines represent the true and anticipated numbers of new cases. All values are on the scale of log plus 0.05. The sizes of the training set and the testing set were chosen to be about 80% and 20%. The tables underneath the graphs display three distinct error calculations. MAE = mean absolute error, RMSE = root mean square error, MAPE = mean absolute percentage error. Top left) Prediction of peaks for the data Germany (n = 819) based on the training set (n = 634) and the testing set (n = 185). Top right) prediction of peaks for the data Great Britain (n = 814) based on the training set (n = 592) and the testing set (n = 222). Bottom left) prediction of peaks for the data Australia (n = 819) based on the training set (n = 689) and the testing set (n = 130). Bottom right) Prediction of peaks for the data USA (n = 802) based on the training set (n = 635) and the testing set (n = 167).

Fig 2 — Validation of the machine learning for model-free prognostication.The figure provides the comparative performance results for the 5 methods. The MCML method had the best performance of all forecasting methods and generated forecasting results with the lowest MAE, MAPE, and RMSE in each case. Top left) Comparisons of time-series forecasting using our method (maximum correlation machine learning (MCML) with the autoregressive integrated moving average (ARIMA), trigonometric seasonality, Box-Cox transformation, ARMA errors, trend, and seasonal components (TBATS), threshold autoregressive (TAR), Prophet forecasting model (PROPHET) in the Germany dataset (n = 819) based on the training set (n = 634) and the testing set (n = 185). Bottom left) Comparisons of our method (MCML) to the conventional time series methods for the Germany dataset. Top Right) Comparisons of time-series forecasting using our method (MCML) with the autoregressive integrated moving average (ARIMA), trigonometric seasonality, Box-Cox transformation, ARMA errors, trend, and seasonal components (TBATS), threshold autoregressive (TAR), Prophet forecasting model (PROPHET) in the Australia dataset (n = 819) based on the training set (n = 689) and the testing set (n = 130). Bottom right) Comparisons of our method (MCML) to canonical time series methods for the Australia dataset.

Fig 3 — Analyses of time series from various sources.Non-linear time series data were obtained from financial markets and from entomology observations. Top and middle) Stock market fluctuations represent classical non-linear time series. We retrieved daily data for “INDEX_US_DOW JONES GLOBAL_DJIA” (n = 1521) and applied machine learning based on the training set (n = 1400) and the testing set (n = 121), where the black and red color line represent the true and predicted values of new cases, respectively. The y-axis values are unitless. Top) Prediction of peaks for “Open” Middle) Prediction of peaks for “Close”. Bottom) The Nicholson blowfly experiments were conducted in the 1950s with the intent of learning more about a sheep pest, the blowfly [31,32]. The data involve a system that is nonlinear, has time lags and might be described as non-stationary. Prediction for “eggs” in the data “blowfly97I” (n = 361) using machine learning based on the training set (n = 289) and the testing set (n = 72). Table) The table underneath the graphs display three distinct error calculations. MAE = mean absolute error, RMSE = root mean square error, MAPE = mean absolute percentage error.

Fig 4 — Feature-space plots for various sliding window durations and time lags.For the South Africa data, the three readouts, ac, ami, normalized new cases (nc) are graphed pairwise against each other. A) increasing sliding window durations over 3 time lags. B) increasing time lags over 3 sliding window durations.

Fig 5 — Feature-space plots and local Lyapunov exponents.A,B) The data for South Africa were analyzed. The panels on the left display the feature-space plots for the raw data. In the middle panels, ac + 1 and ami were scaled up by multiplication with the half-maximal value of new cases during the sliding window durations. In the right panels, the values for the new cases were divided by their maximum during the observation window. A) Feature-space plots. Shown are the pairwise feature-space plots for average mutual information (ami), autocorrelation (ac), and new cases (7-day moving average per million inhabitants) (nc), comprising string lengths of 150 days and time lags of 15 days. B) Lyapunov exponents over time. In the upper panel, the individual Lyapunov characteristic exponents are shown in comparison to the suitably scaled new cases over time, for each pair of readouts (i.e., pairwise among nc = new cases, ac = autocorrelation, ami = average mutual information). The red trace indicates the normalized new cases per day. The bottom panel displays the maximum Lyapunov exponents (MLE) over time (blue line) in comparison to the normalized new cases per day (red line).

Fig 6 — Return plot scaling and local Lyapunov exponents.A) Lyapunov exponents over time in several countries. For the analysis, ac + 1 and ami were scaled up by multiplication with the half-maximal value of new cases during the sliding window durations. The string lengths are 150 days and time lags are 15 days. MLE = maximum Lyapunov exponent. B) Scaling of feature-space plots and maximum Lyapunov exponents. All univariate wavelet analyses are based on sliding window durations of 150 days and time lags of 15 days. Each row represents a country. The left column displays unscaled data, the middle column shows the scaling up of ac and ami (as in A)), the right column displays the added step of an arbitrary scale adjustment. This step entails (consecutive to the described prior steps) in Egypt ami*2, in South Africa nc*0.75, in Germany ami*1.7, in Sweden ami*2, in Brazil ac*2.2.

Fig 7 — Analysis of complex time series.Conceptualization of alternative approaches to the study of non-linear data (the graph represents a time series measured in any applicable units). The top row (blue text) represents the model building approach, which makes assumptions, expresses them in sets of differential equations, and then tests their fit to existing measured data. The bottom row (green text) depicts the analysis of noisy complex data, which does not depend on model building but transforms the data for gaining insight. Strategies for cross-validation could improve the ability to make predictions (red text).

Equations19

Funding2

—NIH
—University of Cincinnati

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCOVID-19 epidemiological studies · Gaussian Processes and Bayesian Inference · Ecosystem dynamics and resilience

Full text

Introduction

The spread of infectious diseases is determined by pathogen characteristics (contagion, virulence), host features (immunity, genetic variations, preexisting conditions), and community-directed countermeasures (social distancing, treatments, vaccines). From their exchanges, a complex pattern arises with spikes and troughs in new cases over time, resulting from the competition between factors that facilitate infectious progression and those that ameliorate it.

Diverse strategies have been applied to explain and understand this complexity of infectious propagation. The dissemination of COVID-19 has been modeled through networks of compartments [1,2], cellular automata [3,4], machine learning [5–7], artificial intelligence [8], and others. Commonly, those approaches build models on basic postulations, express them in sets of differential equations, and then map their results to the actual data. The subsequent match is as good as the initial (idealizing) premises and hypotheses permit.

The application of models, based on groups of assumptions and expressed in sets of differential equations, strives to enable prognostication, even over a somewhat extended range. In this time series forecasting, the choice of a model plays a crucial role in predicting future values based on previously observed values. Some of these assumptions are frequently applied. A standard model (S-I-R) divides the population into people who are susceptible to the disease (S), people who are infected by the disease and can spread it to others (I), and people who have recovered or died from the disease (R). S-I-R algorithms seek to detect changes in group size as people move from one group to another. They have been developed and modified in many ways [9,10]. Additional mathematical models use the basic reproduction number (R_0_), which represents the number of previously unexposed individuals who get infected by a new disease carrier. Those others then each go forth to spread it to more people at a similar frequency, and so on. However, all models have idealizations, and their predictions fail when disregarded (“minor”) contributing factors accrue to exert substantial influence.

Likewise, statistical forecasting approaches are fundamentally model-based methods and rely on the assumptions in a specific, structured statistical framework with multiple components (e.g., seasonality, trend) to capture underlying patterns in the time-series data. The autoregressive integrated moving average (ARIMA) models were applied to COVID-19 for predictions of epidemiological trends [11–13], prevalence and incidence [14], the spread of the virus and pandemic life cycle [15], and cumulative confirmed cases [16]. The ARIMA and Prophet models were used to predict a global omicron pandemic [17] and to forecast the number of confirmed and active cases [18]. TBATS (the ARIMA and trigonometric exponential smoothing state space model with Box–Cox transformation, ARMA errors, and trend and seasonal components) was employed to forecast disease severity in hospitalized patients [19], numbers of confirmed cases, deaths, and recoveries [20]. However, time progression series data, especially from epidemics, often do not fit into a specific statistical framework due to inherent complexities.

It is noteworthy that, in various events such as infectious progression, even shorter-term predictions may have a lot of value for the management of instabilities. Hence, their implementation is of practical benefit. Non-linear time progression series do not have closed-form solutions, and as such do not permit long-term predictions. In addition to modeling approaches, the analysis of the actually observed noisy and non-linear data is of critical importance [21]. This strategy does not require mechanistic premises or hypotheses at the outset. It seeks to extract patterns and characteristics from the numbers obtained through measurement in the field. We have recently shown the feasibility of acquiring meaningful information from such evaluations to aid decision making in the public health response [22]. Incompletely addressed, however, was the question to what extent predictions are possible within this framework.

We made the basic assumption that most time series, including contagion, do not represent a Markov process (in which the probability of each event depends only on the current one, and not on the prior history), but that the accumulated chronicle exerts a measurable influence on the future course. Further, the responses to any disturbances, which may arise, will be constrained by the attractor of the system. If this is the case, then – at the very least – short-term to mid-term predictions should be feasible from the study of observed non-linear occurrences. Similar considerations apply to other time series, such as stock market fluctuations and population dynamics.

Because time series data, if they are not entirely irregular, can be considered to represent waves, some tools from harmonic analysis are suitable for their description. Mutual information, in particular, provides a general redout for dependencies between variables and to be applicable to statistical questions [23]. Specifically for infectious spread, autocorrelation and average mutual information of lagged time series (wavelets) have previously been identified as very informative readouts for analytics [22]. When their values were plotted together with the rate of new cases in the same graph, local Lyapunov exponents served as descriptors for the relationships among these data sets. We therefore reasoned that correlation/autocorrelation and average mutual information of lagged time series, as well as Lyapunov exponents of their combined evaluation can provide a basis for forecasting without the reliance on prior model creation.

We also focus on utilizing correlation-based calculations, through machine learning and through empirical testing to develop a novel approach to prognostication that is not dependent on model building. The strategy does not assume any specific underlying statistical structure or generative model for the time series data. Our method leverages historical data to make predictions about the future data by identifying patterns and relationships within the collected data. In this regard, it separates itself from prior efforts [11–20].

Methods

Source data

We analyze the new infections per day, as the 7-day moving averages of the rates per million inhabitants. The source data utilized for the present analysis came from Our World in Data (https://ourworldindata.org/coronavirus) and were retrieved until either May 2022 or March 2023. This study utilizes deidentified data from the public domain. No consent was mandated for participation or publication. All methods were carried out in accordance with relevant guidelines and regulations. The experimental protocols did not require approval from an institutional and/or licensing committee.

Univariate wavelet analysis

Wavelet assessment transforms the time series data from the time domain to the frequency domain. It can be applied to a single time series (univariate wavelet analysis). The calculations for wavelet analyses of new infections were done in R in an exploratory manner to assess the presence of periodic structure. In WaveletComp, the null hypothesis, that there is no periodicity in the series, is tested via p-values obtained from simulation, where the model to be simulated can be chosen from a range of options. The algorithm analyzes the frequency structure of uni- or bivariate time series using the Morlet wavelet [24]. As an important evaluation tool to study periodic phenomena in time series, wavelets are particularly useful in the presence of potential frequency changes across time. We conducted an assessment of lagged data (t versus t + 10 through t versus t + 90) for joint periodicity. The string lengths ranged from 100 to 400 days. Average mutual information and autocorrelation were calculated in R.

Average mutual information

Mutual information uses probabilities (the information one data point gives about the other) to assess correlation. When there is a relationship between two variables (as is the case in lagged time series), one contains information about the other [25,26]. The average mutual information (AMI) represents a non-linear correlation function, which indicates how much common information is shared by the measurements at a particular time $[eqn]$ and a later time $[eqn]$ in a time series $[eqn]$ . The AMI for a given time delay $[eqn]$ is defined as

[eqn]

where $[eqn]$ is the probability that $[eqn]$ is in bin $[eqn]$ , $[eqn]$ is the probability that $[eqn]$ is in bin $[eqn]$ , and $[eqn]$ is the joint probability that $[eqn]$ is in bin $[eqn]$ and $[eqn]$ is in bin $[eqn]$ . The sum of absolute AMI values was calculated as

[eqn]

with the range entailing either all covered lags for one wavelength or all covered wave lengths for one lag where t represents specific date. The AMI was calculated with the mutual function R package tseriesChaos [24].

Autocorrelation

A time series $[eqn]$ sometimes has properties, due to which earlier values display some relation to later values. The autocorrelation statistic measures the degree of that affiliation as it refers to linear dependence. The magnitude of its dimensionless number reflects the extent of similarity. The formula for autocorrelation $[eqn]$ at lag $[eqn]$ is the ratio of autocovariance and variance

[eqn]

where $[eqn]$ .

Correlation coefficient

Given data $[eqn]$ and $[eqn]$ , the correlation coefficient can be calculated by

[eqn]

where $[eqn]$ and $[eqn]$ . Correlation coefficient ranges from −1 to +1, with +1 indicating perfect synchrony and −1 reflecting exact mirror images and with 0 indicating an absence of any correlation.

Two-stage model-independent non-linear time series forecasting approach

Suppose that we observe the time series data $[eqn]$ and aim to predict the future time series data $[eqn]$ , where $[eqn]$ is the time series data on the $[eqn]$ day, $[eqn]$ . Since the time series data had the scale of larger values, without loss of generality, let $[eqn]$ , which compresses the scale of larger values, making the data more suitable for statistical analysis. Thus, the aim is equivalent to the prediction of the future time series data $[eqn]$ . For this, we propose a two-stage model-free approach for non-linear time series forecasting. To predict $[eqn]$ , it is enough to predict the difference between $[eqn]$ and $[eqn]$ . Let $[eqn]$ . A positive value indicates that the current value is greater than the previous value signifying an increase, a negative value indicates that the current value is less than the previous value signifying a decrease and a zero value indicates no change between the current and previous values.

At the first stage, we identify principal lags (PLs) by using linear models with only the first few lags as predictors to obtain less lags while keeping as much of causal effect of lags on data as possible. Specifically, for a given number $[eqn]$ (assume $[eqn]$ here), we consider the following linear models with the $[eqn]$ lags of $[eqn]$ as predictors $[eqn]$ :

[eqn]

where $[eqn]$ is the sample size (assumed $[eqn]$ , $[eqn]$ is the $[eqn]$ lag of $[eqn]$ , and $[eqn]$ is the unknown regression parameter, $[eqn]$ , $[eqn]$ follows a normal distribution with mean of 0 and a variance of $[eqn]$ .

The Bayesian Information Criterion (BIC) is a statistical measure used for model selection from a finite set of models. It is based on the likelihood function and incorporates a penalty term for the number of parameters in the model to avoid overfitting. BIC helps in identifying the model that best explains the data while balancing model complexity and goodness of fit. Lower BIC indicates a better model. We use the BIC as a measure for selecting lags. Let $[eqn]$ be the BIC in the linear model with the last p lags. If

[eqn]

then $[eqn]$ are called the principal lags (PLs) of $[eqn]$ , and $[eqn]$ is called the number of PLs. Suppose that $[eqn]$ is the number of PLs of $[eqn]$ . Let $[eqn]$ denote the vector consisting of the PLs $[eqn]$ .

At the second stage, we propose the maximum correlation prediction for time series forecasting, which links the PLs at the current point to the PLs across all past time points by computing the correlation coefficients and find the optimal PLs that maximize the correlation coefficient. Specifically, for $[eqn]$ , the correlation coefficient of $[eqn]$ and $[eqn]$ given by

[eqn]

where $[eqn]$ and $[eqn]$ . If

[eqn]

then $[eqn]$ is called the most correlated vector of $[eqn]$ .

Note that the difference between $[eqn]$ and $[eqn]$ depends on its neighboring lags, i.e., PLs, while the difference between $[eqn]$ and $[eqn]$ , depends on its neighboring PLs. Since $[eqn]$ and $[eqn]$ are most correlated, it is reasonably assumed that $[eqn]$ will likely be close to $[eqn]$ ., which implies that $[eqn]$ can be used as the predicted value of $[eqn]$ , i.e.,

[eqn]

where $[eqn]$ denotes the predicted difference between $[eqn]$ and $[eqn]$ . Thus,

[eqn]

where $[eqn]$ denotes the predicted value of $[eqn]$ . The algorithms for time series forecasting are presented in Supplement S1, Fig S1 in S1 File.

Maximum correlation machine learning (MCML)

In time series analysis, a tie-based split is needed to separate the data into a training set and a test set. Here, the earlier portion of the data is used for training and the later portion is used for testing. This ensures that the prediction does not learn information from the future, when making predictions on the test set. Our machine learning process for forecasting involves the following steps:

Step 1. Split data into a training set denoted by $[eqn]$ and a testing set denoted by $[eqn]$ , wherein $[eqn]$ is the number of normalized new cases on the $[eqn]$ day, $[eqn]$ , and the training set includes 70–80% of the data while the test set includes the remaining 20–30% of the data.

Step 2. Apply the PLA to identify $[eqn]$ , the PLs of $[eqn]$ , and let $[eqn]$ denote the vector consisting of the PLs.

Step 3. Apply the maximum correlation prediction to calculate the most correlated vector of $[eqn]$

[eqn]

and then calculate the predicted value of $[eqn]$ through the equation

[eqn]

Step 4. Similarly, calculate the next predicted values of $[eqn]$ through the equation

[eqn]

where $[eqn]$ is the most correlated vector of $[eqn]$ .

Step 5. Calculate the three metrics: mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE) for evaluation of the accuracy, which are defined as follows:

[eqn]

[eqn]

and

[eqn]

We compared the new method to conventional prediction algorithms. Autoregressive Integrated Moving Average (ARIMA) is a statistical model that combines autoregression (AR), differencing (I), and moving average (MA) to model and forecast time series data, especially when the data show trends or non-stationarity. ARMA Errors refers to modeling the residuals (errors) of a time series model using an ARMA (Autoregressive Moving Average) process to account for autocorrelation in the errors. Trigonometric Seasonality models seasonal patterns using sine and cosine functions to capture periodic fluctuations in time series data. Box-Cox Transformation represents a power transformation technique used to stabilize variance and make a time series more normally distributed, improving the performance of linear models. Trend and Seasonal Components (TBATS) is a time series forecasting model that includes Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend, and Seasonal components, suitable for complex seasonal patterns and multiple seasonalities. Threshold Autoregressive (TAR) is a nonlinear time series model where the data follow different autoregressive processes depending on whether the series crosses certain threshold values. Prophet Forecasting Model (PROPHET) constitutes a forecasting tool developed by Facebook that decomposes time series into trend, seasonality, and holiday effects, designed to handle missing data and outliers, and to be intuitive for analysts.

Feature-space plots

Feature-space plots map 2 or more features of the time series. From the normalized numbers of new infections, their autocorrelation, and their average mutual information, we initially generated feature-space plots entailing various lags and chain lengths for the latter two readouts. In the case of the data sets under study here, we settled on sliding window durations of 150 days and time lags of 15 days.

For a discrete mapping $[eqn]$ , we calculated the local expansion of the flow by considering the difference between 2 trajectories. The local Lyapunov characteristic exponent λ can be approximated as

[eqn]

for 2 points xn,yn close to each other on the trajectory. In a multidimensional phase space, the maximum Lyapunov exponent (MLE) is the largest value among the pairwise comparisons. Notably, in the particular pairwise comparison of normalized new infections/autocorrelation, autocorrelation/average mutual information, and average mutual information/normalized new infections, a problem of scaling arose, because the mostly larger numbers along the ‘new infections’ axis flattened the trajectories. We applied a dynamic scaling approach, where autocorrelation (AC) and average mutual information (AMI) were brought into comparable ranges by multiplication with the half-maximal number of new cases during the window length (for autocorrelation+1 and directly for AMI).

[eqn]

[eqn]

with max() = maximum of, ncwl = normalized new cases (daily 7-day moving averages per million inhabitants) over the sliding window duration. Further scaling adjustments were explored as described under Results.

Results

Machine learning based on correlation analysis achieves model-free prediction

Based on our observation (S2 in S1 File) that data from time-lagged analysis have potential for anticipating time course progressions, we sought to strengthen the calculations by incorporating machine learning. Starting from observed time series data, the first four lagged variables were identified as principal variables for prediction of the difference between future new cases and past-to-current new cases. We divided the time course series into training (~80%) and test (~20%) sets, then applied the approach dubbed maximum correlation machine learning (MCML) to anticipate the new cases in the test sets. Specifically, we applied the time series values of a chosen sample size (n = 60) from the Australia dataset to the algorithms with the numbers of lags ranging from 2 to 19 and then identified the first 4 lagged variables as principal variables for prediction of the difference between future new cases and past-to-current new cases, because BIC was minimized when the number of lags was 4. The first 4 lagged variables were also identified as principal variables for prediction when we applied the time series values of sample sizes (n = 120, 200) to the linear models (S3, Fig S8 in S1 File). MCML predicted some time series patterns, such as peaks, with a shorter than 4-step time lag or advance (e.g., 2 days lag for peak in the Brazil dataset, 3 days advance for trough in the South Africa dataset, 1 day lag for peak in the Bangladesh dataset). Our MCML for non-linear time series forecasting might be further strengthened by using other hyperparameter tuning techniques or using machine learning based on additional metrics (e.g., kernels), which could guarantee future success.

Fig 1 (Fig 1) illustrates that this algorithm has the capacity to capture time series information from various source data with differing progressions, including trend and auto-correlation patterns. Good forecasting performance was achieved overall, with the best results (among 10 countries analyzed) having been obtained for predicting COVID-19 new cases from Australia and Germany. We also compared the new method to conventional prediction algorithms in terms of MAE, RMSE and MAPE. Among ARIMA, ARMA, TBATS, TAR, and PROPHET, none of the prior algorithms performed satisfactorily, and all fell short in comparison to MCML in terms of MAE, RMSE and MAPE (Fig 2).

Machine learning for model-free prognostication.The time series of normalized new infections in 4 countries were utilized in a machine learning algorithm based on autocorrelation. The black and red color lines represent the true and anticipated numbers of new cases. All values are on the scale of log plus 0.05. The sizes of the training set and the testing set were chosen to be about 80% and 20%. The tables underneath the graphs display three distinct error calculations. MAE = mean absolute error, RMSE = root mean square error, MAPE = mean absolute percentage error. Top left) Prediction of peaks for the data Germany (n = 819) based on the training set (n = 634) and the testing set (n = 185). Top right) prediction of peaks for the data Great Britain (n = 814) based on the training set (n = 592) and the testing set (n = 222). Bottom left) prediction of peaks for the data Australia (n = 819) based on the training set (n = 689) and the testing set (n = 130). Bottom right) Prediction of peaks for the data USA (n = 802) based on the training set (n = 635) and the testing set (n = 167).

Validation of the machine learning for model-free prognostication.The figure provides the comparative performance results for the 5 methods. The MCML method had the best performance of all forecasting methods and generated forecasting results with the lowest MAE, MAPE, and RMSE in each case. Top left) Comparisons of time-series forecasting using our method (maximum correlation machine learning (MCML) with the autoregressive integrated moving average (ARIMA), trigonometric seasonality, Box-Cox transformation, ARMA errors, trend, and seasonal components (TBATS), threshold autoregressive (TAR), Prophet forecasting model (PROPHET) in the Germany dataset (n = 819) based on the training set (n = 634) and the testing set (n = 185). Bottom left) Comparisons of our method (MCML) to the conventional time series methods for the Germany dataset. Top Right) Comparisons of time-series forecasting using our method (MCML) with the autoregressive integrated moving average (ARIMA), trigonometric seasonality, Box-Cox transformation, ARMA errors, trend, and seasonal components (TBATS), threshold autoregressive (TAR), Prophet forecasting model (PROPHET) in the Australia dataset (n = 819) based on the training set (n = 689) and the testing set (n = 130). Bottom right) Comparisons of our method (MCML) to canonical time series methods for the Australia dataset.

The machine learning algorithm is generally applicable to non-linear time series, regardless of the data origin. Therefore, we validated it in financial fluctuations of the Dow-Jones Index as well as population dynamics of blowfly reproduction, both with convincing results (Fig 3). The MCML algorithm works well for time series forecasting when the series exhibits repeating patterns or local dynamics, where short-term behavior mirrors past similar periods. However, it experiences confines when limited data are available. Also, as is the case for all forecasting algorithms, MCML quality declines for very noisy data, which obscure the underlying patterns and relationships to be learned.

Analyses of time series from various sources.Non-linear time series data were obtained from financial markets and from entomology observations. Top and middle) Stock market fluctuations represent classical non-linear time series. We retrieved daily data for “INDEX_US_DOW JONES GLOBAL_DJIA” (n = 1521) and applied machine learning based on the training set (n = 1400) and the testing set (n = 121), where the black and red color line represent the true and predicted values of new cases, respectively. The y-axis values are unitless. Top) Prediction of peaks for “Open” Middle) Prediction of peaks for “Close”. Bottom) The Nicholson blowfly experiments were conducted in the 1950s with the intent of learning more about a sheep pest, the blowfly [31,32]. The data involve a system that is nonlinear, has time lags and might be described as non-stationary. Prediction for “eggs” in the data “blowfly97I” (n = 361) using machine learning based on the training set (n = 289) and the testing set (n = 72). Table) The table underneath the graphs display three distinct error calculations. MAE = mean absolute error, RMSE = root mean square error, MAPE = mean absolute percentage error.

Maximum local Lyapunov exponents from normalized feature-space plots spike with infection peaks

In time-lagged analysis, important conclusions can be inferred from the assessments of autocorrelation and average mutual information. For the purpose of prediction, the results from both algorithms are relevant in correlation to the peaks of the normalized new cases (NC). Feature-space plots with each of these three readouts on the axes (AC, AMI, NC) may be of importance. In this depiction, a rapid increase or decrease in new infections is reflected in a close-to straight line, oscillations generate a near-toroid attractor, while successful management shrinks the torus and moves it closer to the origin. However, sliding window durations and time lags need to be chosen suitably (Fig 4). Also, a problem of scaling arises. The values for autocorrelation are strictly confined to the interval −1 to +1, and the average mutual information tends to be in the low single digits (while not applied here, it can be normalized to an interval of [0,1] when expressed as the mutual information of the variables x and y, divided by the mean entropy of x and y). By contrast, the new cases, even consecutive to having been scaled in the form of 7-day moving averages per million inhabitants, can be in the hundreds. Resultantly, the calculation of maximum local Lyapunov exponents (MLE) is skewed in one dimension of the 3-dimensional return plot. Therefore, another scaling step is required for graphing the readouts for these three parameters in one return plot. The simple normalization of the new cases-axis by division through the highest number over the entire observation period may be meaningful retrospectively, but this adjustment does not support the progressive day-by-day analysis, in which another, higher peak may arise with every new data point. We therefore sought to apply a normalization that covers only the observation period used for the time-lagged analysis (i.e., the observation window in days; as the length stays constant and the wavelet moves with every new data point, this equates to a dynamic scaling that adjusts day-by-day). Notably, dividing the normalized new cases by their largest value within the sliding window duration seemed to compress the high end of the spectrum near the maximum of 1.0 relative units (because, during an upswing, every day introduces a new highest value). By contrast, the scaling up of average mutual information and (autocorrelation + 1) by multiplication with the half-maximal number of new cases during the observation window brought the quantities for all three parameters onto comparable scales, and in consequence the spikes of the MLE values correlated much better with the peaks of new cases per day than did the MLE for the unscaled graphs (Fig 5A, 5B). When this scaling method was corroborated on multiple data sets, the MLE values spiked coincidingly with or ahead of peaks in new infections (Fig 6A). Even after these adjustments, further scrutiny identified minor remaining discrepancies in the scales of the return plot axes. We therefore tested an additional arbitrary scaling step, which used multiplication with a constant to bring to scale the values of one suboptimally matched axis. This achieved further improvement in aligning the MLE spikes with the peaks in normalized new cases (Fig 6D).

Feature-space plots for various sliding window durations and time lags.For the South Africa data, the three readouts, ac, ami, normalized new cases (nc) are graphed pairwise against each other. A) increasing sliding window durations over 3 time lags. B) increasing time lags over 3 sliding window durations.

Feature-space plots and local Lyapunov exponents.A,B) The data for South Africa were analyzed. The panels on the left display the feature-space plots for the raw data. In the middle panels, ac + 1 and ami were scaled up by multiplication with the half-maximal value of new cases during the sliding window durations. In the right panels, the values for the new cases were divided by their maximum during the observation window. A) Feature-space plots. Shown are the pairwise feature-space plots for average mutual information (ami), autocorrelation (ac), and new cases (7-day moving average per million inhabitants) (nc), comprising string lengths of 150 days and time lags of 15 days. B) Lyapunov exponents over time. In the upper panel, the individual Lyapunov characteristic exponents are shown in comparison to the suitably scaled new cases over time, for each pair of readouts (i.e., pairwise among nc = new cases, ac = autocorrelation, ami = average mutual information). The red trace indicates the normalized new cases per day. The bottom panel displays the maximum Lyapunov exponents (MLE) over time (blue line) in comparison to the normalized new cases per day (red line).

Return plot scaling and local Lyapunov exponents.A) Lyapunov exponents over time in several countries. For the analysis, ac + 1 and ami were scaled up by multiplication with the half-maximal value of new cases during the sliding window durations. The string lengths are 150 days and time lags are 15 days. MLE = maximum Lyapunov exponent. B) Scaling of feature-space plots and maximum Lyapunov exponents. All univariate wavelet analyses are based on sliding window durations of 150 days and time lags of 15 days. Each row represents a country. The left column displays unscaled data, the middle column shows the scaling up of ac and ami (as in A)), the right column displays the added step of an arbitrary scale adjustment. This step entails (consecutive to the described prior steps) in Egypt ami2, in South Africa nc0.75, in Germany ami1.7, in Sweden ami2, in Brazil ac2.2.*

We approximated a first derivative (change between consecutive days) by calculating the value of t + 1 minus the value of t for the normalized new cases, the autocorrelation, and the average mutual information (S4, Fig S9A in S1 File). While the daily changes in autocorrelation were fairly reflective of the daily changes in new cases, the daily changes in average mutual information contained too much noise to be useful. The differentials in local Lyapunov exponents for each axis of the 3-dimensional return plot impressed as uninformative (S4, Fig S9B in S1 File).

Discussion

The analysis of complex time series has increasingly relied on model building, which makes assumptions on underlying connections, composes non-linear differential equations, runs them through computer analysis, and matches them to the original data. Close congruence is taken as confirmation that the model has validity. Deviations are seen as an impetus to add variables and equations to the model. This approach starts with assumptions and ends with observed data. While it has been widely practiced, its results are contingent with the accuracy of the model built. There is feasibility to obtain information directly from the observed data without relying on model building first.

For the analysis of non-linear time series, lagged variables have proven useful for characterizing the curves of events versus time. Dynamic models (e.g., ARMA models, ARDL models, etc.) have been developed using linear combinations of various types of lagged variables. In the present investigation, we find that the values for average mutual information and autocorrelation/correlation (calculated from distinct lags of univariate wavelet analysis) can be processed to anticipate spikes in new infections. Rather than relying on a preconceived model, this strategy solely assumes that a) the time series under study does not represent a Markov chain but that its history influences and constrains its future development b) the response to disturbances is constrained by the attractor of the system. Hence, information gained from the past enables at least probabilistic projections on the future. Beyond infectious progression, this approach is applicable to time series of any origin (and has been validated here with data from financial markets and population dynamics).

Studies of complex occurrences have utilized harmonic analysis, adaptive landscapes, and conceptual spaces [27]. While the present study relies mostly on the first modality, it has incorporated conceptual spaces. Feature-space plots are adjustable to display maximum Lyapunov exponents that spike concomitantly with, or preceding to a peak in new infections. Proof-of-principle is established that forecasting of observed complex data is feasible on the basis of analytics, without the need for fitting models to the raw data first. There is opportunity for expansions of this approach, such as the inclusion of embedding dimensions or of Fourier analysis. Regarding the machine learning methodology, the inclusion of deep learning techniques can be explored [28–30].

Enabling accurate prediction of an upcoming peak in disease dissemination requires investigations, as to whether the most suitable basis is provided by the differential equations of a model or by the analysis of observed non-linear data or by any combination that connects those distinct approaches. It is reasonable to assume that combinations of model-based analytics with direct analytics of the noisy non-linear data will have the best prospects for mapping and predicting the course of complex phenomena (Fig 7). The approaches outlined here are not limited to epidemics or pandemics but may also be of value for the assessment or the prediction of instabilities in a plethora of other non-linear time series.

Analysis of complex time series.Conceptualization of alternative approaches to the study of non-linear data (the graph represents a time series measured in any applicable units). The top row (blue text) represents the model building approach, which makes assumptions, expresses them in sets of differential equations, and then tests their fit to existing measured data. The bottom row (green text) depicts the analysis of noisy complex data, which does not depend on model building but transforms the data for gaining insight. Strategies for cross-validation could improve the ability to make predictions (red text).

Conclusion

Non-linear time series do not have closed form solutions. Under existing paradigms, forecasting is achieved via fitting of models. Analytics of noisy, non-linear data can accomplish description, which implies the possibility of some prognostication. Machine learning on the basis of correlation coefficient, utilizing about 80% of the time series as training sets, was able to generate excellent predictions for the time series progression. Time-lagged analysis and feature-space plots require some adjustments of conventional approaches. They are partially successful.

Supporting information

S1. FileSupplement.(DOCX)

Bibliography32

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Wang K, Ding L, Yan Y, Dai C, Qu M, Jiayi D, et al. Modelling the initial epidemic trends of COVID-19 in Italy, Spain, Germany, and France. P Lo S One. 2020;15(11):e 0241743. doi: 10.1371/journal.pone.0241743 33166344 PMC 7652319 · doi ↗ · pubmed ↗
2Khan ZS, Van Bussel F, Hussain F. A predictive model for Covid-19 spread - with application to eight US states and how to end the pandemic. Epidemiol Infect. 2020;148:e 249. doi: 10.1017/S 0950268820002423 33028445 PMC 7588724 · doi ↗ · pubmed ↗
3Bin S, Sun G, Chen CC. Spread of Infectious Disease Modeling and Analysis of Different Factors on Spread of Infectious Disease Based on Cellular Automata. Int J Environ Res Public Health. 2019;16(23).10.3390/ijerph 16234683 PMC 692690931775236 · doi ↗ · pubmed ↗
4Jithesh PK. A model based on cellular automata for investigating the impact of lockdown, migration and vaccination on COVID-19 dynamics. Comput Methods Programs Biomed. 2021;211:106402. doi: 10.1016/j.cmpb.2021.106402 34530391 PMC 8423711 · doi ↗ · pubmed ↗
5Mehta M. Early stage machine learning-based prediction of US county vulnerability to the COVID-19 pandemic: machine learning approach. JMIR Public Health Surveill. 2020;6(3):e 19446.10.2196/19446 PMC 749000232784193 · doi ↗ · pubmed ↗
6Malki Z. The COVID-19 pandemic: prediction study based on machine learning models. Environ Sci Pollut Res Int. 2021;28(30):40496–506.33840016 10.1007/s 11356-021-13824-7PMC 8035887 · doi ↗ · pubmed ↗
7Noy O, Coster D, Metzger M, Atar I, Shenhar-Tsarfaty S, Berliner S, et al. A machine learning model for predicting deterioration of COVID-19 inpatients. Sci Rep. 2022;12(1):2630. doi: 10.1038/s 41598-022-05822-7 35173197 PMC 8850417 · doi ↗ · pubmed ↗
8Scoles S. AI tools aim to speed up outbreak modeling. Science. 2025;390(6778):1089–90. doi: 10.1126/science.aee 6256 41379952 · doi ↗ · pubmed ↗