Probabilistic forecasting of monthly dengue cases using epidemiological and climate signals: A BiLSTM-Negative Binomial Model versus Mechanistic and Count-Model Baselines

Michael Marko Sesay; Antony Ngunyi; Herbert Imboga; Helen Howard; Julia Robinson

PMC · DOI:10.1371/journal.pgph.0005404 · March 27, 2026

Open Access

TL;DR

This study compares different forecasting models to predict dengue cases in Freetown, Sierra Leone, using climate and historical data to help health systems prepare for outbreaks.

Contribution

The paper introduces a novel comparison of probabilistic forecasting models for dengue, including a BiLSTM-Negative Binomial approach, under leakage-safe conditions.

Findings

01

INGARCH-NB achieved the best mean log score across all forecast horizons, indicating superior distributional accuracy.

02

BiLSTM-NB provided reliable uncertainty estimates at longer horizons but with wider predictive intervals.

03

Adding lag-1 climate inputs had minimal impact on model performance for most approaches.

Abstract

Reliable short-term forecasts enable urban health systems to anticipate dengue surges and allocate resources effectively. We assembled monthly dengue case counts for Freetown, Sierra Leone (2015–2024), and compared four probabilistic model families under a leakage-safe, rolling-origin evaluation at 1–3-month horizons: a negative binomial generalized linear model (NB-GLM), a negative binomial INGARCH model (INGARCH-NB), a mechanistic renewal model with negative binomial observations (Renewal-NB), and a bidirectional long short-term memory network with a negative binomial output (BiLSTM-NB). All models used the same seasonal harmonics and autoregressive lags; “light” climate inputs (rainfall, temperature, and relative humidity) were restricted to lag-1 covariates to reflect real-time availability. We evaluated probabilistic performance using mean log score (primary), empirical coverage,…

Linked entities

Species (1)

Homo sapiens

Diseases (1)

dengue

Figures (13)

Fig 1. Monthly reported dengue cases over the study period, showing clear seasonality and recurrent annual peaks in Freetown.

Fig 2. Average monthly case profile highlighting the seasonal transmission pattern in Freetown.

Fig 3. Global-aligned mean log score by model and horizon (higher is better). Barplots of the mean log score on the global-aligned set (n = 32 per horizon). INGARCH-NB ranks best across horizons, with BiLSTM-NB competitive; renewal-based variants are penalized by diffuse distributions, and GLM variants by under-dispersion.

Fig 4. Global-aligned 90% PI coverage by model and horizon (closer to 90% is better). Coverage of nominal 90% predictive intervals on the global-aligned set. GLM variants under-cover substantially; BiLSTM-NB attains very high coverage at longer horizons; INGARCH-NB maintains generally good calibration without extreme width inflation.

Fig 5. Global-aligned median 90% PI width by model and horizon (smaller is sharper). Median 90% PI widths highlight uncertainty inflation for renewal-based variants at longer horizons. GLM variants remain narrow but under-cover; INGARCH-NB offers a better calibration-sharpness trade-off.

Fig 6. Significant wins by model and horizon (DM tests, p < 0.05). Count of pairwise comparisons in which each model significantly improves mean log score over another model under HAC-robust DM testing. INGARCH-NB achieves the most consistent significant wins across horizons.

Fig 7. Global-aligned overview: accuracy, calibration, and width across horizons. Line summaries across horizons showing the accuracy-calibration-sharpness trade-off. INGARCH-NB is consistently strong in log score with generally good calibration and moderate widths; GLM variants under-cover; renewal variants inflate width at longer horizons.

Fig 8. Seasonal log score patterns (aligned): BiLSTM-NB vs INGARCH-NB at h = 1. Month-of-year mean log scores on the aligned set (small per-month counts shown). Both models perform best around months 3–4 and deteriorate mid-year; INGARCH-NB is typically more stable across months.

Fig 9. Seasonal log score patterns (aligned): BiLSTM-NB vs INGARCH-NB at h = 2. Month-of-year comparison at h = 2 shows similar seasonal structure and mid-year difficulty; interpret descriptively due to small per-month sample sizes.

Fig 10. Seasonal log score patterns (aligned): BiLSTM-NB vs INGARCH-NB at h = 3. Month-of-year comparison at h = 3 shows persistent mid-year performance degradation; INGARCH-NB generally remains less variable than BiLSTM-NB.

Fig 11. Regime-stratified mean log score vs horizon. Points show mean log score by model, horizon, and regime (non-outbreak vs outbreak). Higher (less negative) is better. The outbreak subset is small, so dispersion across models is expected.

Fig 12. Era-based predictive accuracy (2021–2024). Mean log score by model and horizon on the 2021–2024 aligned subset (higher is better).

Fig 13. Era-based calibration and sharpness (2021–2024). Empirical coverage (50%, 90%) and median predictive-interval widths on the 2021–2024 aligned subset.

Taxonomy

Topics: Mosquito-borne diseases and control · COVID-19 epidemiological studies · Flood Risk Assessment and Management

Full text

Introduction

Dengue fever remains one of the most pervasive vector-borne diseases worldwide, affecting tropical and subtropical regions with an expanding geographic footprint. The spread of dengue continues to accelerate, driven by recurring outbreaks that cause substantial morbidity and strain public health systems. This expansion is fueled by complex interactions between climatic conditions, which shape mosquito breeding habitats and virus survival, and rapid urbanization, which increases human–mosquito contact. These factors collectively amplify transmission dynamics, triggering more frequent outbreaks and posing persistent challenges for disease control. Given these complexities, there is a critical operational need for reliable short-term forecasting tools to enable health authorities to anticipate dengue incidence and allocate resources efficiently [1–3].

Among regions facing emerging dengue risks, West Africa presents epidemiological and surveillance characteristics that warrant focused attention. Accumulating evidence highlights sustained local dengue transmission, contradicting earlier assumptions that cases are primarily sporadic or imported. The region’s ecological and socio-economic context—including seasonal rainfall patterns, temperature variability, and rapid urban growth—influences mosquito population dynamics and dengue transmission potential. This evolving landscape underscores the need for improved situational awareness through enhanced surveillance and data-driven forecasting. Timely, region-specific information is crucial for mobilizing interventions and containing outbreaks that impose considerable health and economic burdens on vulnerable populations [4–6].

Freetown, the capital of Sierra Leone, serves as a pertinent setting for operational dengue forecasting given the availability of routine surveillance data at a monthly cadence and documented dengue activity. Its coastal urban environment and climatic conditions support vector proliferation, creating a practical need for forecasting to inform public health decision-making. Monthly forecasting represents a pragmatic compromise between data availability and operational utility: it aligns with common reporting workflows and supports planning for staffing, diagnostics, and vector-control activities on a near-term horizon. However, because monthly aggregation can obscure rapid shifts in incidence, we emphasize leakage-safe evaluation and robustness checks when comparing model classes at this time scale [7,8].

Monthly dengue counts typically exhibit overdispersion, strong annual seasonality, short serial dependence, and potentially non-linear relationships with environmental drivers, posing challenges for standard time-series approaches [9,10]. Negative binomial generalized linear models (NB-GLMs) offer interpretable covariate effects and handle overdispersion, but they may inadequately represent temporal feedback dynamics [11,12]. Negative binomial INGARCH models explicitly incorporate the dependence of the conditional mean on both past observations and past conditional means, providing an observation-driven approach for count time series [13,14]. Renewal models link incidence to a time-varying reproduction number (Rt) and a serial-interval kernel, supporting epidemiological interpretation while remaining parsimonious [15]. Modern bidirectional long short-term memory (BiLSTM) architectures with negative binomial output heads can learn non-linear patterns while producing probabilistic count forecasts [16]. However, time-series machine learning remains susceptible to information leakage through improper feature construction and validation design, necessitating careful feature timing and rolling-origin evaluation.

Despite extensive methodological development, leakage-safe and aligned comparisons of regression baselines, observation-driven count models, mechanistic renewal formulations, and deep sequence models for monthly dengue forecasting in West Africa remain limited [17]. We address this gap through an aligned, expanding-window evaluation in Freetown comparing four models: NB-GLM, INGARCH-NB, Renewal-NB, and a BiLSTM-NB architecture featuring autoregressive skip connections and optional isotonic calibration. We analyze monthly reported dengue cases in Freetown (2015–2024) alongside monthly rainfall, air temperature, and relative humidity aggregates as potential environmental drivers. To reflect operational feasibility, we use a “light” climate feature set (limited to three variables) and apply conservative lagging rules; we also report sensitivity analyses examining alternative climate specifications and key mechanistic assumptions. All models incorporate 12-month harmonic terms and autoregressive lags ( $[eqn]$ ) to capture seasonal and short-term dependence, and we enforce leakage-safe timing for all inputs.

We employ an expanding-window rolling origin protocol for forecast horizons of $[eqn]$ months, using a minimum training length of 48 months to stabilize seasonal estimation. Evaluation prioritizes the mean log score as a strictly proper scoring rule, alongside 50% and 90% prediction-interval coverage and median interval widths, to summarize calibration and sharpness. Distributional calibration is assessed using probability integral transform (PIT) diagnostics adapted for counts, while the statistical significance of forecast differences is tested using Diebold–Mariano tests with Newey–West standard errors on aligned issue–target indices [18]. Our contributions include:

(i) Leakage-safe feature timing, including conservative lagging of climate inputs; a seed-ensemble BiLSTM-NB with autoregressive skip connections; and optional isotonic calibration to improve reliability.

(ii) A head-to-head comparison of NB-GLM (direct forecasting), INGARCH-NB (observation-driven), Renewal-NB (mechanistic), and BiLSTM-NB under shared seasonal and autoregressive structure.

(iii) Aligned backtesting enabling fair Diebold–Mariano comparisons, with unaligned results preserved in supplementary materials.

(iv) Operational evaluation emphasizing proper scoring rules, reliability, and sharpness for public health decision support, alongside robustness checks for key modeling assumptions.

The remainder of this paper is organized as follows: Section 2 describes data sources, feature engineering, model formulations, experimental setup, and evaluation metrics. Section 3 presents comparative results, diagnostic analyses, and robustness checks. Section 4 discusses implications for operational dengue forecasting in resource-limited settings and identifies future research directions.

Materials and methods

Study setting, outcome, and covariates

Study setting and time span.

We curated a dengue surveillance and climate dataset for Freetown, Sierra Leone, spanning January 2015 to December 2024. The dataset links monthly dengue case totals to monthly meteorological summaries to support leakage-safe probabilistic forecasting at 1–3 month horizons (see S1 Data).

Dengue surveillance outcome.

Let $[eqn]$ denote the number of reported dengue cases in month t. Each observation represents the total count of laboratory-confirmed and clinically suspected dengue infections recorded in the Freetown catchment during that calendar month.

Climate covariates.

Monthly climate covariates were obtained from publicly available meteorological sources and aligned to the dengue reporting calendar: precipitation (mm; monthly total), near-surface air temperature (°C; monthly mean), and relative humidity (%; monthly mean). These covariates were selected because they are plausibly linked to Aedes mosquito ecology and dengue transmission and because they are readily available in operational settings.

Exploratory summary.

To characterize seasonality and interannual variability at the monthly scale, we summarize the dengue series using (i) a time plot and (ii) an average monthly profile (Figs 1 and 2). The series exhibits pronounced annual seasonality with recurrent peaks, motivating the inclusion of seasonal harmonic terms and autoregressive lags shared across all model classes.

Preprocessing and feature engineering

Calendar alignment and outcome construction.

All records were aligned to a complete monthly calendar from January 2015 to December 2024. The analysis outcome is the monthly count Yt in month t, with $[eqn]$ by construction.

Seasonal harmonics and autoregressive lags.

To represent annual seasonality, we construct 12-month trigonometric harmonics from the calendar month index $[eqn]$ :

[eqn]

Short- and medium-range dependence is represented by integer lags of the case series.

[eqn]

Lags are used only when available; issue months without the required lagged values (e.g., at the beginning of the series) are excluded for the relevant model/horizon.
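As a minimal sketch of the feature construction described above, the following Python snippet builds one sine/cosine pair with a 12-month period and the count lags {1, 2, 3, 12} (the lag set stated later for BiLSTM-NB). The function names, the choice of a single harmonic pair, the toy series, and the convention that a lag-l feature at index t equals y[t - l] are illustrative assumptions, not the authors' implementation.

```python
import math


def harmonic_features(month):
    """Sine/cosine pair for a calendar month in 1..12 (assumed single 12-month harmonic)."""
    angle = 2.0 * math.pi * month / 12.0
    return math.sin(angle), math.cos(angle)


def lagged_counts(y, t, lags=(1, 2, 3, 12)):
    """Return [y[t - l] for l in lags], or None when any lag precedes the series start."""
    if any(t - l < 0 for l in lags):
        return None
    return [y[t - l] for l in lags]


# Toy monthly series starting in January (hypothetical values).
y = [5, 7, 9, 14, 20, 31, 40, 28, 15, 9, 6, 4, 6, 8]
months = [(i % 12) + 1 for i in range(len(y))]

rows = []
for t in range(len(y)):
    lags = lagged_counts(y, t)
    if lags is None:
        continue  # issue excluded: required lags are not yet available
    s, c = harmonic_features(months[t])
    rows.append({"t": t, "sin12": s, "cos12": c, "lags": lags})
```

Only the last two months of the toy series survive, because the lag-12 feature is undefined for the first twelve indices; this mirrors the exclusion rule stated above.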

Climate feature set and leakage-safe timing

Environmental drivers are limited to a “light” set of at most three monthly climate aggregates (precipitation, temperature, and relative humidity) to reflect operational feasibility. To prevent look-ahead, climate covariates used for forecasts issued in month t are restricted to values available at or before t. In the primary specification, we adopt conservative lagging:

(i) For all models, precipitation is used at lag-1 by design.

(ii) For NB-GLM, INGARCH-NB, and Renewal-NB, temperature and humidity are used at lag-1 in the primary analysis. Contemporaneous values are considered only in a sensitivity analysis under an explicit assumption about reporting latency.

(iii) For BiLSTM-NB, all climate inputs are strictly lagged by one month: $[eqn]$

Let xt denote the vector of selected climate covariates at month t. Climate covariates are standardized within each training fold by

[eqn]

where $[eqn]$ and $[eqn]$ are computed only on the current training window and applied to the corresponding validation/test issues. Trigonometric harmonics $[eqn]$ are scaled analogously. Count lags $[eqn]$ are left unscaled.
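The fold-wise standardization can be sketched as follows. This is a minimal illustration, assuming z-scoring with the population standard deviation and a fallback of 1.0 for constant columns; the function names and rainfall values are hypothetical.

```python
import statistics


def fit_scaler(train_values):
    """Estimate mean and standard deviation on the training window only."""
    mu = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)
    return mu, (sd if sd > 0 else 1.0)  # guard against constant columns


def apply_scaler(values, mu, sd):
    """Apply training-window statistics to any window (train, validation, or test)."""
    return [(v - mu) / sd for v in values]


# Hypothetical lag-1 rainfall values: four training months, one test issue.
rain = [10.0, 20.0, 30.0, 40.0, 100.0]
train, test = rain[:4], rain[4:]

mu, sd = fit_scaler(train)             # statistics never see the test month
z_train = apply_scaler(train, mu, sd)
z_test = apply_scaler(test, mu, sd)
```

The test-month value is scaled with the training mean and standard deviation, so an unusually wet test month shows up as a large z-score rather than shifting the scaling itself.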

Recurrent-model inputs and targets.

For recurrent models, we form leakage-safe supervised sequences with a fixed lookback W = 12 months. The count stream for issue t is

[eqn]

and the auxiliary feature vector is

[eqn]

with climate at lag 1 only. We construct multi-step targets for h = 1,2,3 months ahead,

[eqn]

Only issues for which all required elements of Ct and At are present are used for training and evaluation. Targets Yt+h are never used in feature computation at issue t.
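A minimal sketch of the supervised-sequence construction, assuming the count window for issue t covers the W most recent months ending at t (the exact window definition is not recoverable from the text above, so this indexing is an assumption). The function name and toy series are hypothetical.

```python
def build_instances(y, W=12, horizons=(1, 2, 3)):
    """Return (issue, window, targets) triples using 0-based month indices.

    The window holds y[t-W+1 .. t]; targets are y[t+h]. An issue is kept only
    when the full window and all targets exist.
    """
    instances = []
    for t in range(W - 1, len(y) - max(horizons)):
        window = y[t - W + 1 : t + 1]
        targets = {h: y[t + h] for h in horizons}
        instances.append((t, window, targets))
    return instances


y = list(range(20))  # toy series where the value equals its own index
instances = build_instances(y)
first_issue, first_window, first_targets = instances[0]
```

Because the toy values equal their indices, it is easy to confirm that every window ends at the issue month and every target lies strictly after it.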

Audit trail for alignment.

To support auditability, we persist per-issue $[eqn]$ keys and fold-specific scaling statistics used at each issue. These artifacts allow exact regeneration of aligned evaluation sets and verification of leakage safeguards.

Missing data handling.

After alignment to a complete monthly calendar (January 2015-December 2024; T = 120 months), we verified the completeness of the dengue outcome and the selected “light” climate covariates (precipitation, temperature, and relative humidity). The aligned analysis table contains no missing values in Yt or in any selected climate variable (0/120 missing months for each field; 0.0% missing overall). Consequently, no imputation was performed and no months were dropped due to missingness in the primary analysis.

Sensitivity (not applicable for this dataset).

Because there are no missing values in the aligned series, missing-data sensitivity analyses (e.g., alternative imputation strategies or complete-case versus imputed comparisons) are not applicable. We state this explicitly to document that the absence of imputation reflects data completeness rather than an omitted methodological detail.

Operational procedure under missingness (deployment guidance).

In prospective operational settings where climate feeds may be delayed or incomplete, missing covariates should be handled within each rolling-origin training fold to preserve leakage safety (i.e., imputation parameters computed using training data only, then applied to the corresponding forecast issue). Missing dengue outcomes should not be imputed as forecast targets; instead, affected issue-target pairs should be excluded from scoring and clearly logged in the per-issue audit trail (issue date, target date, and missingness flags).

Models

All models produce probabilistic forecasts for monthly counts using a negative binomial (NB2) observation model with mean $[eqn]$ and dispersion $[eqn]$ , where $[eqn]$ . Except where noted, we fit separate direct models per horizon h. Model inputs follow the leakage-safe timing rules described under “Preprocessing and feature engineering” and “Climate feature set and leakage-safe timing”, and all standardization parameters are estimated on training folds only.

Probabilistic forecasting models

All approaches produce probabilistic forecasts for monthly counts under a negative-binomial NB2 observation model with mean $[eqn]$ and dispersion $[eqn]$ , such that $[eqn]$ . Unless otherwise stated, models are trained and evaluated separately for each forecast horizon $[eqn]$ using a direct strategy. Model inputs follow the leakage-safe timing rules described under “Preprocessing and feature engineering” and “Climate feature set and leakage-safe timing”, and all standardization parameters are estimated on training folds only. For likelihood computations, we use the (r,p) parameterization with $[eqn]$ and $[eqn]$ .

NB-GLM

The negative binomial generalized linear model (NB-GLM) extends the Poisson GLM to accommodate overdispersion commonly observed in dengue counts [11,12]. We adopt the NB2 mean-variance relationship. For horizon $[eqn]$ , the monthly count Yt+h conditional on the information set $[eqn]$ is modeled as

[eqn]

where $[eqn]$ and $[eqn]$ is the overdispersion parameter [19]. As $[eqn]$ , the model approaches the Poisson; larger $[eqn]$ implies greater overdispersion [20]. For likelihood computations we use the (r,p) parameterization

[eqn]

with pmf

[eqn]
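To make the NB2 likelihood concrete, the sketch below evaluates the log pmf under one common (r,p) convention: r = 1/alpha and p = r/(r + mu), which gives mean mu and variance mu + alpha*mu^2. This convention is an assumption, since the mapping used in the paper is not recoverable from the text here; the function names are hypothetical.

```python
import math


def nb2_to_rp(mu, alpha):
    """Map NB2 (mean, dispersion) to (r, p), assuming r = 1/alpha and p = r/(r + mu)."""
    r = 1.0 / alpha
    return r, r / (r + mu)


def nb2_logpmf(y, mu, alpha):
    """Log pmf: lgamma(y+r) - lgamma(r) - lgamma(y+1) + r*log(p) + y*log(1-p)."""
    r, p = nb2_to_rp(mu, alpha)
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(p) + y * math.log1p(-p))


# Numerical check of the NB2 mean-variance relationship for mu = 4, alpha = 0.5.
mu, alpha = 4.0, 0.5
pmf = [math.exp(nb2_logpmf(k, mu, alpha)) for k in range(2000)]
total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma-function ratio for large counts; the same log pmf is the per-forecast log score used for evaluation.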

We use a horizon-specific linear predictor (direct strategy) [21]:

[eqn]

where mt is the calendar month, and $[eqn]$ is a light climate vector with at most three lag-1 covariates chosen from rainfall, air temperature, and relative humidity (see “Climate feature set and leakage-safe timing”). This lag-1 restriction is the primary specification to prevent leakage. If an operational pipeline provides reliable same-month climate readings at issue time t, a contemporaneous variant is treated as a separate sensitivity analysis under an explicit reporting-delay assumption. Seasonal harmonics and climate are standardized on the training folds only; count lags remain unscaled.

Let $[eqn]$ be the set of training indices after respecting maximal lag and the minimum training length. With $[eqn]$ and $[eqn]$ , $[eqn]$ , the log-likelihood is

[eqn]

We estimate $[eqn]$ by maximum likelihood (standard NB2 GLM fitting; implementation details in Supporting Information). All features in xt are computed from information available at issue time t; standardization parameters are learned on training folds and applied to validation/test folds.

For a new issue T,

[eqn]

Point forecasts are $[eqn]$ . Central $[eqn]$ prediction intervals use NB quantiles

[eqn]

e.g., $[eqn]$ (50%) and $[eqn]$ (90%).
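Central prediction intervals from NB quantiles can be sketched with a simple CDF walk. The snippet assumes the (r,p) convention r = 1/alpha, p = r/(r + mu) and defines the quantile as the smallest count whose CDF reaches the requested probability; the function names and the iteration cap are hypothetical.

```python
def nb2_quantile(q, mu, alpha, max_y=100_000):
    """Smallest y with P(Y <= y) >= q, using the pmf ratio recursion."""
    r = 1.0 / alpha
    p = r / (r + mu)
    pmf = p ** r              # P(Y = 0)
    cdf = pmf
    y = 0
    while cdf < q and y < max_y:
        # P(Y = y+1) / P(Y = y) = (y + r) / (y + 1) * (1 - p)
        pmf *= (y + r) / (y + 1) * (1.0 - p)
        y += 1
        cdf += pmf
    return y


def central_interval(mu, alpha, level):
    """Equal-tailed interval with nominal coverage `level` (e.g. 0.5 or 0.9)."""
    tail = (1.0 - level) / 2.0
    return nb2_quantile(tail, mu, alpha), nb2_quantile(1.0 - tail, mu, alpha)


pi50 = central_interval(4.0, 0.5, 0.50)
pi90 = central_interval(4.0, 0.5, 0.90)
```

Because counts are discrete, the realized coverage of such an interval is at least the nominal level under the model, which is worth keeping in mind when reading empirical coverage tables.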

INGARCH-NB

The integer-valued GARCH-type model with negative-binomial innovations (INGARCH-NB) adapts volatility-style feedback to count data, capturing short-memory dependence and overdispersion frequently observed in dengue surveillance series [13,22]. We adopt the NB2 mean-variance form to maintain consistency across model families.

Let $[eqn]$ be the natural filtration. We assume

[eqn]

with overdispersion $[eqn]$ . For likelihood evaluation, we map to (r,p) with $[eqn]$ and $[eqn]$ .

We use a log link with seasonal harmonics, observed-count feedback, and conditional-mean feedback:

[eqn]

so that $[eqn]$ . The $[eqn]$ transform stabilizes the feedback at zero counts and avoids numerical issues [14]. The term in $[eqn]$ provides persistence in the conditional mean. Seasonal harmonics capture annual dengue cyclicality. In the main analysis, we omit climate regressors to keep the comparison focused on endogenous dynamics; a climate-augmented variant can be evaluated as a sensitivity check.
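The recursion described above can be sketched as a first-order log-link update with observed-count feedback through log(y + 1), conditional-mean feedback through the previous log mean, and one harmonic pair. The coefficient names (omega, a, b, g_sin, g_cos), the first-order structure, and the single harmonic are assumptions for illustration; the paper's exact specification is not recoverable from the text here.

```python
import math


def ingarch_log_link_means(y, months, omega, a, b, g_sin, g_cos, lam0):
    """Return conditional means [lam_0, ..., lam_{T-1}] with lam_0 = lam0.

    log lam_t = omega + a*log(y_{t-1} + 1) + b*log(lam_{t-1})
                + g_sin*sin(2*pi*m_t/12) + g_cos*cos(2*pi*m_t/12)
    """
    lam = [lam0]
    for t in range(1, len(y)):
        angle = 2.0 * math.pi * months[t] / 12.0
        log_lam = (omega
                   + a * math.log(y[t - 1] + 1)   # defined even when y_{t-1} = 0
                   + b * math.log(lam[-1])        # persistence in the conditional mean
                   + g_sin * math.sin(angle)
                   + g_cos * math.cos(angle))
        lam.append(math.exp(log_lam))
    return lam


y = [0, 3, 5, 2]
months = [1, 2, 3, 4]
lam = ingarch_log_link_means(y, months, omega=1.0, a=0.5, b=0.0,
                             g_sin=0.0, g_cos=0.0, lam0=2.0)
```

With b = 0 and no seasonality the update reduces to lam_t = exp(omega) * sqrt(y_{t-1} + 1), which makes the role of the log(y + 1) transform at zero counts easy to check by hand.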

Given y1:T and initialization $[eqn]$ , the log-likelihood is

[eqn]

where $[eqn]$ , $[eqn]$ , and $[eqn]$ is generated recursively from (17). We compute the maximum likelihood estimator $[eqn]$ using a box-constrained quasi-Newton method (L-BFGS-B), with constraints $[eqn]$ and $[eqn]$ to discourage explosive feedback. In practice, finite-difference gradients with stable initialization (e.g., $[eqn]$ ) are adequate.

Multi-horizon forecasting

To obtain forecasts at horizons h = 1,2,3, we use an iterated predictive scheme consistent with the INGARCH recursion. For h = 1, the predictive distribution is $[eqn]$ with $[eqn]$ given by (17). For h > 1, we propagate uncertainty forward by Monte Carlo simulation: for $[eqn]$ , we draw $[eqn]$ from the h = 1 predictive distribution, update the recursion to obtain $[eqn]$ , draw $[eqn]$ , and continue up to T + h. The resulting empirical distribution $[eqn]$ defines the probabilistic forecast, from which we compute point forecasts (mean or median), prediction intervals, and proper scoring rules. We use B large enough to stabilize scores (Supporting Information).
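The iterated Monte Carlo scheme can be sketched as below. For simplicity the sketch simulates every horizon, including h = 1, whereas the text uses the closed-form predictive distribution at h = 1. The one-step mean function `toy_step_mean`, its coefficients, the inverse-CDF sampler, and the (r,p) convention r = 1/alpha, p = r/(r + mu) are all illustrative assumptions.

```python
import math
import random


def nb2_sample(mu, alpha, rng, max_y=100_000):
    """Draw one NB2 count by inverting the CDF with the pmf ratio recursion."""
    r = 1.0 / alpha
    p = r / (r + mu)
    u = rng.random()
    pmf = p ** r
    cdf = pmf
    y = 0
    while cdf < u and y < max_y:
        pmf *= (y + r) / (y + 1) * (1.0 - p)
        y += 1
        cdf += pmf
    return y


def simulate_forecast(y_last, lam_last, h, step_mean, alpha, B=2000, seed=0):
    """Return B simulated counts for month T + h by iterating the mean recursion."""
    rng = random.Random(seed)
    draws = []
    for _ in range(B):
        y_prev, lam_prev = y_last, lam_last
        for _step in range(h):
            lam = step_mean(y_prev, lam_prev)    # update the recursion
            y_prev = nb2_sample(lam, alpha, rng)  # draw the next count
            lam_prev = lam
        draws.append(y_prev)
    return draws


def toy_step_mean(y_prev, lam_prev):
    # Hypothetical log-link recursion without seasonality, for illustration only.
    return math.exp(0.5 + 0.4 * math.log(y_prev + 1) + 0.3 * math.log(lam_prev))


draws_h1 = simulate_forecast(10, 8.0, h=1, step_mean=toy_step_mean, alpha=0.3)
draws_h3 = simulate_forecast(10, 8.0, h=3, step_mean=toy_step_mean, alpha=0.3)
ordered = sorted(draws_h3)
q05, q95 = ordered[int(0.05 * len(ordered))], ordered[int(0.95 * len(ordered))]
```

Fixing the seed makes the empirical forecast distribution reproducible, which matters when scores from different models are compared on the same issue-target pairs.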

Renewal-NB

We adopt an epidemiological renewal model with an NB2 observation process as a mechanistic baseline [15,23]. At a monthly resolution, the serial-interval kernel should be interpreted as an effective kernel that aggregates within-month transmission and reporting delays; we therefore evaluate kernel sensitivity (see “Sensitivity and generalizability analyses”).

Let $[eqn]$ denote the information set up to month $[eqn]$ . Counts follow

[eqn]

with $[eqn]$ and $[eqn]$ . The discrete renewal equation is

[eqn]

where $[eqn]$ is a nonnegative kernel. In the baseline specification, we use S = 3 with $[eqn]$ as a front-loaded effective kernel at the monthly scale. The effective reproduction number is seasonally modulated,

[eqn]

ensuring Rt > 0.
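A minimal sketch of the renewal mean, assuming the standard form in which the expected count is the reproduction number times a kernel-weighted sum of the previous S counts, with a log-linear seasonal reproduction number. The kernel weights (0.6, 0.3, 0.1) are purely illustrative and are not the paper's baseline kernel; the coefficient and function names are also hypothetical.

```python
import math


def renewal_mean(y, t, month, w, b0, b_sin, b_cos):
    """Return (lam_t, R_t) for 0-based index t, given a kernel w of length S.

    lam_t = R_t * sum_{s=1..S} w[s-1] * y[t-s]
    R_t   = exp(b0 + b_sin*sin(2*pi*month/12) + b_cos*cos(2*pi*month/12))
    """
    S = len(w)
    if t < S:
        raise ValueError("renewal recursion needs S previous counts")
    pressure = sum(w[s - 1] * y[t - s] for s in range(1, S + 1))
    angle = 2.0 * math.pi * month / 12.0
    R_t = math.exp(b0 + b_sin * math.sin(angle) + b_cos * math.cos(angle))
    return R_t * pressure, R_t


y = [10, 20, 30, 40]
w = (0.6, 0.3, 0.1)  # illustrative front-loaded weights summing to one
lam_flat, R_flat = renewal_mean(y, t=3, month=4, w=w, b0=0.0, b_sin=0.0, b_cos=0.0)
lam_seas, R_seas = renewal_mean(y, t=3, month=3, w=w, b0=0.0, b_sin=0.2, b_cos=0.0)
```

The exponential link keeps the reproduction number positive for any coefficient values, and with all coefficients at zero the mean reduces to the kernel-weighted average of recent counts.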

Kernel sensitivity.

To address potential misspecification at a monthly resolution, we evaluate alternative kernel supports and shapes as a sensitivity analysis:

Support: $[eqn]$ months.

Shapes: (i) front-loaded geometric decay $[eqn]$ with $[eqn]$ , and (ii) diffuse kernels (e.g., discretized gamma) normalized to sum to one.

We refit the renewal model under each kernel and compare probabilistic scores and calibration diagnostics to determine whether conclusions about renewal performance are robust to kernel choice.

Given y1:T and support S, the renewal recursion is defined for t > S. The log-likelihood under NB2 is

[eqn]

with $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ given by (21)–(22).

We obtain the MLE $[eqn]$ via box-constrained L-BFGS-B, with $[eqn]$ and a soft bound on seasonal amplitude (e.g., $[eqn]$ ) to prevent unrealistically large forcing. Gradients follow from $[eqn]$ where $[eqn]$ . In particular,

[eqn]

and $[eqn]$ is obtained from (24). In practice, finite-difference derivatives are sufficient due to the low parameter dimension.

Multi-horizon forecasting.

For h = 1, the forecast mean is $[eqn]$ . For h > 1, we generate probabilistic forecasts by iterating the renewal recursion with Monte Carlo simulation: we draw future paths from the NB2 predictive distribution and update the renewal term using simulated counts, yielding an empirical forecast distribution for YT+h. Prediction intervals are computed from the corresponding empirical quantiles.

BiLSTM-NB

The Bidirectional Long Short-Term Memory model with a Negative-Binomial output head (BiLSTM-NB) couples deep sequence representations with a count likelihood tailored to overdispersed dengue surveillance data. It learns non-linear dependencies while producing horizon-specific predictive distributions suitable for probabilistic evaluation [24,25].

Inputs and leakage-safe construction.

At a monthly cadence, each training instance at issue time t comprises (i) a univariate count window

[eqn]

and (ii) an auxiliary vector

[eqn]

where $[eqn]$ are the seasonal harmonics, the autoregressive lags are {1,2,3,12}, and the “light” climate set uses up to $[eqn]$ lag-1 features among rainfall, temperature, and relative humidity (see “Climate feature set and leakage-safe timing”). To prevent look-ahead, (a) only lagged climate is used, (b) harmonics and climate are standardized on training folds only, and (c) model selection and calibration are performed using data available within each training fold (details below).

Architecture.

The count window Ct is passed through two stacked bidirectional LSTM layers with 32 units per direction. Let $[eqn]$ denote the final embedding. A dense block with ReLU activation and dropout (rate 0.2) produces a non-linear summary $[eqn]$ . To retain a short-memory linear structure, we include an autoregressive skip that maps the four AR lags directly to a horizon-specific mean adjustment. The network concatenates the learned representation with auxiliary features.

[eqn]

Negative-binomial output head.

For each horizon $[eqn]$ , the model outputs pre-activations $[eqn]$ and $[eqn]$ via affine maps of zt, with an AR-skip term applied to the mean:

[eqn]

where $[eqn]$ . Positivity and numerical stability are enforced with bounded activations:

[eqn]

with $[eqn]$ , $[eqn]$ , and $[eqn]$ the logistic sigmoid [26]. We adopt NB2 with mean $[eqn]$ and variance $[eqn]$ ; equivalently, $[eqn]$ and $[eqn]$ .

Training objective and optimization.

Let Yt+h be the target for horizon h. The per-instance multi-horizon negative log-likelihood is

[eqn]

where $[eqn]$ . We optimize with Adam (learning rate $[eqn]$ ), gradient-norm clipping ( $[eqn]$ ), early stopping (patience 12 epochs), and ReduceLROnPlateau (factor 0.5, patience 6).

Hyperparameter specification and audit trail.

Given the limited sample size, we pre-specify the BiLSTM-NB configuration a priori rather than performing extensive per-fold hyperparameter search. The network uses two BiLSTM layers (32 units per direction), a dense layer (64 units, ReLU) with L2 regularization ( $[eqn]$ ) and dropout (0.2), and a bounded NB2 dispersion $[eqn]$ . Training uses up to 150 epochs, a batch size of 16, early stopping on a time-ordered validation tail, and the learning-rate schedule above. Within each rolling-origin training fold, we reserve the final 20% of the training window (time-ordered) as a validation tail for early stopping and learning-rate scheduling; no test-era observations are used for model selection. All fixed settings, preprocessing rules, and random seeds are reported for auditability. Hyperparameter details are listed in S1 Table.

Ensembling and calibration.

To stabilize training, we fit an ensemble of M = 5 models using fixed seeds and form an equal-weight mixture predictive distribution by averaging the component NB probability mass functions. To improve reliability without look-ahead, calibration is learned using forecasts generated strictly within the training window of each rolling-origin fold (sequentially) and then applied unchanged to the corresponding test issues. Specifically, we apply a monotone isotonic post-processing map to PIT-based CDF values computed from the ensemble mixture, using only training-window forecasts and realized outcomes [26].
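The equal-weight mixture can be sketched by averaging the component pmfs pointwise. The five (mean, dispersion) pairs below stand in for the outputs of five seed-specific networks and are hypothetical, as are the function names and the (r,p) convention r = 1/alpha, p = r/(r + mu). The isotonic calibration step is not shown.

```python
import math


def nb2_pmf(y, mu, alpha):
    """NB2 pmf with r = 1/alpha and p = r/(r + mu) (assumed convention)."""
    r = 1.0 / alpha
    p = r / (r + mu)
    return math.exp(math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                    + r * math.log(p) + y * math.log1p(-p))


def mixture_pmf(y, members):
    """Equal-weight average of the component pmfs at count y."""
    return sum(nb2_pmf(y, mu, alpha) for mu, alpha in members) / len(members)


def mixture_cdf(y, members):
    """P(Y <= y) under the mixture, by direct summation."""
    return sum(mixture_pmf(k, members) for k in range(y + 1))


# Hypothetical outputs of M = 5 ensemble members for one issue and horizon.
members = [(8.0, 0.30), (9.5, 0.25), (7.2, 0.40), (8.8, 0.35), (10.1, 0.30)]
total = sum(mixture_pmf(k, members) for k in range(400))
mix_mean = sum(k * mixture_pmf(k, members) for k in range(400))
log_score = math.log(mixture_pmf(9, members))  # log score if 9 cases were observed
```

Averaging pmfs rather than parameters means the mixture is generally not itself negative binomial, so intervals and PIT values are best computed from the mixture CDF rather than from averaged parameters.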

Sensitivity and generalizability analyses

Renewal kernel sensitivity.

To assess the robustness of the mechanistic baseline at monthly resolution, we refit the Renewal-NB model under alternative kernel supports and shapes: (i) support $[eqn]$ months and (ii) kernel shapes, including front-loaded geometric decay $[eqn]$ with $[eqn]$ and diffuse kernels (e.g., discretized gamma) normalized to sum to one. We compare probabilistic scores and calibration diagnostics across kernels.

Climate feature-set sensitivity.

To justify the “light climate” specification, we evaluate an expanded climate feature set (additional lags and/or anomalies) as a sensitivity analysis while preserving leakage-safe timing. Results are reported in the Supporting Information.

Temporal generalizability.

We assess the stability of conclusions under an era-based evaluation by training on an earlier period and evaluating on a later period (details in Results), using the same leakage-safe rolling-origin protocol within the evaluation era.

Experimental setup

We evaluate all models under an expanding-window, rolling-origin design for monthly dengue surveillance (January 2015 to December 2024). Let t index months, and let $[eqn]$ denote all data available up to and including month t (cases, calendar features, and leakage-safe climate covariates). After a minimum training length of 48 months and once all required lagged features are available, each model is refitted $[eqn]$ and issues probabilistic forecasts for horizons $[eqn]$ months ahead, targeting month t + h. This procedure repeats for every eligible issue month, yielding a sequence of out-of-sample predictive distributions and realized outcomes for scoring.
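The expanding-window protocol can be sketched as a generator of issue-target pairs. The sketch assumes 0-based month indices, that issue t trains on months 0 through t, and that the 48-month minimum is the only eligibility rule; model-specific lag and validation requirements are ignored, and the function name is hypothetical.

```python
def rolling_origin_issues(n_months, min_train=48, horizons=(1, 2, 3)):
    """Yield (issue, h, target) triples using 0-based month indices.

    Issue t trains on months 0..t, so it needs t + 1 >= min_train, and the
    target month t + h must lie inside the observed series.
    """
    for t in range(min_train - 1, n_months):
        for h in horizons:
            if t + h < n_months:
                yield (t, h, t + h)


issues = list(rolling_origin_issues(120))  # T = 120 months, January 2015 to December 2024
n_per_horizon = {h: sum(1 for _, hh, _ in issues if hh == h) for h in (1, 2, 3)}
```

Under these simplified rules the counts per horizon are upper bounds; the aligned set reported in Fig 3 has n = 32 per horizon, which reflects further eligibility constraints that this sketch does not attempt to reproduce.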

For every model and eligible (t,h), we record the predictive NB parameters $[eqn]$ , the implied $[eqn]$ with $[eqn]$ and $[eqn]$ , the realized outcome Yt+h, the log predictive score $[eqn]$ , the central 50% and 90% prediction intervals, empirical coverages, interval widths, and randomized PIT values. These per-issue records underpin summary metrics and pairwise Diebold-Mariano tests (see “Evaluation metrics”).

Because some models may be undefined for certain issue months (e.g., due to lag requirements at the beginning of the series), we report two complementary evaluations: (i) model-wise summaries computed on each model’s available issue-target set and (ii) aligned comparisons that restrict to the intersection of issue-target pairs shared by all models for a given horizon. The aligned set is used for Diebold-Mariano tests to ensure like-for-like comparisons.
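The aligned set is simply the intersection of issue-target pairs across models for a given horizon. A minimal sketch, with hypothetical model availability and a hypothetical function name:

```python
def aligned_pairs(available, h):
    """Return sorted (issue, target) pairs shared by every model at horizon h.

    `available` maps a model name to a set of (issue, horizon, target) triples.
    """
    per_model = [
        {(t, target) for (t, hh, target) in triples if hh == h}
        for triples in available.values()
    ]
    if not per_model:
        return []
    return sorted(set.intersection(*per_model))


# Toy availability: the second model only becomes defined from issue 50 onward.
available = {
    "NB-GLM": {(t, 1, t + 1) for t in range(47, 53)},
    "BiLSTM-NB": {(t, 1, t + 1) for t in range(50, 53)},
}
shared = aligned_pairs(available, h=1)
```

Scoring every model on `shared` guarantees that loss differentials in the pairwise tests are computed on identical months.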

Evaluation metrics

We assess probabilistic accuracy, calibration, and sharpness using proper scoring rules and diagnostics for count forecasts [27,28].

Log score (primary).

For a forecast with NB2 predictive distribution $[eqn]$ and observed count $y_i$, the log score is

[eqn]

Using the NB2 pmf,

[eqn]

the mean log score across n forecasts is

[eqn]

Higher $[eqn]$ indicates better probabilistic accuracy. For presentation as a loss, we also report the negative log score $[eqn]$ (smaller is better) [29].
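A minimal sketch of the log score computation follows. It assumes the common NB2 parameterization with mean `mu` and size (inverse dispersion) `theta`, so that the variance is `mu + mu**2 / theta`; the symbols used in the paper's own equations are not recoverable from this text, and the function names are illustrative.

```python
import math


def nb2_logpmf(y, mu, theta):
    """Log pmf of an NB2 count y with mean mu > 0 and size theta > 0 (assumed form)."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def mean_log_score(ys, mus, thetas):
    """Average log predictive density across forecasts (higher is better)."""
    scores = [nb2_logpmf(y, m, k) for y, m, k in zip(ys, mus, thetas)]
    return sum(scores) / len(scores)


# Toy forecasts; the negative log score is the same quantity with the sign flipped.
example = mean_log_score(ys=[0, 3], mus=[1.0, 2.0], thetas=[1.0, 1.5])
negative_log_score = -example
```

Because each term is the log of a probability mass, every per-forecast score is at most zero, which is why the mean log scores reported in the Results are negative.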

Predictive interval coverage.

Calibration is assessed using empirical coverage of central prediction intervals at nominal levels $[eqn]$ . For each forecast i, the equal-tailed interval is

[eqn]

where $[eqn]$ NB denotes the quantile function. The empirical coverage rate is

[eqn]

Well-calibrated forecasts satisfy $[eqn]$ ; undercoverage indicates overconfidence, and overcoverage indicates overly diffuse forecasts [30].
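The coverage calculation can be sketched as below, assuming an equal-tailed interval whose bounds are the α/2 and 1 − α/2 quantiles of the predictive NB2 distribution, with the quantile taken as the smallest count whose cumulative probability reaches the requested level. The mean/size parameterization and all function names are assumptions for illustration.

```python
import math


def nb2_pmf(y, mu, theta):
    """NB2 pmf with mean mu and size theta (assumed parameterization)."""
    return math.exp(
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def nb2_quantile(p, mu, theta, max_count=1_000_000):
    """Smallest integer q with P(Y <= q) >= p, found by summing the pmf."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q, cdf = 0, nb2_pmf(0, mu, theta)
    while cdf < p and q < max_count:  # cap guards against rounding stalls
        q += 1
        cdf += nb2_pmf(q, mu, theta)
    return q


def empirical_coverage(ys, mus, thetas, level):
    """Share of observations falling inside the equal-tailed interval."""
    alpha = 1.0 - level
    hits = 0
    for y, mu, theta in zip(ys, mus, thetas):
        lower = nb2_quantile(alpha / 2.0, mu, theta)
        upper = nb2_quantile(1.0 - alpha / 2.0, mu, theta)
        hits += lower <= y <= upper
    return hits / len(ys)


# Toy check: with mu = theta = 1 the 90% interval is [0, 4].
coverage_90 = empirical_coverage([0, 4, 5], [1.0] * 3, [1.0] * 3, level=0.90)
```

Because counts are discrete, the true coverage probability of such an interval is at least the nominal level, so small over-coverage is expected even for a correctly specified model.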

Median interval width (sharpness).

Sharpness (conditional on calibration) is summarized by the median width of the $[eqn]$ -level interval:

[eqn]

where

[eqn]

Among similarly calibrated models, smaller $[eqn]$ indicates sharper and more informative predictive distributions [31].
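For concreteness, the sharpness summary can be computed from per-forecast interval bounds as follows. The sketch assumes the width is the upper bound minus the lower bound (the paper's exact definition is not recoverable from this text), and the function name is illustrative.

```python
from statistics import median


def median_interval_width(lowers, uppers):
    """Median of (upper - lower) across forecasts; smaller means sharper."""
    return median(u - l for l, u in zip(lowers, uppers))


# Toy bounds for three forecasts (widths 4, 18, 35) and for four forecasts.
width_odd = median_interval_width([0, 2, 5], [4, 20, 40])
width_even = median_interval_width([0, 2, 5, 1], [4, 20, 40, 20])
```

With an even number of forecasts the median averages the two middle widths, so half-integer values can arise even though the individual bounds are integer counts.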

Diebold-Mariano tests with HAC variance.

Pairwise forecast comparisons use the Diebold-Mariano (DM) test for equal predictive accuracy [18]. We define a loss based on the negative log score $[eqn]$ . For two competing models with losses $L_{1,t}$ and $L_{2,t}$ on aligned issue-target pairs, the loss differential is

[eqn]

The null hypothesis $[eqn]$ is tested using

[eqn]

where $[eqn]$ is a Newey-West heteroskedasticity-and-autocorrelation-consistent (HAC) variance estimator

[eqn]

with $[eqn]$ . We set the HAC bandwidth to $[eqn]$ to reflect serial correlation induced by overlapping h-step-ahead forecasts [32]. Two-sided tests use $[eqn]$ ; negative DM values favor model 1 (lower loss) and positive values favor model 2 [33].
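A compact sketch of the test statistic is given below. It assumes the loss differential is model 1 minus model 2, Bartlett (Newey-West) weights, and a standard normal reference distribution for the two-sided p-value. The truncation lag `max_lag` is left as an argument because the paper's bandwidth expression is not recoverable from this text; h − 1 is a common choice for h-step-ahead forecasts.

```python
import math


def dm_test(loss1, loss2, max_lag):
    """Diebold-Mariano statistic with a Newey-West long-run variance.

    Negative values favor model 1 (lower loss); positive values favor model 2.
    """
    d = [a - b for a, b in zip(loss1, loss2)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        # Sample autocovariance of the loss differential at lag k.
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    long_run_var = autocov(0)
    for k in range(1, max_lag + 1):
        weight = 1.0 - k / (max_lag + 1)  # Bartlett kernel
        long_run_var += 2.0 * weight * autocov(k)

    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided, normal reference
    return stat, p_value


# Toy losses: model 1 is never worse than model 2, so the statistic is negative.
stat, p_value = dm_test([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 5.0, 4.0], max_lag=0)
```

The Bartlett weights keep the variance estimate non-negative, but with aligned samples of a few dozen forecasts the normal approximation is rough and p-values near conventional thresholds deserve cautious reading.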

Randomized PIT histograms.

Calibration is also assessed via probability integral transform (PIT) diagnostics. For discrete predictive distributions, we use randomized PIT values

[eqn]

where $[eqn]$ are independent and $[eqn]$ with $[eqn]$ [34]. Under calibration, $[eqn]$ . We visualize the empirical distribution of $[eqn]$ using $[eqn]$ equal-width bins. Deviations from uniformity indicate: (i) U-shaped histograms (underdispersed forecasts), (ii) inverse-U (overdispersed), (iii) left-skew (systematic overforecasting), and (iv) right-skew (systematic underforecasting). We optionally supplement visual inspection with a uniformity test (e.g., Anderson-Darling), noting that power may be limited for small samples [35].
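The randomized PIT can be sketched as follows, assuming the standard construction u = F(y − 1) + v [F(y) − F(y − 1)] with v drawn uniformly on (0, 1) and F(−1) = 0. The NB2 mean/size parameterization, the number of bins, and the function names are assumptions for illustration.

```python
import math
import random


def nb2_pmf(y, mu, theta):
    """NB2 pmf with mean mu and size theta (assumed parameterization)."""
    return math.exp(
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def nb2_cdf(y, mu, theta):
    """P(Y <= y); returns 0 for y < 0."""
    return sum(nb2_pmf(k, mu, theta) for k in range(y + 1)) if y >= 0 else 0.0


def randomized_pit(y, mu, theta, rng):
    """Draw a value uniformly between F(y - 1) and F(y)."""
    lower = nb2_cdf(y - 1, mu, theta)
    upper = nb2_cdf(y, mu, theta)
    return lower + rng.random() * (upper - lower)


def pit_histogram(pits, n_bins):
    """Counts per equal-width bin on [0, 1]; a value of 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for u in pits:
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return counts


rng = random.Random(0)  # fixed seed so the randomization is reproducible
u_zero = randomized_pit(0, 1.0, 1.0, rng)  # lies in [0, 0.5] for mu = theta = 1
u_one = randomized_pit(1, 1.0, 1.0, rng)   # lies in [0.5, 0.75]
hist = pit_histogram([0.05, 0.5, 0.99, 1.0], n_bins=10)
```

Fixing the random seed matters in practice: with few forecasts, different randomization draws can visibly change the histogram.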

Results

We evaluated leakage-safe monthly probabilistic forecasts of dengue cases in Freetown, Sierra Leone, at horizons $[eqn]$ . The candidate models comprise NB-GLM, INGARCH-NB, Renewal-NB, a light climate-augmented NB-GLM variant (NB-GLM+Climate), a light climate-informed renewal variant (Renewal+Climate), and a probabilistic BiLSTM with a negative-binomial observation model (BiLSTM-NB). Performance is summarized using mean log score (primary; higher is better), empirical coverage of nominal 50% and 90% prediction intervals (PIs), and median PI widths (sharpness). Unless stated otherwise, headline comparisons use the global-aligned issue/target set shared across models within each horizon (n = 32 per horizon); additional pairwise-aligned results and model-wise (unaligned) summaries are reported in the Supporting Information.

Data overview and regime characterization

The cleaned monthly dengue series spans January 2015 to December 2024 (120 months, no missing months). The mean monthly incidence is 20.43 cases, and the variance is 331.73, indicating substantial overdispersion (variance-to-mean ratio = 16.24; Table 1). Zero-count months account for 18.33% of observations (22/120), consistent with intermittent transmission at monthly resolution.
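The summary figures above can be checked with a few lines of arithmetic. The inputs are the values quoted in the text; the variable names are illustrative.

```python
mean_cases = 20.43      # mean monthly cases reported in the text
variance = 331.73       # variance of monthly cases reported in the text
zero_months, n_months = 22, 120

variance_to_mean = variance / mean_cases        # a Poisson model would give 1
zero_share_pct = 100.0 * zero_months / n_months

print(f"variance-to-mean ratio: {variance_to_mean:.2f}")
print(f"zero-count months: {zero_share_pct:.2f}%")
```

A ratio this far above 1 is the motivation for negative binomial rather than Poisson observation models throughout the comparison.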

To assess performance under heterogeneous transmission intensity, we stratify evaluation targets into non-outbreak and outbreak regimes using a horizon-specific threshold $\mathrm{thr}_h$ applied to the realized target $y_{t+h}$. Specifically, a target month is labeled “outbreak” if $[eqn]$ , and “non-outbreak” otherwise. Thresholds are taken from the regime experiment and are similar across horizons ($\mathrm{thr}_1 = 33.00$, $\mathrm{thr}_2 = 32.50$, $\mathrm{thr}_3 = 32.25$). This split isolates high-incidence targets and supports interpretation of calibration and upper-tail behavior during intense transmission periods (Table 1).
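The regime split can be expressed as a simple rule. The sketch below assumes the strict inequality stated in the Table 5 caption (a target is an outbreak month when the realized count exceeds the threshold) and uses the threshold values quoted above; the names `THRESHOLDS` and `regime` are illustrative.

```python
# Horizon-specific thresholds quoted in the text.
THRESHOLDS = {1: 33.00, 2: 32.50, 3: 32.25}


def regime(y_target, horizon, thresholds=THRESHOLDS):
    """Label a realized target count as 'outbreak' or 'non-outbreak'."""
    return "outbreak" if y_target > thresholds[horizon] else "non-outbreak"


# A count of 33 is an outbreak target at h = 2 and h = 3 but not at h = 1.
labels = {h: regime(33, h) for h in (1, 2, 3)}
```

Because the label depends on the realized target, it is only available retrospectively and serves to stratify evaluation, not as a model input.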

Table 1: Dataset summary for monthly dengue cases in Freetown, Sierra Leone (cleaned series).

Main aligned probabilistic accuracy across horizons

Table 2 reports performance on the global-aligned evaluation set ($n_{\text{aligned}} = 32$ per horizon), ensuring a like-for-like comparison across all models. Across horizons, INGARCH-NB attains the best mean log score, indicating the strongest overall distributional accuracy under strict alignment. BiLSTM-NB is consistently competitive and shows high 90% PI coverage (reaching 100% at h = 3, which exceeds the nominal level), but it does not exceed INGARCH-NB in mean log score on the global-aligned set.

Table 2: Global-aligned probabilistic performance by horizon (higher mean log score is better).

The NB-GLM baseline undercovers markedly at the 90% level (53.1-62.5% across horizons), consistent with under-dispersed predictive distributions under strong overdispersion and changing dynamics. Adding the light climate covariates improves the mean log score relative to NB-GLM at h = 2 and h = 3, but the resulting forecasts remain substantially under-calibrated (Cover90 $[eqn]$ ) and extremely narrow (Width90 $[eqn]$ cases). Renewal-based baselines exhibit the opposite failure mode: Renewal-NB often attains near-nominal or above-nominal 90% coverage, but typically with wider intervals, which reduces sharpness and penalizes log score at longer horizons. The Renewal+Climate (light) variant becomes particularly diffuse at h = 2 (median Width90 = 191), indicating sensitivity of the climate-augmented renewal specification and overly conservative tails in this setting.

Fig 3 visualizes the mean log score ranking by horizon, while Figs 4 and 5 summarize the calibration-sharpness trade-offs consistent with Table 2. Additional diagnostic plots are provided in S1 Fig, S2 Fig and S3 Fig.

Fig 3. Global-aligned mean log score by model and horizon (higher is better). Barplots of the mean log score on the global-aligned set (n = 32 per horizon). INGARCH-NB ranks best across horizons, with BiLSTM-NB competitive; renewal-based variants are penalized by diffuse distributions, and GLM variants by under-dispersion.

Fig 4. Global-aligned 90% PI coverage by model and horizon (closer to 90% is better). Coverage of nominal 90% predictive intervals on the global-aligned set. GLM variants under-cover substantially; BiLSTM-NB attains very high coverage at longer horizons; INGARCH-NB maintains generally good calibration without extreme width inflation.

Fig 5. Global-aligned median 90% PI width by model and horizon (smaller is sharper). Median 90% PI widths highlight uncertainty inflation for renewal-based variants at longer horizons. GLM variants remain narrow but under-cover; INGARCH-NB offers a better calibration-sharpness trade-off.

Pairwise significance: Diebold-Mariano tests on aligned samples

To assess whether differences in mean log scores are statistically distinguishable, we apply Diebold-Mariano tests on pairwise-aligned samples, using a Newey–West HAC variance with bandwidth $[eqn]$ , a standard choice for h-step-ahead loss differentials. Pairwise-aligned samples can be larger than the global-aligned set because alignment is required only between the two models being compared (e.g., n = 63 or n = 69), rather than across all models simultaneously.

INGARCH-NB significantly outperforms NB-GLM at all horizons (p = 0.044 at h = 1; p = 0.036 at h = 2; p = 0.018 at h = 3), consistent with gains from explicitly modeling conditional-mean dynamics in overdispersed monthly counts. INGARCH-NB also significantly outperforms BiLSTM-NB across horizons ( $[eqn]$ ), indicating that, in this limited-sample monthly setting, the dynamic count model yields a higher average log predictive density than the deep sequence model. Conversely, BiLSTM-NB substantially improves log score relative to Renewal+Climate (light) at all horizons ( $[eqn]$ ), consistent with the renewal climate variant producing overly diffuse forecasts (notably at longer horizons) that are penalized under log scoring. Fig 6 summarizes the count of statistically significant wins by horizon (Table 3).

Table 3: Focused Diebold-Mariano tests on pairwise-aligned samples (log score differences). Positive mean difference favors Model 1.

Fig 6. Significant wins by model and horizon (DM tests, p < 0.05). Count of pairwise comparisons in which each model significantly improves mean log score over another model under HAC-robust DM testing. INGARCH-NB achieves the most consistent significant wins across horizons.

Calibration and sharpness of predictive intervals

Operational utility depends on joint calibration (coverage) and sharpness (interval width). The NB-GLM baselines are the sharpest, producing the narrowest 90% predictive intervals across horizons (median width $[eqn]$ –17 cases), but they markedly under-cover at the 90% level (about 53–62% for NB-GLM and 62–72% for NB-GLM+Climate on the global-aligned set), indicating systematic overconfidence. INGARCH-NB provides the best overall log-score accuracy while maintaining generally good calibration without extreme width inflation; this is most evident at h = 2, where 90% coverage reaches 96.9% with a relatively compact median 90% width of 32 cases.

BiLSTM-NB attains a high 90% coverage, reaching 100% at h = 3, but does so with wider intervals (median 90% width 69.5), consistent with more conservative upper-tail behavior. Renewal-based models illustrate a common failure mode in discrete probabilistic forecasting: achieving nominal (or near-nominal) coverage by inflating uncertainty. Renewal-NB produces substantially wider intervals at longer horizons (e.g., median 90% width 124.5 at h = 3), and Renewal+Climate (light) becomes highly diffuse at h = 2 (median 90% width 191.0), which is strongly penalized under log scoring and limits practical decision usefulness despite occasional adequate coverage. Fig 7 summarizes the accuracy-calibration-sharpness trade-off across horizons.

Fig 7. Global-aligned overview: accuracy, calibration, and width across horizons. Line summaries across horizons showing the accuracy-calibration-sharpness trade-off. INGARCH-NB is consistently strong in log score with generally good calibration and moderate widths; GLM variants under-cover; renewal variants inflate width at longer horizons.

Seasonality of skill: month-of-year log score patterns

To probe whether relative forecast skill varies across the calendar year, we computed month-of-year mean log scores on the aligned sample for INGARCH-NB and BiLSTM-NB (Figs 8–10). This analysis is descriptive because the per-month sample sizes are small (typically $[eqn]$ –6 aligned forecasts per month), so month-to-month fluctuations should not be over-interpreted. Nevertheless, a consistent seasonal structure is visible across horizons: both models achieve their best (least negative) log scores in late Q1/early Q2 (roughly months 3–4), while performance tends to deteriorate around mid-year (approximately months 6–8). This mid-year degradation is consistent with periods in which transmission intensity and dispersion may shift more abruptly, making distributional forecasting more challenging at monthly aggregations.

Fig 8. Seasonal log score patterns (aligned): BiLSTM-NB vs INGARCH-NB at h = 1. Month-of-year mean log scores on the aligned set (small per-month counts shown). Both models perform best around months 3–4 and deteriorate mid-year; INGARCH-NB is typically more stable across months.

Fig 9. Seasonal log score patterns (aligned): BiLSTM-NB vs INGARCH-NB at h = 2. Month-of-year comparison at h = 2 shows similar seasonal structure and mid-year difficulty; interpret descriptively due to small per-month sample sizes.

Fig 10. Seasonal log score patterns (aligned): BiLSTM-NB vs INGARCH-NB at h = 3. Month-of-year comparison at h = 3 shows persistent mid-year performance degradation; INGARCH-NB generally remains less variable than BiLSTM-NB.

Across months, INGARCH-NB appears more stable, with smaller swings in mean log score and fewer sharp deteriorations in mid-to-late year, whereas BiLSTM-NB shows occasional month-specific advantages (notably around months 3–4) but with greater variability. Given the limited counts per month, these patterns are best viewed as qualitative diagnostics that complement the globally aligned summaries rather than as definitive evidence of month-specific dominance.
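The month-of-year summary amounts to grouping per-forecast log scores by the calendar month of the target. The sketch below returns the per-month sample size alongside each mean so that small cells remain visible; the input layout and the function name are hypothetical.

```python
from collections import defaultdict


def month_of_year_means(records):
    """Map calendar month (1-12) to (mean log score, number of forecasts).

    records is an iterable of (target_month, log_score) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for month, score in records:
        totals[month][0] += score
        totals[month][1] += 1
    return {m: (s / n, n) for m, (s, n) in sorted(totals.items())}


# Toy scores: two forecasts target March, one targets July.
summary = month_of_year_means([(3, -3.0), (3, -3.4), (7, -4.1)])
```

Reporting the count with each mean makes it harder to over-read a month whose average rests on only a handful of forecasts.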

Climate signal contribution: the “light climate” experiment

To justify the minimal climate feature set, we evaluated a leakage-safe light climate design using only covariates available at the issue time under conservative timing: lag-1 precipitation, temperature, and relative humidity. Table 4 summarizes a within-family climate ablation for NB-GLM and renewal models on aligned evaluation months.

Table 4: Climate ablation (light climate; aligned on evaluation months). Light climate uses only lag-1 precipitation, temperature, and humidity to remain leakage-safe and deployable when real-time climate products are limited.

For NB-GLM, adding light climate features yields small improvements in mean log score, with only borderline evidence at h = 3 (DM p = 0.0978) and no statistically significant gains at the 5% level at any horizon (DM p = 0.312, 0.312, 0.098 for h = 1, 2, 3). This indicates that while lagged climate may carry some predictive signal, the restricted lag-1-only specification and the linear predictor are insufficient to translate it into reliably improved probabilistic accuracy under strict leakage control. For the mechanistic renewal model, the light climate fit does not improve performance: differences favor the non-climate renewal specification in direction at all horizons but are not statistically significant (DM p > 0.35), and Renewal+Climate forecasts are typically less sharp, suggesting that the additional climate forcing can destabilize tails (width inflation) without delivering commensurate gains in log-score accuracy. At a monthly cadence, the renewal specification is sensitive to kernel assumptions; mis-specification can manifest as tail inflation.

Outbreak vs non-outbreak performance: regime-stratified analysis

To assess robustness under heterogeneous transmission intensity, we stratified the aligned evaluation targets into non-outbreak and outbreak regimes using the horizon-specific thresholds $\mathrm{thr}_h$ (Table 5). Because the outbreak subset is small (typically $[eqn]$ per horizon), regime-specific rankings should be interpreted as descriptive rather than definitive; a few extreme months can materially shift averages.

Table 5: Regime-stratified probabilistic performance by horizon (aligned evaluation). Outbreak months are defined by $y_{t+h} > \mathrm{thr}_h$, where $\mathrm{thr}_h$ is the horizon-specific threshold used in the regime split. Tail-miss rate is the percentage of targets for which $y_{t+h}$ exceeds the upper bound of the 90% predictive interval.

Non-outbreak regime

In non-outbreak months, INGARCH-NB yields the best mean log score across horizons (h = 1: −3.52; h = 2: −3.31; h = 3: −3.32), with BiLSTM-NB close behind (Table 5). This pattern is consistent with conditional mean dynamics capturing most of the predictive signal when incidence is moderate. Calibration is acceptable for INGARCH-NB in non-outbreak months (90% coverage $[eqn]$ ), while BiLSTM-NB is more conservative (often reaching 100% empirical coverage at the nominal 90% level) but with wider intervals. GLM-based models remain sharp but under-cover in non-outbreak months, indicating persistent overconfidence even outside outbreaks.

Outbreak regime: accuracy-sharpness trade-offs dominate

During outbreaks, rankings differ, and interpretation hinges on the accuracy-sharpness trade-off. GLM-based models can achieve competitive mean log scores at $[eqn]$ with comparatively moderate widths (e.g., NB-GLM+Climate at h = 1: mean log score −3.59, 90% coverage = 83.3%, width90 = 23.0; Table 5). In contrast, INGARCH-NB and BiLSTM-NB often avoid tail misses in outbreaks primarily by issuing much wider intervals (e.g., at h = 1, width90 = 126.5 for INGARCH-NB and 89.5 for BiLSTM-NB), which reduces sharpness and can lower log score unless the realized count falls deep in the upper tail. Therefore, outbreak-month comparisons should not be judged on coverage alone: very high coverage can reflect interval inflation rather than well-targeted uncertainty.

Renewal models show instability under outbreaks

Renewal-NB and especially Renewal+Climate (light) exhibit the most extreme outbreak behavior, producing very large interval widths (e.g., Renewal-NB width90 = 363.0 at h = 1 and 564.0 at h = 2; Renewal+Climate width90 = 1061.0 at h = 1, 731.0 at h = 2, and 2563.5 at h = 3) alongside poor mean log scores (Table 5). The near-zero tail-miss rates in several outbreak cells are therefore not evidence of superior calibration; they largely reflect over-diffuse predictive distributions that sacrifice sharpness.
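The tail-miss rate defined in the Table 5 caption can be computed as sketched below. The function name and the toy numbers are illustrative and are not taken from the study's data.

```python
def tail_miss_rate(ys, upper_bounds):
    """Percentage of targets whose realized count exceeds the upper 90% PI bound."""
    misses = sum(y > u for y, u in zip(ys, upper_bounds))
    return 100.0 * misses / len(ys)


# Toy outbreak months: one forecaster is sharp, the other very diffuse.
realized = [40, 60, 35, 90]
sharp_upper = [50, 55, 45, 80]
diffuse_upper = [400, 600, 350, 900]

sharp_rate = tail_miss_rate(realized, sharp_upper)
diffuse_rate = tail_miss_rate(realized, diffuse_upper)
```

In this toy case the diffuse forecaster records no tail misses only because its upper bounds are roughly ten times the realized counts, which is the pattern the text cautions against reading as good calibration.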

Visual summary of regime effects

Fig 11 summarizes the mean log score by horizon and regime and highlights how some models degrade disproportionately in outbreaks. Complementary horizon-wise regime dashboards (S4-S6 Figs) break down (i) mean log score, (ii) 50%/90% coverage, (iii) interval widths, and (iv) upper-tail miss rates, reinforcing that outbreak performance must be assessed jointly on calibration and sharpness rather than coverage alone. Additional diagnostic plots are provided in S3 Fig, S4 Fig, and S5 Fig.

Fig 11. Regime-stratified mean log score vs horizon. Points show mean log score by model, horizon, and regime (non-outbreak vs outbreak). Higher (less negative) is better. The outbreak subset is small, so dispersion across models is expected.

Generalizability across time: era-based evaluation (2021–2024)

To probe temporal robustness under potential distribution shift, we performed an era-based evaluation by restricting scoring to targets in 2021–2024 and recomputing aligned probabilistic metrics across all models and horizons. Because the light-climate models are available only on a sparser set of issue dates, the aligned intersection for this era is smaller than in the full-period analysis, yielding $n_{\text{aligned}} = 23$ at h = 1 and $n_{\text{aligned}} = 24$ at h = 2, 3 common forecast cases across all models. Table 6 summarizes log-score accuracy (higher is better; values are negative because they are log probabilities) together with interval calibration and sharpness diagnostics.

Table 6: Era-based aligned performance on targets in 2021–2024. Aligned intersection across all models in this era yields $n_{\text{aligned}} = 23$ at h = 1 and $n_{\text{aligned}} = 24$ at h = 2, 3. A higher mean log score indicates better probabilistic accuracy. Coverage is empirical PI coverage; widths are median PI widths.

Across 2021–2024, the main conclusions persist: INGARCH-NB and BiLSTM-NB remain the strongest performers, while GLM-based models yield the narrowest intervals but tend to under-cover, indicating overconfidence. However, restricting to this era induces some horizon-specific reordering among the top methods. BiLSTM-NB attains the best mean log score at h = 1 (mean −3.586) and h = 3 (mean −3.559), whereas INGARCH-NB is best at h = 2 (mean −3.550) and remains comparatively sharp at longer horizons (e.g., width90 = 30.5 at h = 3). Renewal-NB retains high 90% coverage in this era (100% at h = 1 and h = 3, and 95.8% at h = 2, all above the nominal level) but is less sharp, with wider median predictive intervals, especially at h = 3 (width90 = 130.0).

Pairwise Diebold-Mariano (DM) tests on the same 2021–2024 aligned subset (Table 7) indicate that some differences remain statistically distinguishable despite the reduced sample size. At h = 2, INGARCH-NB outperforms BiLSTM-NB with strong evidence (DM $p < 10^{-4}$). At h = 1, BiLSTM-NB outperforms Renewal+Climate (light) (DM p = 0.0106). At h = 3, BiLSTM-NB outperforms Renewal-NB (DM p = 0.0037), and INGARCH-NB slightly outperforms Renewal-NB (DM p = 0.0481). Other pairwise differences are directionally consistent but not statistically significant at conventional thresholds, which is expected given the smaller aligned sample induced by the climate-available subset.

Table 7: Focused Diebold-Mariano tests on 2021-2024 aligned subset. Positive DM t indicates Model 1 has a higher log score (better).

Fig 12 visualizes the mean log score ranking by horizon in 2021–2024, while Fig 13 summarizes calibration and sharpness on the same aligned subset.

Fig 12. Era-based predictive accuracy (2021–2024). Mean log score by model and horizon on the 2021–2024 aligned subset (higher is better).

Fig 13. Era-based calibration and sharpness (2021–2024). Empirical coverage (50%, 90%) and median predictive-interval widths on the 2021–2024 aligned subset.

Discussion

In resource-constrained settings, forecasts of infectious disease burden increasingly guide preparedness, inform vector-control timing, and support risk communication. Using leakage-safe monthly dengue surveillance from Freetown, Sierra Leone (2015–2024), we evaluated a spectrum of probabilistic forecasting approaches under a harmonized expanding-window, rolling-origin design: a statistical regression baseline (NB-GLM), a dynamic count model (INGARCH-NB), mechanistic renewal models (Renewal-NB and a light climate-informed variant), and a deep sequence model with a negative binomial output (BiLSTM-NB). Our analysis yielded three findings with direct relevance to applied public health. First, under strict global alignment across all models, INGARCH-NB delivered the strongest overall distributional accuracy (highest mean log score) across horizons $[eqn]$ , suggesting that parsimonious conditional mean dynamics can be highly effective at a monthly cadence. Second, BiLSTM-NB was consistently competitive and exhibited strong reliability at the 90% level (notably reaching 100% coverage at h = 3 in the aligned set) but achieved this with wider uncertainty, reflecting more conservative tail behavior. Third, renewal-based specifications were sensitive in this setting: while Renewal-NB often achieved high empirical coverage, it did so with markedly inflated interval widths at longer horizons, and the light climate-informed renewal variant could become extremely diffuse, leading to poor log scores and limited operational sharpness.

Implications for operational decision-making.

Public health programs balance near-term responsiveness (clinical readiness, targeted mobilization) with medium-range planning (vector-control campaigns, community engagement ahead of seasonal upswings). Our results suggest that selecting a single “best” model is inadvisable without considering the decision horizon and the calibration-sharpness trade-off. For short horizons, INGARCH-NB provides strong probabilistic accuracy with moderate interval widths and generally good calibration, making it well-suited for operational triggers that depend on distributional accuracy rather than point forecasts alone. For longer horizons, BiLSTM-NB offers a high 90% reliability but at the cost of wider uncertainty; this may be preferable when the cost of missing high-incidence events outweighs the drawbacks of issuing conservative risk bands. In practice, horizon-specific selection rules or lightweight ensembles can exploit these complementary strengths, but action thresholds (e.g., triggers based on predictive quantiles) should be calibrated to empirical coverage rather than assumed nominal performance.

Climate, seasonality, and feasibility.

We deliberately restricted exogenous inputs to a leakage-safe “light climate” design (lag-1 precipitation, temperature, and relative humidity) to reflect realistic conditions where real-time climate products can be delayed or inconsistent. Under these constraints, climate augmentation yielded only modest improvements for NB-GLM and did not improve renewal forecasts; Diebold-Mariano (DM) tests did not show statistically significant gains at the 5% level. This does not imply that climate is unimportant for dengue transmission. Rather, it suggests that (i) the restricted, lag-only feature set; (ii) the linear structure of the NB-GLM; and (iii) the sensitivity of renewal calibration to specification choices may limit the realized benefit at a monthly resolution. For deployment elsewhere, richer meteorological nowcasts and entomological covariates may improve skill; however, this requires verifying their latency and availability to avoid leakage and explicitly handling missingness and backfilling.

Calibration, sharpness, and reliability.

A key operational lesson is that coverage alone can be misleading. The NB-GLM baselines produced the narrowest intervals but severely under-covered, indicating overconfidence that may bias decisions toward under-preparedness. Conversely, renewal-based forecasts sometimes achieved high coverage largely by inflating uncertainty, which degrades the log score and reduces the practical value of forecasts for targeting interventions. INGARCH-NB and BiLSTM-NB occupied a more useful middle ground: INGARCH-NB combined strong log scores with generally adequate calibration and comparatively compact intervals (especially at h = 2), while BiLSTM-NB emphasized tail reliability at longer horizons. These patterns underscore the importance of monitoring calibration jointly with sharpness (e.g., coverage alongside interval width and PIT diagnostics) when forecasting is used to trigger resource allocation.

Heterogeneity and outbreak conditions.

Regime-stratified analyses highlight that assessing performance during outbreak months is difficult at a monthly cadence due to the small and highly influential nature of the outbreak subset. Nevertheless, the results illustrate a consistent phenomenon: some models avoid upper-tail misses during outbreaks primarily by issuing very wide predictive intervals, a strategy that does not necessarily indicate superior calibration. During non-outbreak months, INGARCH-NB remained the strongest in mean log score across horizons, while in outbreaks, rankings were more variable and tightly coupled to the accuracy-sharpness trade-off. This motivates a conservative interpretation of outbreak-specific rankings and suggests that prospective use should incorporate safeguards (e.g., horizon-specific uncertainty monitoring and explicit rules for when to trigger heightened preparedness).

Interpretability versus performance.

Mechanistic renewal models remain appealing because they support epidemiological interpretation through $R_t$-like constructs. However, at a monthly resolution, the assumed serial kernel and aggregation choices can materially affect both mean predictions and uncertainty. Furthermore, misspecification can induce diffuse tails that are heavily penalized by proper scoring rules. A pragmatic compromise for early warning is dual reporting: operational probabilistic forecasts derived from the best-performing statistical or deep learning model, paired with mechanistic summaries (e.g., renewal-based $R_t$ trajectories) for interpretability and situational awareness, treating the latter cautiously when calibration diagnostics indicate instability.

Comparison with prior work.

The horizon-dependent trade-offs we observe align with broader forecasting evidence: autoregressive count models often excel at shorter leads; flexible sequence models can maintain reliability as horizons extend; and mechanistic models require careful specification and appropriately resolved data to be competitive. Importantly, our evaluation design emphasized aligned issue-target comparisons and leakage controls, reducing the risk of overstating gains from exogenous covariates or complex architectures.

Strengths and limitations.

Strengths of this study include a unified probabilistic evaluation (using the mean log score as the primary metric), explicit calibration and sharpness diagnostics, leakage-safe handling of climate covariates, and aligned backtests that support fair comparisons. However, limitations are notable. First, results are based on a single city and monthly data; generalizability across settings, reporting practices, and spatial heterogeneity remains to be established. Second, the light climate design may underutilize environmental information, but it reflects a deliberate feasibility constraint. Third, we did not explicitly model reporting delays or structural breaks; operational systems may benefit from adaptive schemes (e.g., change-point detection or robust retraining triggers). Fourth, the outbreak subset in regime stratification is small, so outbreak-specific conclusions should be considered descriptive.

Guidance for scale-up.

For agencies considering implementation, we suggest: (1) adopting horizon-aware deployment (e.g., INGARCH-NB as a strong default, and BiLSTM-NB when tail reliability at longer horizons is prioritized); (2) synchronizing forecast issuance with operational decision calendars; (3) monitoring calibration online via rolling coverage and PIT dashboards and retraining when deviations persist; (4) setting action thresholds using retrospective empirical coverage to mitigate overconfidence; and (5) auditing climate feed latency and backfill behavior before expanding exogenous inputs. These steps align methodological rigor with institutional capacity and support equitable uptake.
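The online calibration monitoring suggested in item (3) could take the form of a rolling coverage check, sketched below. The window length, the function name, and the hit sequence are assumptions for illustration; the paper does not specify a monitoring rule.

```python
from collections import deque


def rolling_coverage(hits, window):
    """Rolling share of forecasts whose realized count fell inside the PI.

    hits is an iterable of booleans in issue order; one value is returned for
    each step once the window is full.
    """
    buffer = deque(maxlen=window)
    coverage = []
    for hit in hits:
        buffer.append(1 if hit else 0)
        if len(buffer) == window:
            coverage.append(sum(buffer) / window)
    return coverage


# Toy sequence of 90% PI hits with a 2-month window (window chosen for brevity).
recent = rolling_coverage([True, True, False, True], window=2)
```

In practice the window would need to be long enough that a persistent gap from the nominal level is distinguishable from noise, which at monthly cadence implies slow detection.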

Future directions.

Three methodological avenues appear promising: (i) hybrid renewal-RNN models that retain mechanistic interpretability while learning residual structure; (ii) probabilistic ensembling across model classes to leverage complementary strengths; and (iii) hierarchical sharing across districts to improve data efficiency and spatial generalization. Substantively, integrating vector surveillance, mobility proxies, and high-resolution climate nowcasts could enhance predictive skill, provided leakage safeguards and missing-data policies remain central. Finally, prospective evaluations (including silent trials and decision-impact studies) should accompany rollout to verify that forecast use leads to improved outcomes without exacerbating inequities.

In summary, leakage-safe probabilistic dengue forecasts at a monthly cadence can be operationally useful, but model choice should be guided by the decision horizon and the calibration-sharpness trade-off, rather than by accuracy metrics alone. A horizon-aware portfolio that prioritizes strong distributional accuracy at shorter horizons and reliable tails at longer ones offers a practical path for early warning in settings like Freetown, with broader validation and responsible systems integration representing the critical next steps.

Conclusion

We developed and compared leakage-safe probabilistic dengue forecasting models for Freetown, Sierra Leone (2015–2024) at a monthly cadence. Our study spanned statistical count models (NB-GLM, INGARCH-NB), a mechanistic renewal model (Renewal-NB), and a deep sequence model with a negative binomial output (BiLSTM-NB), evaluated under an expanding-window, rolling-origin design. Using aligned evaluation sets and proper scoring rules, we found that INGARCH-NB achieved the strongest overall distributional accuracy on the global aligned set across horizons $[eqn]$ . Meanwhile, BiLSTM-NB remained competitive, delivering particularly reliable 90% predictive interval coverage at longer horizons, albeit with wider intervals. In contrast, NB-GLM variants tended to be overconfident (under-covered), whereas renewal-based specifications attained nominal coverage primarily through uncertainty inflation, which reduced sharpness and penalized log scores. A leakage-safe “light climate” design, incorporating lag-1 precipitation, temperature, and humidity, yielded modest, model-dependent improvements for NB-GLM, though these were not statistically significant at conventional levels and did not improve renewal forecasts.

Operationally, these findings support a horizon-aware forecasting strategy: INGARCH-NB serves as a strong default for near- and medium-term planning where distributional accuracy and moderate sharpness are required, while BiLSTM-NB offers a complementary option when conservative tail reliability is prioritized at longer horizons. Study limitations include the monthly temporal resolution, the focus on a single urban setting, and the omission of spatial structure and immunity or serotype dynamics. Future work should evaluate higher-frequency data, multi-city transferability, hierarchical and hybrid (mechanistic-learning) models, and prospective real-time pipelines linked to explicit public health decision triggers. Overall, this study demonstrates that principled probabilistic forecasting with leakage controls and aligned evaluation can provide actionable, uncertainty-aware dengue guidance for public health practice.

Supporting information

S1 Fig. PIT histograms by model and horizon (aligned evaluation). Probability integral transform (PIT) histograms for aligned forecasts by model and horizon. Deviations from uniformity indicate miscalibration (e.g., over- or under-dispersion and systematic bias). (TIF)

S2 Fig. Model performance heatmap (aligned evaluation). Displays mean log scores across models and horizons. (TIF)

S3 Fig. Model performance heatmap (annotations). Numerical overlays represent empirical 90% predictive-interval coverage for calibrated uncertainty assessment. (TIF)

S4 Fig. Regime dashboard at horizon $h = 1$ (aligned evaluation). Diagnostics stratified by regime (non-outbreak vs outbreak) for 1-month-ahead targets: (i) mean log score (higher/less negative is better), (ii) empirical coverage of nominal 50% and 90% predictive intervals, (iii) median predictive-interval widths (50%, 90%), and (iv) upper-tail miss rate (percentage of targets exceeding the upper 90% PI bound). Outbreak months are defined by $[eqn]$ with $\mathrm{thr}_1 = 33.00$. The outbreak subset is small; interpret descriptively and jointly with interval widths (high coverage may reflect diffuse forecasts). (TIF)

S5 Fig. Regime dashboard at horizon $h = 2$ (aligned evaluation). Same diagnostics as S4 Fig for 2-month-ahead targets, with outbreaks defined by $[eqn]$ and $\mathrm{thr}_2 = 32.50$. Highlights horizon-dependent changes in calibration and sharpness under outbreaks. (TIF)

S6 Fig. Regime dashboard at horizon $h = 3$ (aligned evaluation). Same diagnostics as S4 Fig for 3-month-ahead targets, with outbreaks defined by $[eqn]$ and $\mathrm{thr}_3 = 32.25$. At longer horizons, sharpness differences can be substantial; near-zero tail-miss rates during outbreaks may coincide with excessively wide intervals. (TIF)

S1 Table. Hyperparameter settings for all models. Summary of final model configurations used in the main experiments (e.g., GLM covariates/penalty if any, INGARCH order and link, renewal kernel and seasonal Rt specification, BiLSTM architecture/training settings, and calibration settings). Provided in a separate upload. (DOCX)

S1 Data. De-identified monthly dengue cases and climate aggregates (2015–2024). Available at: https://doi.org/10.34740/kaggle/dsv/13257213 (CSV)
