# Proof-of-Concept Machine Learning Framework for Arboviral Disease Classification Using Literature-Derived Synthetic Data: Methodological Development Preceding Clinical Validation

**Authors:** Elí Cruz-Parada, Guillermina Vivar-Estudillo, Laura Pérez-Campos Mayoral, María Teresa Hernández-Huerta, Alma Dolores Pérez-Santiago, Carlos Romero-Diaz, Eduardo Pérez-Campos Mayoral, Iván A. García Montalvo, Lucia Martínez-Martínez, Héctor Martínez-Ruiz, Idarh Matadamas, Miriam Emily Avendaño-Villegas, Margarito Martínez Cruz, Hector Alejandro Cabrera-Fuentes, Aldo-Eleazar Pérez-Ramos, Eduardo Lorenzo Pérez-Campos, Carlos Mauricio Lastre-Domínguez

PMC · DOI: 10.3390/healthcare14020247 · 2026-01-19

## TL;DR

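A proof-of-concept machine learning framework classifies arboviral diseases (Dengue, Zika, and Chikungunya) from literature-derived synthetic symptom data, with Influenza as a negative control. The best model, a Narrow Neural Network, reached an accuracy of 0.92 and an AUC above 0.98.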

## Contribution

A novel proof-of-concept ML framework for arboviral disease classification is proposed and evaluated on synthetic data, as a methodological step preceding clinical validation.

## Key findings

- The synthetic dataset aligns with PAHO guidelines and mirrors real-world arboviral databases.
- The Narrow Neural Network model achieved high accuracy (0.92) and AUC (above 0.98) in classifying arboviral diseases.
- The model reliably distinguishes Dengue from Influenza but shows slightly lower performance between Zika and Chikungunya.

## Abstract

What are the main findings?

- Extraction and selection of features from 67 symptoms using binary coding.
- Classification models for arboviral diseases built with several machine learning and deep learning methods.

What are the implications of the main findings?

- Rigorous statistical analysis of the data, using Odds Ratio and Chi-square, identifies the symptoms most prevalent in each arboviral disease.
- Performance is evaluated with F1-score, accuracy, precision, sensitivity, specificity, AUC-ROC, and Cohen’s kappa.
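
As an illustration of the Odds Ratio and Chi-square analysis mentioned above, the following is a minimal sketch for a single binary-coded symptom against a single disease label. It is not the authors' code: the function names, the 0.5 continuity correction for empty cells, and the toy data are all assumptions made for this example, and the counts are invented rather than taken from the paper.

```python
import math


def contingency_counts(symptom, disease):
    """Return (a, b, c, d) for two equal-length 0/1 sequences.

    a: symptom present, disease present
    b: symptom present, disease absent
    c: symptom absent,  disease present
    d: symptom absent,  disease absent
    """
    if len(symptom) != len(disease):
        raise ValueError("symptom and disease must have the same length")
    a = b = c = d = 0
    for s, y in zip(symptom, disease):
        if s and y:
            a += 1
        elif s:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c); adds 0.5 to every cell if any cell is zero."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no Yates correction) and its p-value for 1 degree of freedom."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a whole row or column is empty: no association testable
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 d.o.f. is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical toy data, NOT taken from the paper: 40 records with the
# disease and 40 without; the symptom is coded 1 (present) or 0 (absent).
has_disease = [1] * 40 + [0] * 40
has_symptom = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30

a, b, c, d = contingency_counts(has_symptom, has_disease)
or_value = odds_ratio(a, b, c, d)
chi2, p = chi_square_2x2(a, b, c, d)
print(f"OR={or_value:.2f}, chi2={chi2:.2f}, p={p:.2e}")
```

Applied to a 67-column binary matrix, the same three functions would be called once per symptom and disease pair. In that setting a correction for multiple comparisons would likely be needed, which this sketch does not attempt.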

**Background/Objectives:** Arboviral diseases share common vectors, geographic distribution, and symptoms. Developing Machine Learning diagnostic tools for co-circulating arboviral diseases faces data-scarcity challenges. This study aimed to demonstrate that a proof of concept built on synthetic data can establish computational feasibility and guide future real-world validation efforts.

**Methods:** We assembled a synthetic dataset of 28,000 records, with 7000 for each of four classes: Dengue, Zika, and Chikungunya, plus Influenza as a negative control. These records were obtained from the existing literature. A binary matrix with 67 symptoms was created for detailed statistical analysis using Odds Ratios, Chi-Square, and symptom-specific conditional prevalence to validate the clinical relevance of the simulated data. This dataset was used to train and evaluate several algorithms, including Multi-Layer Perceptron (MLP), Narrow Neural Network (NN), Quadratic Support Vector Machine (QSVM), and Bagged Tree (BT), with multiple performance metrics: accuracy, precision, sensitivity, specificity, F1-score, AUC-ROC, and Cohen’s kappa coefficient.

**Results:** The dataset aligns with the PAHO guidelines. Similar findings are observed in other arboviral databases, confirming the validity of the synthetic dataset. Performance was strong across all evaluated metrics. The NN model achieved an overall accuracy of 0.92 and an AUC above 0.98, with precision, sensitivity, and specificity values exceeding 0.85, and an average Uniform Cohen’s Kappa of 0.89. It reliably distinguished Dengue from Influenza, with slightly lower performance when separating Zika from Chikungunya.

**Conclusions:** These models could accelerate early diagnosis of arboviral diseases by leveraging encoded symptom features for Machine Learning and Deep Learning approaches, serving as a support tool in regions with limited healthcare access without replacing clinical medical expertise.
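
To make the evaluation metrics named in the abstract concrete, the sketch below derives accuracy, per-class precision, sensitivity, specificity, F1-score, and Cohen’s kappa from a multi-class confusion matrix using a one-vs-rest reading of each class. The confusion matrix is invented for illustration and does not reproduce the paper's results. The function names are assumptions, and AUC-ROC is omitted because it needs predicted scores rather than hard labels.

```python
def accuracy(cm):
    """Fraction of records on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement beyond what the row/column marginals predict by chance."""
    total = sum(sum(row) for row in cm)
    p_observed = accuracy(cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(row[j] for row in cm) for j in range(len(cm))]
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    if p_expected == 1.0:
        return 0.0  # degenerate case: every record in one class
    return (p_observed - p_expected) / (1.0 - p_expected)


def per_class_metrics(cm, labels):
    """One-vs-rest precision, sensitivity, specificity, and F1 for each class.

    Rows of cm are true classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in cm)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(row[i] for row in cm) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results[label] = {
            "precision": precision,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "f1": f1,
        }
    return results


# Hypothetical confusion matrix (rows = true, columns = predicted).
# These counts are made up for the example and are NOT the paper's results.
labels = ["Dengue", "Zika", "Chikungunya", "Influenza"]
cm = [
    [9, 1, 0, 0],
    [1, 7, 2, 0],
    [0, 2, 8, 0],
    [0, 0, 0, 10],
]

overall_accuracy = accuracy(cm)
kappa = cohen_kappa(cm)
metrics = per_class_metrics(cm, labels)

print(f"accuracy={overall_accuracy:.2f}, kappa={kappa:.2f}")
for name, values in metrics.items():
    print(name, {k: round(v, 3) for k, v in values.items()})
```

In this toy matrix the off-diagonal counts sit mostly between Zika and Chikungunya, which is the kind of pattern the abstract describes qualitatively. Reporting kappa alongside accuracy helps here because kappa discounts the agreement that balanced class sizes alone would produce.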

## Linked entities

- **Diseases:** Dengue (MONDO:0005502), Zika (MONDO:0018661), Chikungunya (MONDO:0017941), Influenza (MONDO:0005812)

## Full-text entities

- **Diseases:** Dengue (MESH:D003715), Influenza (MESH:D007251), Arboviral Disease (MESH:D004671), Zika (MESH:D000071243)

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12840642/full.md

---
Source: https://tomesphere.com/paper/PMC12840642