# Alignment-Free Machine Learning Serotype Classification of the Dengue Virus

**Authors:** Vladimir Gajdov, Isidora Prosic, Mihaela Kavran, Filip Bosilkov, Tamas Petrovic, Jelena Konstantinov, Gospava Lazic

PMC · DOI: 10.3390/v18030280 · Viruses · 2026-02-25

## TL;DR

This paper introduces a fast, alignment-free machine learning method for accurately classifying dengue virus serotypes, even with short or error-prone sequences.

## Contribution

A novel alignment-free Random Forest framework using 3-mer composition features for accurate and scalable dengue serotyping.
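As a rough illustration of the feature representation, the sketch below computes 64 normalized 3-mer frequencies for one sequence. It is not the authors' code. The names `kmer_frequencies`, `KMERS`, and `INDEX` are hypothetical, and the masking rule (skip any 3-mer window that contains a character other than A, C, G, or T) is an assumption about what "ambiguity masking" means here.

```python
from itertools import product

K = 3
# All 4^3 = 64 possible 3-mers over the unambiguous DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return 64 normalized 3-mer frequencies for one nucleotide sequence.

    Assumption: windows containing any non-ACGT character (N, R, Y, ...)
    are skipped rather than resolved, so they do not contribute counts.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    counts = [0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:  # None means the window held an ambiguous base
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        # No valid window (sequence too short or fully ambiguous).
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


features = kmer_frequencies("ACGTNACG")
```

In this example the windows `GTN`, `TNA`, and `NAC` are skipped, so the frequencies are computed over the three remaining windows (`ACG` twice, `CGT` once). A fixed-length vector like this could then be passed to any standard classifier regardless of the input sequence length.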

## Key findings

- The model achieved near-perfect accuracy and macro-F1 scores on internal test sets.
- It maintained 100% accuracy on strictly independent external datasets.
- The method is robust to sequence truncation and ambiguous nucleotides.
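For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so each of the four serotypes counts equally regardless of how many sequences it has. The sketch below shows the calculation on made-up labels. The function name `macro_f1` and the example labels are illustrative and are not taken from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


serotypes = ["DENV-1", "DENV-2", "DENV-3", "DENV-4"]
y_true = ["DENV-1", "DENV-1", "DENV-2", "DENV-3", "DENV-4"]
y_pred = ["DENV-1", "DENV-2", "DENV-2", "DENV-3", "DENV-4"]  # one toy error
score = macro_f1(y_true, y_pred, serotypes)
```

With one DENV-1 sequence misassigned to DENV-2, both of those classes score 2/3 and the other two score 1, giving a macro-F1 of 5/6, slightly above the plain accuracy of 4/5 in this toy case.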

## Abstract

Dengue virus (DENV) serotyping is essential for epidemiological surveillance, clinical risk assessment, and vaccine evaluation, as the four dengue serotypes differ in pathogenicity, immune interactions, and population dynamics. Existing subtyping methods largely rely on sequence alignment and phylogenetic inference, which can be computationally intensive and unreliable for short, fragmented, or error-prone sequences commonly generated in diagnostic and surveillance settings. There is a need for fast, alignment-free serotyping approaches that maintain high accuracy across heterogeneous sequence lengths while remaining scalable, transparent, and suitable for real-world diagnostic inputs. We demonstrate that compact 3-mer composition features are sufficient for highly accurate dengue virus serotyping when coupled with a lineage-aware Random Forest classification framework. Using 64 normalized 3-mer frequency features per sequence with ambiguity masking and enforcing strict cluster-aware validation at both 99% and 95% nucleotide identity thresholds, our approach achieved near-perfect accuracy and macro-F1 scores on held-out internal test sets. To further ensure independence, external validation datasets were filtered to remove exact sequence matches and any sequences sharing ≥99% or ≥95% nucleotide identity with internal data. On these strictly independent external datasets, the model maintained 100% accuracy and macro-F1 performance, confirming robust generalization beyond database redundancy. Robustness analyses showed stable performance under contiguous sequence truncation down to 300 bp and in the presence of ambiguous nucleotides, indicating resilience to realistic diagnostic inputs. These results demonstrate that a lightweight, alignment-free machine learning approach can rival alignment-dependent methods while maintaining strict lineage-aware evaluation controls. The proposed framework combines high predictive accuracy, probabilistic reliability, computational efficiency, and reproducible validation design, making it well suited for large-scale genomic surveillance, rapid pre-screening, and diagnostic decision-support applications.
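The cluster-aware validation described in the abstract can be pictured as a grouped split: every sequence in an identity cluster goes to the same side of the train/test boundary, so near-duplicates cannot leak across it. The sketch below is one possible way to do this and is not the authors' procedure. It assumes cluster labels (for example, at the 99% or 95% identity threshold) have already been produced by a separate clustering step, and the name `cluster_aware_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def cluster_aware_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sequence indices so that no cluster spans train and test.

    cluster_ids[i] is the (precomputed, assumed) identity-cluster label
    of sequence i. Whole clusters are assigned to the test set until it
    holds roughly test_fraction of all sequences.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    clusters = sorted(members)  # deterministic order before shuffling
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        if len(test) < target:
            test.extend(members[cid])
        else:
            train.extend(members[cid])
    return sorted(train), sorted(test)


toy_clusters = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5"]
train_idx, test_idx = cluster_aware_split(toy_clusters, test_fraction=0.2, seed=0)
```

Because clusters are assigned whole, the realized test fraction can overshoot the target when a large cluster is drawn. A real pipeline would likely also stratify by serotype, which this sketch omits.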

## Linked entities

- **Diseases:** dengue (MONDO:0005502)
- **Species:** Dengue virus (taxon 12637)

## Full-text entities

- **Species:** Dengue virus (no rank) [taxon 12637]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13030380/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13030380/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/PMC13030380/full.md

---
Source: https://tomesphere.com/paper/PMC13030380