# Evaluating machine learning approaches for host prediction using H3 influenza genomic data

**Authors:** Hoc Tran, Olaf Berke, Nicole Ricker, Zvonimir Poljak, Victor Huber

PMC · DOI: 10.1371/journal.pone.0336142 · PLOS One · 2025-11-05

## TL;DR

This study uses machine learning on all eight segments of H3 influenza genomes to accurately predict host species and identify cross-species transmission patterns.

## Contribution

The novelty lies in combining all eight IAV segments with machine learning for precise host prediction and transmission analysis.

## Key findings

- Models achieved high accuracy (0.995–0.997) and κ values (0.984–0.990) in host prediction across all eight segments.
- Misclassified sequences with high predicted probabilities were linked to between-species transmission events.
- In case studies (canine H3N8, swine H3N2 2010.2, and duck H3), patterns in the predicted class probabilities were consistent with the literature for both correctly and incorrectly classified sequences.
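
The two metrics reported above, overall accuracy and Cohen's κ, can be computed directly from true and predicted host labels. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `accuracy_and_kappa` and any labels passed to it are hypothetical.

```python
from collections import Counter


def accuracy_and_kappa(true_labels, predicted_labels):
    """Return (overall accuracy, Cohen's kappa) for two equal-length label lists.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(true_labels)

    # Observed agreement: fraction of sequences whose host was predicted correctly.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    # Chance agreement: for each class, the product of its marginal frequencies
    # among the true labels and among the predicted labels.
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    # Kappa is undefined when chance agreement is already perfect
    # (a single class in both lists), so return NaN in that case.
    if expected == 1.0:
        return observed, float("nan")

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa
```

κ corrects accuracy for the agreement expected by chance, which matters when host classes are imbalanced: a model that always predicts the majority host can reach high accuracy while κ stays near zero.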

## Abstract

H3 influenza A viruses (IAV) have been shown to frequently cross the species barrier, which can be an important factor in sustained transmission and spread. Machine learning methods have been widely explored for host prediction of IAV using genomic data; however, this is often done using data from only one of the eight IAV segments or by using all available IAV data to predict broad categories of hosts.

The objective of this study was to combine machine learning algorithms with H3 IAV sequence data from all eight segments to train predictive machine learning models for distinct host prediction and validate model performance.

For each of the eight IAV genome segments, models were trained on features derived from both k-mers and amino acid properties, using machine learning algorithms that included random forest and XGBoost. Models were then validated on a test dataset by analyzing the predicted class probabilities, and subsequently used to investigate between-species transmission patterns in case studies including canine H3N8, swine H3N2 2010.2, and duck H3 sequences.
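
To make the k-mer representation concrete, the sketch below turns one nucleotide sequence into a fixed-length vector of k-mer counts that a classifier such as random forest or XGBoost could consume. It is an illustration under stated assumptions, not the authors' pipeline: the summary does not give the value of k, whether k-mers were built from nucleotides or amino acids, or how ambiguous bases were handled. The function name `kmer_counts`, the default `k=3`, and the `ACGT` alphabet are all assumptions.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Return k-mer counts for `sequence` as a fixed-length list.

    Hypothetical helper: k and the alphabet are assumed values, not the
    paper's settings. The vector has len(alphabet) ** k entries, ordered
    lexicographically by the order of symbols in `alphabet`.
    """
    sequence = sequence.upper()

    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    # Fix the feature order so that every sequence maps to the same columns.
    vocabulary = ["".join(symbols) for symbols in product(alphabet, repeat=k)]

    # Windows containing symbols outside the alphabet (for example N) are not
    # in the vocabulary and are therefore left out of the vector.
    return [counts[kmer] for kmer in vocabulary]
```

Because the vocabulary is fixed in advance, sequences of different lengths from the same segment yield vectors of the same dimension, which is what lets one model per segment be trained on a shared feature matrix.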

Models demonstrated strong performance in host prediction across all eight segments on the test dataset, with overall accuracies ranging from 0.995 to 0.997 and κ (kappa) values ranging from 0.984 to 0.990. Misclassified test dataset sequences with high predicted probabilities (> 90%) were checked against the available literature and were frequently found to be associated with between-species transmission events. Between-species transmission patterns in the predicted class probabilities of the case studies were also consistent with the literature, for both correctly and incorrectly classified sequences.
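
The screening step described above, picking out test sequences that were misclassified with a predicted probability above 90%, can be sketched as a simple filter over per-sequence class probabilities. This is an illustrative reading of that step rather than the authors' code; the function name `flag_confident_misclassifications` and the record layout `(sequence_id, true_host, probabilities)` are assumptions.

```python
def flag_confident_misclassifications(records, threshold=0.90):
    """Return misclassified records whose top predicted probability exceeds `threshold`.

    Hypothetical helper. Each record is assumed to be a tuple
    (sequence_id, true_host, probabilities), where `probabilities` maps
    each host class to the model's predicted probability.
    """
    flagged = []
    for sequence_id, true_host, probabilities in records:
        # The predicted host is the class with the highest probability.
        predicted_host = max(probabilities, key=probabilities.get)
        top_probability = probabilities[predicted_host]

        # Keep only confident errors: wrong host, probability above threshold.
        if predicted_host != true_host and top_probability > threshold:
            flagged.append((sequence_id, true_host, predicted_host, top_probability))
    return flagged
```

Sequences returned by such a filter are candidates for manual review: the model is confident that the genome resembles a host other than the one recorded, which is the pattern the study associates with between-species transmission.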

These models allow for rapid and accurate host prediction of H3 IAV datasets from any of the eight IAV segments and provide a solid framework for identifying variants with higher-than-typical between-species transmission potential. However, results obtained on selected case studies suggest that further improvements to the training and validation processes should be considered.

## Linked entities

- **Diseases:** influenza (MONDO:0005812)

## Full-text entities

- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615], Sus scrofa (pig, species) [taxon 9823], H3N2 subtype (serotype) [taxon 119210]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12588535/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12588535/full.md

## References

86 references — full list in the complete paper: https://tomesphere.com/paper/PMC12588535/full.md

---
Source: https://tomesphere.com/paper/PMC12588535