# Machine learning models for delineating marine microbial taxa

**Authors:** Stilianos Louca

PMC · DOI: 10.1093/nargab/lqaf090 · 2025-06-19

## TL;DR

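This paper trains machine learning classifiers on genome similarity metrics to decide whether two marine bacterial or archaeal metagenome-assembled genomes (MAGs) belong to the same taxon, from the genus up to the phylum level, which supports the discovery and enumeration of novel taxa in metagenomes.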

## Contribution

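The study introduces classifier models that combine multiple genome similarity metrics (average amino acid and nucleotide identities, plus fractions of shared genes within various categories) to delineate marine prokaryotic taxa from the genus to the phylum level, with balanced accuracy above 92% at every level.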

## Key findings

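- Classifiers exceeded 92% balanced accuracy (the average of the true-positive and true-negative rates) at all taxonomic levels, from genus to phylum.
- Predictor selection and sensitivity analyses identified gene categories, such as those involved in metabolism of cofactors and vitamins, that are particularly correlated with taxon divergence.
- Statistical analyses of de novo taxon enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys.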
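The gene-category finding rests on "fractions of shared genes within various categories" as pairwise predictors. This summary does not give the paper's exact definition, so the sketch below assumes a Jaccard-style ratio (shared gene families divided by the union of gene families). The function names, category labels, and gene-family identifiers are all hypothetical and serve only to illustrate how such a per-category metric could be computed for a pair of MAGs.

```python
def shared_gene_fraction(genes_a, genes_b):
    """Jaccard-style fraction of shared gene families (an assumed definition)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return None  # undefined when neither genome has genes in this category
    return len(a & b) / len(union)


def category_similarities(genome_a, genome_b):
    """Compute the shared-gene fraction separately for each gene category.

    Each genome is a dict mapping a category name to a set of gene-family IDs.
    """
    categories = sorted(set(genome_a) | set(genome_b))
    return {
        cat: shared_gene_fraction(genome_a.get(cat, ()), genome_b.get(cat, ()))
        for cat in categories
    }


# Hypothetical toy genomes; identifiers and categories are made up.
mag_a = {
    "cofactors_vitamins": {"fam1", "fam2", "fam3"},
    "energy": {"fam7", "fam8"},
}
mag_b = {
    "cofactors_vitamins": {"fam2", "fam3", "fam4"},
    "energy": {"fam7", "fam8"},
}

sims = category_similarities(mag_a, mag_b)
print(sims)
```

In a setup like this, each per-category fraction would become one predictor column for a genome pair, alongside identity-based metrics, and a classifier would be trained to predict whether the pair shares a taxon at a given level.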

## Abstract

The relationship between gene content differences and microbial taxonomic divergence remains poorly understood, and algorithms for delineating novel microbial taxa above genus level based on multiple genome similarity metrics are lacking. Addressing these gaps is important for macroevolutionary theory, biodiversity assessments, and discovery of novel taxa in metagenomes. Here, I develop machine learning classifier models, based on multiple genome similarity metrics, to determine whether any two marine bacterial and archaeal (prokaryotic) metagenome-assembled genomes (MAGs) belong to the same taxon, from the genus up to the phylum levels. Metrics include average amino acid and nucleotide identities, and fractions of shared genes within various categories, applied to 14 390 previously published non-redundant MAGs. At all taxonomic levels, the balanced accuracy (average of the true-positive and true-negative rate) of classifiers exceeded 92%, suggesting that simple genome similarity metrics serve as good taxon differentiators. Predictor selection and sensitivity analyses revealed gene categories, e.g. those involved in metabolism of cofactors and vitamins, particularly correlated to taxon divergence. Predicted taxon delineations were further used to de novo enumerate marine prokaryotic taxa. Statistical analyses of those enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys.
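The abstract defines balanced accuracy as the average of the true-positive and true-negative rates. A minimal sketch of that definition follows, assuming binary labels where `True` means "the two MAGs belong to the same taxon"; the function name and the toy labels are illustrative and not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the true-positive rate and the true-negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)  # fraction of same-taxon pairs correctly recognized
    tnr = tn / (tn + fp)  # fraction of different-taxon pairs correctly recognized
    return (tpr + tnr) / 2


# Hypothetical, imbalanced toy data: 2 same-taxon pairs, 6 different-taxon pairs.
y_true = [True, True, False, False, False, False, False, False]
y_pred = [True, False, False, False, False, False, False, True]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {score:.3f}, plain accuracy = {plain:.3f}")
```

Balanced accuracy is a natural choice for pairwise taxon delineation because most randomly chosen genome pairs belong to different taxa; plain accuracy would reward a classifier that mostly answers "different", whereas averaging the two rates weights both classes equally.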

## Full-text entities

- **Chemicals:** KEGG (-)
- **Species:** Flavobacteriales (order) [taxon 200644]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12204397/full.md

---
Source: https://tomesphere.com/paper/PMC12204397