# Multi-proteins similarity-based sampling to select representative genomes from large databases

**Authors:** Rémi-Vinh Coudert, Jean-Philippe Charrier, Frédéric Jauffrit, Jean-Pierre Flandrois, Céline Brochier-Armanet

PMC · DOI: 10.1186/s12859-025-06095-3 · 2025-05-06

## TL;DR

This paper introduces MPS-Sampling, a new method to efficiently select representative genomes from large databases using protein similarity, avoiding biases from taxonomy or phylogenetic trees.

## Contribution

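MPS-Sampling is a scalable genome sampling method that groups genomes by the similarity of their homologous protein families. It requires neither taxonomic information nor the inference of phylogenetic trees, which avoids the biases inherent in those approaches.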

## Key findings

- MPS-Sampling generated representative genome sets from 178,203 bacterial genomes using 48 ribosomal protein families, with sample sizes ranging from 32.17% down to 0.3% of the full dataset.
- Selected genomes were taxonomically and phylogenetically representative of the full dataset.
- The method is computationally efficient and avoids biases from traditional taxonomic or tree-based approaches.

## Abstract

Genome sequence databases are growing exponentially, but with high redundancy and uneven data quality. For these reasons, selecting representative subsets of genomes is an essential step for almost all studies. However, most current sampling approaches are biased and unable to process large datasets in a reasonable time.

Here we present MPS-Sampling (Multiple-Protein Similarity-based Sampling), a fast, scalable, and efficient method for selecting reliable and representative samples of genomes from very large datasets. Using families of homologous proteins as input, MPS-Sampling delineates homogeneous groups of genomes through two successive clustering steps. Representative genomes are then selected within these groups according to predefined or user-defined priority criteria.
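The workflow described above (group genomes, refine the groups, then pick one representative per group) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each genome already carries one cluster label per protein family (produced by some external sequence-clustering tool), and the function names, the `min_shared` threshold, the single-linkage merge, and the `quality` scores are all hypothetical choices made for illustration only.

```python
from collections import defaultdict

def group_by_signature(labels):
    """Step 1 (assumed): genomes with identical per-family labels form one group."""
    groups = defaultdict(list)
    for genome, signature in labels.items():
        groups[tuple(signature)].append(genome)
    return dict(groups)

def merge_similar(groups, min_shared):
    """Step 2 (assumed): single-linkage merge of groups whose signatures
    agree on at least `min_shared` (a fraction) of the protein families."""
    sigs = list(groups)
    parent = list(range(len(sigs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic pairwise comparison: acceptable for a toy, not for 178,203 genomes.
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            shared = sum(a == b for a, b in zip(sigs[i], sigs[j])) / len(sigs[i])
            if shared >= min_shared:
                parent[find(i)] = find(j)

    merged = defaultdict(list)
    for i, sig in enumerate(sigs):
        merged[find(i)].extend(groups[sig])
    return [sorted(members) for members in merged.values()]

def pick_representatives(clusters, priority):
    """Select the genome with the best (lowest) priority key in each cluster."""
    return [min(cluster, key=priority) for cluster in clusters]

# Hypothetical input: four genomes, four protein families, made-up quality scores.
labels = {
    "g1": (1, 1, 1, 1),
    "g2": (1, 1, 1, 1),
    "g3": (1, 1, 1, 2),
    "g4": (5, 6, 7, 8),
}
quality = {"g1": 0.90, "g2": 0.99, "g3": 0.95, "g4": 0.80}

groups = group_by_signature(labels)
clusters = merge_similar(groups, min_shared=0.75)
reps = pick_representatives(clusters, priority=lambda g: (-quality[g], g))
print(clusters)  # [['g1', 'g2', 'g3'], ['g4']]
print(reps)      # ['g2', 'g4']
```

The priority function stands in for the "predefined or user-defined priority criteria" mentioned in the abstract; here it simply prefers the genome with the highest made-up quality score and breaks ties by name.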

MPS-Sampling was applied to a dataset of 48 ribosomal protein families from 178,203 bacterial genomes to generate representative genome sets of various sizes, corresponding to a sampling of 32.17% down to 0.3% of the complete dataset. An in-depth analysis shows that the selected genomes are both taxonomically and phylogenetically representative of the complete dataset, demonstrating the relevance of the approach.
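To give a sense of scale, the reported sampling fractions can be converted into approximate genome counts. The sketch below only multiplies the figures stated in the abstract; because the percentages are rounded, the results are estimates and may differ slightly from the exact counts reported in the paper. The helper name `approx_sample_size` is hypothetical.

```python
TOTAL_GENOMES = 178_203  # dataset size stated in the abstract

def approx_sample_size(fraction_percent, total=TOTAL_GENOMES):
    """Estimate a sample size from a (rounded) sampling percentage."""
    return round(total * fraction_percent / 100)

largest = approx_sample_size(32.17)  # least aggressive sampling
smallest = approx_sample_size(0.3)   # most aggressive sampling
print(largest, smallest)  # 57328 535
```

Under these assumptions, the representative sets range from roughly 57,000 genomes down to a few hundred.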

MPS-Sampling provides a fast and scalable way to sample large collections of genomes in an acceptable computational time. It does not rely on taxonomic information and does not require the inference of phylogenetic trees, thus avoiding the biases inherent in these approaches. As such, MPS-Sampling meets the needs of a growing number of users.

The online version contains supplementary material available at 10.1186/s12859-025-06095-3.

## Full-text entities

- **Diseases:** MPS (MESH:C536318)
- **Chemicals:** TaxSampler (-)
- **Species:** Cyanobacteriota (blue-green algae, phylum) [taxon 1117], Bacteroidota (Bacteroides-Cytophaga-Flexibacter group, phylum) [taxon 976], Pseudomonadota (proteobacteria, phylum) [taxon 1224], Bacteroidia (class) [taxon 200643], Spirochaetia (class) [taxon 203692], Bacillus (genus) [taxon 55087], Planctomycetota (phylum) [taxon 203682], Enterobacteriaceae (enterobacteria, family) [taxon 543], Bacteria Latreille et al. 1825 (Bacteria stick insect, genus) [taxon 629395], Terriglobia (class) [taxon 204432], Actinomycetota (actinobacteria, phylum) [taxon 201174], Bacillota (clostridial firmicutes, phylum) [taxon 1239], Chloroflexota (GNS bacteria, phylum) [taxon 200795]

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12057276/full.md

---
Source: https://tomesphere.com/paper/PMC12057276