Multi-proteins similarity-based sampling to select representative genomes from large databases
Rémi-Vinh Coudert, Jean-Philippe Charrier, Frédéric Jauffrit, Jean-Pierre Flandrois, Céline Brochier-Armanet

TL;DR
This paper introduces MPS-Sampling, a new method to efficiently select representative genomes from large databases using protein similarity, avoiding biases from taxonomy or phylogenetic trees.
Contribution
MPS-Sampling is a novel, scalable genome sampling method based on homologous protein families, avoiding taxonomic and phylogenetic biases.
Findings
MPS-Sampling successfully generated representative genome sets from 178,203 bacterial genomes using 48 ribosomal protein families.
Selected genomes were taxonomically and phylogenetically representative of the full dataset.
The method is computationally efficient and avoids biases from traditional taxonomic or tree-based approaches.
Abstract
Genome sequence databases are growing exponentially, but with high redundancy and uneven data quality. For these reasons, selecting representative subsets of genomes is an essential step for almost all studies. However, most current sampling approaches are biased and unable to process large datasets in a reasonable time. Here we present MPS-Sampling (Multiple-Protein Similarity-based Sampling), a fast, scalable, and efficient method for selecting reliable and representative samples of genomes from very large datasets. Using families of homologous proteins as input, MPS-Sampling delineates homogeneous groups of genomes through two successive clustering steps. Representative genomes are then selected within these groups according to predefined or user-defined priority criteria. MPS-Sampling was applied to a dataset of 48 ribosomal protein families from 178,203 bacterial genomes to…
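The workflow outlined in the abstract (two successive clustering steps, then priority-based selection of representatives) can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the authors' implementation: the position-wise identity measure, the greedy and single-linkage clustering rules, both thresholds, the function names, and the example sequences and priority scores are all hypothetical choices made only for illustration.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of matching positions between two toy sequences (assumed measure)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def cluster_family(seqs, threshold):
    """Step 1 (toy): greedy clustering of one protein family.
    seqs maps genome id -> sequence; returns genome id -> cluster label."""
    centroids, labels = [], {}
    for genome in sorted(seqs):
        for idx, centroid in enumerate(centroids):
            if identity(seqs[genome], centroid) >= threshold:
                labels[genome] = idx
                break
        else:
            # No centroid is similar enough: this sequence starts a new cluster.
            centroids.append(seqs[genome])
            labels[genome] = len(centroids) - 1
    return labels

def group_genomes(family_labels, genomes, min_shared):
    """Step 2 (toy): single-linkage grouping of genomes that share a step-1
    cluster in at least a fraction min_shared of their common families."""
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        common = [lab for lab in family_labels if a in lab and b in lab]
        if not common:
            continue
        same = sum(lab[a] == lab[b] for lab in common)
        if same / len(common) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())

def pick_representatives(groups, priority):
    """Select one genome per group: highest priority score, ties broken by id."""
    return [min(group, key=lambda g: (-priority.get(g, 0), g)) for group in groups]

# Hypothetical toy input: two protein families across three genomes.
families = {
    "family_A": {"g1": "MKVLAA", "g2": "MKVLAG", "g3": "TTPQRS"},
    "family_B": {"g1": "GGHHKK", "g2": "GGHHKK", "g3": "AAPPLL"},
}
priority = {"g1": 1, "g2": 3, "g3": 2}  # e.g. an assumed genome quality score

family_labels = [cluster_family(seqs, threshold=0.8) for seqs in families.values()]
groups = group_genomes(family_labels, ["g1", "g2", "g3"], min_shared=1.0)
representatives = pick_representatives(groups, priority)
print(groups)           # [['g1', 'g2'], ['g3']]
print(representatives)  # ['g2', 'g3']
```

The pairwise loop in this sketch is quadratic in the number of genomes, so it says nothing about the scalability reported for MPS-Sampling; it only shows how per-family clusters can be combined into genome groups before a representative is chosen from each group.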
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genetic diversity and population structure · Microbial Community Ecology and Physiology
