Commet: comparing and combining multiple metagenomic datasets
Nicolas Maillet, Guillaume Collet, Thomas Vanier, Dominique Lavenier, Pierre Peterlongo

TL;DR
Commet is a scalable method for comparing multiple large metagenomic datasets directly from raw reads, enabling similarity analysis, clustering, and visualization without assembly.
Contribution
It introduces an efficient indexing and bit vector-based approach for large-scale metagenomic dataset comparison and clustering, overcoming previous scalability limitations.
Findings
Enables all-against-all comparison of large metagenomic datasets
Provides a compressed representation of read files for efficient analysis
Facilitates visualization of dataset similarities through clustering
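
As a rough illustration of the findings above, the following Python sketch indexes the k-mers of one read set, marks each read of another set in a bit vector when it shares enough k-mers with that index, and derives an all-against-all similarity matrix from the bit counts. This summary does not give the actual algorithm or parameters, so the k-mer set index, the `min_shared` threshold, the function names, and the toy reads are all assumptions made for illustration, not Commet's implementation.

```python
def kmers(read, k):
    """Yield every k-mer of a read, one per starting position."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def index_dataset(reads, k):
    """Assumed index: the set of all k-mers seen in a read set."""
    index = set()
    for read in reads:
        index.update(kmers(read, k))
    return index

def similar_reads_bitvector(query_reads, ref_index, k, min_shared):
    """Return an int used as a bit vector: bit i is set when query read i
    has at least min_shared k-mer occurrences present in ref_index."""
    bits = 0
    for pos, read in enumerate(query_reads):
        shared = sum(1 for km in kmers(read, k) if km in ref_index)
        if shared >= min_shared:
            bits |= 1 << pos
    return bits

def similarity_matrix(datasets, k=4, min_shared=2):
    """All-against-all comparison: fraction of reads of A similar to B."""
    indexes = {name: index_dataset(reads, k) for name, reads in datasets.items()}
    matrix = {}
    for a, reads_a in datasets.items():
        for b in datasets:
            bits = similar_reads_bitvector(reads_a, indexes[b], k, min_shared)
            # One bit per read, so the popcount is the number of similar reads.
            matrix[(a, b)] = bin(bits).count("1") / len(reads_a) if reads_a else 0.0
    return matrix

# Hypothetical toy data, far smaller than any real metagenomic dataset.
toy = {
    "A": ["ACGTACGT", "TTTTTTTT"],
    "B": ["ACGTACGA", "GGGGGGGG"],
}
matrix = similarity_matrix(toy)
print(matrix[("A", "B")])  # 0.5: one of the two reads of A is similar to B
```

Storing one bit per read is what makes the representation compact: a subset of a read file is described by its bit vector rather than by a copy of the reads. The resulting matrix can be passed to any standard hierarchical clustering routine to visualize dataset similarities.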
Abstract
Metagenomics offers a way to analyze biotopes at the genomic level and to reach functional and taxonomical conclusions. The bioanalyses of large metagenomic projects face critical limitations: complex metagenomes cannot be assembled, and the available taxonomical or functional annotations cover much less than the real biological diversity. This motivated the development of de novo metagenomic read comparison approaches to extract the information contained in metagenomic datasets. However, these new approaches do not scale up to large metagenomic projects, or they generate a large number of large intermediate and result files. We introduce Commet ("COmpare Multiple METagenomes"), a method that provides a similarity overview between all datasets of large metagenomic projects. Directly from non-assembled reads, all-against-all comparisons are performed through an efficient indexing strategy. Then,…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Gut microbiota and health · Bioinformatics and Genomic Networks
