Large-scale Machine Learning for Metagenomics Sequence Classification
Kévin Vervier (CBIO), Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras, Jean-Philippe Vert (CBIO)

TL;DR
This paper explores large-scale machine learning methods for rapid and accurate taxonomic classification of metagenomic sequencing reads, demonstrating competitive accuracy and significant speed advantages over traditional alignment-based methods.
Contribution
It introduces a scalable machine learning approach for metagenomic read classification that handles large datasets and compares favorably with alignment tools in speed and accuracy.
Findings
Machine learning models benefit from increased reference genome coverage and k-mer size.
Models trained on 10^8 samples achieve accuracy comparable to alignment tools for small to moderate species sets.
Compositional methods are faster, but their accuracy degrades when the number of candidate species is large or the sequencing error rate is high.
Abstract
Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions. In this work, we investigate the potential of modern, large-scale machine learning implementations for taxonomic assignment of next-generation sequencing reads based on their k-mer profile. We show that machine learning-based compositional…
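To make the idea of a k-mer profile concrete, here is a minimal Python sketch of how a read could be turned into a fixed-length count vector and scored. The function names `kmer_profile` and `predict`, the choice of a linear scoring rule, and the toy weights are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_profile(read, k=3):
    """Count each k-mer of `read`; the vector is indexed in lexicographic order over ACGT."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)  # 4**k entries
    read = read.upper()
    for start in range(len(read) - k + 1):
        pos = index.get(read[start:start + k])
        if pos is not None:  # skip k-mers containing ambiguous bases such as N
            counts[pos] += 1
    return counts

def predict(profile, weights):
    """Hypothetical linear classifier: return the label whose weight vector scores highest.

    `weights` maps a taxon label to a weight vector of the same length as `profile`.
    """
    scores = {
        label: sum(w * x for w, x in zip(vector, profile))
        for label, vector in weights.items()
    }
    return max(scores, key=scores.get)

if __name__ == "__main__":
    profile = kmer_profile("ACGTACGT", k=2)
    # Toy weights (made up for illustration): taxon_a rewards "AC", taxon_b rewards "GG".
    toy_weights = {
        "taxon_a": [1.0 if i == 1 else 0.0 for i in range(16)],
        "taxon_b": [1.0 if i == 10 else 0.0 for i in range(16)],
    }
    print(predict(profile, toy_weights))
```

The profile has 4^k entries, so larger k gives a richer but much higher-dimensional representation, which is one reason large-scale learning implementations are relevant here.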
Taxonomy
Topics: Genomics and Phylogenetic Studies · Microbial Community Ecology and Physiology · Machine Learning in Bioinformatics
