A probabilistic analysis of shotgun sequencing for metagenomics
Marlee Herring

TL;DR
This paper provides a probabilistic framework for determining the minimal read length needed for reliable genome reconstruction in metagenomics. It establishes read-length thresholds that depend on the number of genomes and their length.
Contribution
It introduces a probabilistic model for analyzing the identifiability of collections of genomes and derives read-length thresholds for successful reconstruction from shotgun reads.
Findings
Reconstruction succeeds when the read length exceeds the upper threshold
Reconstruction is impossible when the read length falls below the lower threshold
Both thresholds depend on the number of genomes M and the genome length N
Abstract
Genome sequencing is the basis for many modern biological and medicinal studies. With recent technological advances, metagenomics has become a problem of interest. This problem entails the analysis and reconstruction of multiple DNA sequences from different sources. Shotgun genome sequencing works by breaking up long DNA sequences into shorter segments called reads. Given this collection of reads, one would like to reconstruct the original collection of DNA sequences. For experimental design in metagenomics, it is important to understand how the minimal read length necessary for reliable reconstruction depends on the number and characteristics of the genomes involved. Utilizing simple probabilistic models for each DNA sequence, we analyze the identifiability of collections of M genomes of length N in an asymptotic regime in which N tends to infinity and M may grow with N. Our first main…
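The setup described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's model or algorithm: it assumes i.i.d. uniform sequences over {A, C, G, T}, noiseless reads taken at every position, and it uses "some read occurs at more than one position" as a crude proxy for ambiguity in reconstruction. All function names (`random_genomes`, `shotgun_reads`, `has_repeated_read`, `repeat_frequency`) are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def random_genomes(m, n, rng):
    """Draw m i.i.d. uniform sequences of length n over {A, C, G, T} (an assumed model)."""
    return ["".join(rng.choice("ACGT") for _ in range(n)) for _ in range(m)]


def shotgun_reads(genomes, read_len):
    """Return every length-read_len substring of every genome (noiseless, full coverage)."""
    return [g[i:i + read_len] for g in genomes for i in range(len(g) - read_len + 1)]


def has_repeated_read(genomes, read_len):
    """True if some read occurs at more than one position in the collection."""
    counts = Counter(shotgun_reads(genomes, read_len))
    return any(c > 1 for c in counts.values())


def repeat_frequency(m, n, read_len, trials=20, seed=0):
    """Fraction of random collections (M = m, N = n) that contain a repeated read."""
    rng = random.Random(seed)
    hits = sum(
        has_repeated_read(random_genomes(m, n, rng), read_len)
        for _ in range(trials)
    )
    return hits / trials
```

Calling `repeat_frequency(m=5, n=1000, read_len=8)` and then again with `read_len=30` gives a rough feel for how repeats, and with them one source of ambiguity, become rare as reads get longer. Under this assumed model the frequency tends to be close to 1 for short reads and close to 0 for long ones; the precise thresholds in terms of M and N are the subject of the paper and are not reproduced by this proxy.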
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Gene expression and cancer classification
