Optimal Assembly for High Throughput Shotgun Sequencing
Guy Bresler, Ma'ayan Bresler, David Tse

TL;DR
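This paper develops a framework for designing optimal genome assembly algorithms for shotgun sequencing. It derives lower bounds on the read length and coverage depth needed for complete reconstruction, and presents an algorithm that comes very close to those bounds for the repeat statistics of a wide range of sequenced genomes.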
Contribution
It introduces a theoretical framework with bounds and an assembly algorithm that approaches optimality based on genome repeat statistics.
Findings
Derived lower bounds on read length and coverage for complete genome reconstruction.
Designed a de Bruijn graph-based assembly algorithm that performs very close to the theoretical lower bound.
Validated the approach on diverse genome datasets, including GAGE datasets.
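The bounds above are stated in terms of the repeat statistics of the genome. As a rough illustration only, the sketch below computes one simple repeat statistic, the length of the longest substring that occurs at least twice in a sequence. The function name `longest_repeat_length` is hypothetical, and this summary does not say which repeat statistics the paper actually uses, so treat this as an assumption rather than the authors' definition.

```python
def longest_repeat_length(seq: str) -> int:
    """Return the length of the longest substring occurring at least twice.

    Occurrences may overlap. Illustrative sketch only; not taken from the paper.
    """

    def has_repeat(k: int) -> bool:
        # True if some length-k substring appears at two different positions.
        seen = set()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:
                return True
            seen.add(kmer)
        return False

    # A repeat of length k implies a repeat of length k - 1,
    # so the answer can be found by binary search on k.
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid
        else:
            hi = mid - 1
    return max(lo, 0)


if __name__ == "__main__":
    print(longest_repeat_length("ACGTACGA"))  # "ACG" occurs twice
```

Intuitively, reads shorter than a long repeat cannot span it, which is why statistics of this kind are a natural input to a read-length bound.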
Abstract
We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Bruijn graph-based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of sequenced genomes, including the GAGE datasets. The results are based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun sequencing analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by Hybridization.
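For readers unfamiliar with the data structure the abstract mentions, here is a minimal sketch of building a de Bruijn graph from a set of reads. It shows only the standard textbook construction, where each k-mer becomes an edge from its (k-1)-length prefix to its (k-1)-length suffix. It is not the authors' assembly algorithm; the function name `de_bruijn_graph`, the parameter `k`, and the example reads are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads (textbook construction, not the paper's algorithm).

    Nodes are (k-1)-mers; every k-mer found in a read adds an edge
    from its prefix (first k-1 characters) to its suffix (last k-1 characters).
    Returns a dict mapping each node with outgoing edges to a sorted list of successors.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


if __name__ == "__main__":
    # Two hypothetical overlapping reads from the sequence "ACGTA".
    example_reads = ["ACGT", "CGTA"]
    for node, successors in de_bruijn_graph(example_reads, k=3).items():
        print(node, "->", successors)
```

In this toy case the graph is a single path AC, CG, GT, TA, and walking it spells out ACGTA. Repeats in the genome create branching nodes, which is where the reconstruction conditions discussed in the abstract come into play.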
Taxonomy
Topics: Genomics and Phylogenetic Studies · Chromosomal and Genetic Variations · Genome Rearrangement Algorithms
