Information Theory of DNA Shotgun Sequencing
Abolfazl Motahari, Guy Bresler, David Tse

TL;DR
This paper establishes fundamental limits on DNA sequence reconstruction using shotgun sequencing, revealing a phase transition based on read length and sequence entropy, and analyzing the effects of noise.
Contribution
It introduces an information-theoretic framework for DNA sequencing, identifying critical thresholds for successful assembly based on read length and sequence statistics.
Findings
Reconstruction is impossible below a critical read length threshold.
Above the threshold, sufficient coverage guarantees reconstruction.
Noise impacts the minimum read length needed for reliable assembly.
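To make the read-length threshold concrete, here is a minimal sketch. It assumes the i.i.d. base model and the threshold form reported in the full paper, where the critical read length is (2 / H2) · ln G, with G the sequence length and H2 the Rényi entropy of order 2 of the base distribution. This summary does not state that formula, so treat it as an assumption; the function names are illustrative.

```python
import math


def renyi_entropy_order2(probs):
    """Renyi entropy of order 2, in nats: H2 = -ln(sum_i p_i^2)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("base probabilities must sum to 1")
    return -math.log(sum(p * p for p in probs))


def critical_read_length(genome_length, probs):
    """Assumed threshold (i.i.d. model): L_crit = 2 * ln(G) / H2(p).

    Below this length, reconstruction is impossible regardless of the
    number of reads; above it, covering the sequence is what matters.
    """
    return 2.0 * math.log(genome_length) / renyi_entropy_order2(probs)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # equiprobable A, C, G, T
    skewed = [0.4, 0.1, 0.1, 0.4]        # hypothetical biased composition
    G = 4 ** 10                          # toy sequence length
    print(critical_read_length(G, uniform))
    print(critical_read_length(G, skewed))
```

Under this assumed formula, a less uniform base distribution has lower H2 and therefore a longer critical read length, since repeats become more likely.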
Abstract
DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number provides a fundamental limit to the performance of any assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer admits a critical phenomenon in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the…
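The idea of reads covering the sequence can be illustrated with a small Monte Carlo sketch. It assumes a circular sequence (to avoid edge effects), reads of fixed length, and start positions drawn uniformly at random. These modelling choices and all function names are assumptions made for illustration, not details taken from this summary.

```python
import random


def reads_cover_genome(starts, read_length, genome_length):
    """Return True if reads starting at `starts` cover every position
    of a circular genome. A read at s covers s .. s + read_length - 1."""
    if not starts:
        return False
    ordered = sorted(s % genome_length for s in starts)
    # Consecutive reads leave no hole when their starts are at most
    # read_length apart.
    for left, right in zip(ordered, ordered[1:]):
        if right - left > read_length:
            return False
    # Check the gap that wraps around from the last start to the first.
    wrap_gap = ordered[0] + genome_length - ordered[-1]
    return wrap_gap <= read_length


def estimate_coverage_probability(genome_length, read_length, num_reads,
                                  trials=200, seed=0):
    """Monte Carlo estimate of the probability that uniformly placed
    reads cover the whole circular genome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        starts = [rng.randrange(genome_length) for _ in range(num_reads)]
        if reads_cover_genome(starts, read_length, genome_length):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, estimate_coverage_probability(1000, 50, n))
```

Sweeping `num_reads` for a fixed read length shows the estimated coverage probability rising from near 0 toward 1, which gives a feel for how many reads "enough to cover" means in this toy setting.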
Taxonomy
Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · RNA and protein synthesis mechanisms
