Do Read Errors Matter for Genome Assembly?

Ilan Shomorony, Thomas Courtade, and David Tse

arXiv:1501.06194 · cs.IT · January 27, 2015

TL;DR

This paper investigates how the high error rates of long-read sequencing technologies affect genome assembly. It establishes a critical read length above which perfect assembly is guaranteed and evaluates that threshold on several real genomes.

Contribution

The paper uses an adversarial erasure error model to derive a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed.

Findings

01. For several real genomes, including those from the GAGE dataset, the critical read length is close to the read length required for error-free reads.

02. High error rates do not significantly increase the read length needed for perfect assembly.

03. The adversarial erasure model provides a theoretical foundation for analyzing how read errors affect genome assembly.

Abstract

While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed. For several real genomes, including those from the GAGE dataset, we verify that this critical read length is not significantly greater than the read length required for perfect assembly from reads without errors.
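
To make the erasure setting concrete, here is a minimal Python sketch of how erased bases can blur the overlap between two reads. It is an illustration only, not the paper's model or assembly algorithm: the erasure symbol `*`, the helper names `erase`, `consistent`, and `overlap_candidates`, and the toy reads are all assumptions introduced here, and the erased positions are simply chosen by hand to stand in for an adversary.

```python
ERASED = "*"  # hypothetical symbol marking an erased base


def erase(read, positions):
    """Return a copy of `read` with the given positions replaced by ERASED."""
    erased = set(positions)
    return "".join(ERASED if i in erased else base for i, base in enumerate(read))


def consistent(a, b):
    """True if two equal-length strings agree wherever neither base is erased."""
    return len(a) == len(b) and all(
        x == y or ERASED in (x, y) for x, y in zip(a, b)
    )


def overlap_candidates(left, right, min_overlap):
    """Overlap lengths k where the last k bases of `left` are consistent
    with the first k bases of `right`."""
    longest = min(len(left), len(right))
    return [
        k
        for k in range(min_overlap, longest + 1)
        if consistent(left[-k:], right[:k])
    ]


# Two reads taken from the toy sequence "ACGTACGGT" (true overlap: 3 bases).
left, right = "ACGTAC", "TACGGT"
clean = overlap_candidates(left, right, 2)
noisy = overlap_candidates(erase(left, [4]), erase(right, [1]), 2)
print(clean, noisy)
```

In this toy case the clean reads admit only the true overlap of length 3, while erasing one base in each read also makes a length-2 overlap look plausible. Extra ambiguity of this kind is what longer reads have to resolve, which is the read-length versus error-rate tradeoff the abstract describes.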
