Theoretical Bounds on Mate-Pair Information for Accurate Genome Assembly
Henry Lin

TL;DR
This paper bounds the number of mate-pair libraries needed for accurate genome assembly: roughly R/2L libraries are necessary in the worst case, (R/L)+1 suffice for a simple polynomial-time algorithm, and under certain conditions only O(log(R/L)) are needed.
Contribution
It offers the first theoretical analysis quantifying how many mate-pair libraries accurate genome assembly requires, as a function of the length of the longest repetitive region.
Findings
In the worst case, accurate assembly cannot be guaranteed with fewer than roughly R/2L mate-pair libraries.
A simple polynomial-time algorithm can assemble the genome with (R/L)+1 libraries.
Under certain conditions, only O(log(R/L)) libraries are needed for guaranteed correctness.
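To give a sense of scale for the three bounds above, the following is a minimal Python sketch that evaluates them for one pair of values. It rests on assumptions not spelled out in this summary: R is taken to be the length of the longest repetitive region, L the read length, R/2L is read as R/(2L), and the logarithm is taken base 2 with the hidden constant in O(log(R/L)) ignored. The function name library_bounds and the sample values are hypothetical, chosen only for illustration.

```python
import math


def library_bounds(R: int, L: int) -> dict:
    """Evaluate the three library-count expressions from the Findings.

    Assumptions (not confirmed by this summary): R is the length of the
    longest repetitive region, L is the read length, and "R/2L" means
    R / (2 * L).
    """
    if R <= 0 or L <= 0:
        raise ValueError("R and L must be positive")
    ratio = R / L
    return {
        # Worst-case necessary count (approximate, per "roughly R/2L").
        "worst_case_lower_bound": ratio / 2,
        # Count sufficient for the simple polynomial-time algorithm.
        "simple_algorithm_sufficient": ratio + 1,
        # Growth term only: the constant hidden by O(.) is unknown here,
        # and base 2 is an arbitrary choice.
        "log_growth_term": math.log2(ratio) if ratio > 1 else 0.0,
    }


# Hypothetical example: a 10,000 bp repeat sequenced with 100 bp reads.
bounds = library_bounds(R=10_000, L=100)
for name, value in bounds.items():
    print(f"{name}: {value:.2f}")
```

For these sample values the worst-case bound is 50 libraries and the simple algorithm needs 101, while the logarithmic term is under 7, which illustrates why the conditional O(log(R/L)) result is the more practical one.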
Abstract
Over the past two decades, a series of works have aimed at studying the problem of genome assembly: the process of reconstructing a genome from sequence reads. An early formulation of the genome assembly problem showed that genome reconstruction is NP-hard when framed as finding the shortest sequence that contains all observed reads. Although this original formulation is very simplistic and does not allow for mate-pair information, subsequent formulations have also proven to be NP-hard, and/or may not be guaranteed to return a correct assembly. In this paper, we provide an alternate perspective on the genome assembly problem by showing genome assembly is easy when provided with sufficient mate-pair information. Moreover, we quantify the number of mate-pair libraries necessary and sufficient for accurate genome assembly, in terms of the length of the longest repetitive region within a…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genome Rearrangement Algorithms · Chromosomal and Genetic Variations
