On the Coverage Required for Diploid Genome Assembly
Daanish Mahajan, Chirag Jain, Navin Kashyap

TL;DR
This paper explores the theoretical coverage and read length requirements for complete diploid genome assembly, revealing that practical algorithms need significantly higher coverage than the theoretical minimum due to repeat bridging challenges.
Contribution
It provides the first information-theoretic analysis of coverage needs for diploid genome assembly and evaluates the limitations of common assembly algorithms.
Findings
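The standard greedy and de Bruijn graph-based assembly algorithms require considerably higher coverage and read length than the information-theoretic lower bound.
The gap arises because both algorithms require the double repeats in the genome to be bridged.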
Necessary conditions for overlap graph-based assembly are derived.
Abstract
The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based assembly algorithms. Our results show that the coverage and read length requirements of the assembly algorithms are considerably higher than the lower bound because both algorithms require the double repeats in the genome to be bridged. Finally, we derive the necessary conditions for the overlap graph-based assembly…
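To make the notion of "bridging" concrete, the sketch below estimates how likely it is that at least one read bridges a single repeat copy, meaning the read contains the whole repeat plus at least one base on each side. This is an illustrative toy model, not the paper's diploid analysis: it assumes error-free reads of one fixed length, read start positions drawn uniformly and independently over the genome, and no edge effects. All function and parameter names are hypothetical.

```python
def bridging_start_positions(read_len: int, repeat_len: int) -> int:
    """Count read start offsets at which a read covers the whole repeat
    plus at least one flanking base on each side (0 if the read is too short)."""
    return max(0, read_len - repeat_len - 1)


def coverage(genome_len: int, read_len: int, num_reads: int) -> float:
    """Average coverage depth: total sequenced bases divided by genome length."""
    return num_reads * read_len / genome_len


def prob_repeat_bridged(genome_len: int, read_len: int,
                        repeat_len: int, num_reads: int) -> float:
    """Probability that at least one of num_reads reads bridges one repeat copy.

    Assumption (toy model): each read start is uniform over genome_len
    positions and independent of the others.
    """
    k = bridging_start_positions(read_len, repeat_len)
    p_single = k / genome_len          # chance a single read bridges the copy
    return 1.0 - (1.0 - p_single) ** num_reads


if __name__ == "__main__":
    G, L, r = 1_000_000, 10_000, 8_000   # illustrative values only
    for n in (100, 500, 2_000):
        print(f"coverage={coverage(G, L, n):.1f}x  "
              f"P(bridged)={prob_repeat_bridged(G, L, r, n):.3f}")
```

Under these assumptions, the bridging probability drops to zero once the repeat is within one base of the read length, regardless of coverage. This matches the intuition that a bridging requirement constrains read length as well as coverage.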
Taxonomy
Topics: Chromosomal and Genetic Variations · Evolutionary Algorithms and Applications · DNA and Biological Computing
