Assembly of repetitive regions using next-generation sequencing data
Robert M. Nowak

TL;DR
This paper introduces a new DNA assembly algorithm that leverages read frequency to accurately reconstruct long repetitive regions, outperforming existing methods especially at high sequencing coverage.
Contribution
A novel algorithm utilizing read frequency for assembling repetitive DNA regions, with a mathematical model defining accuracy limits based on coverage.
Findings
Successfully reconstructed long repeats where existing assemblers failed
Mathematical model relates accuracy to read coverage and repeat length
Effective with high read depth from next-generation sequencing
Abstract
High read depth can be used to assemble short sequence repeats. Existing genome assemblers fail in repetitive regions that are longer than the average read. I propose a new DNA assembly algorithm that uses the relative frequency of reads to correctly reconstruct repetitive sequences. The mathematical model gives the upper limits of accuracy of the results as a function of read coverage. For high coverage, the estimation error depends linearly on the repetitive sequence length and is inversely proportional to the sequencing coverage. The algorithm requires high read depth, which next-generation sequencers provide, and could be applied to existing data. Tests on error-free reads, generated in silico from several model genomes, yielded correctly reconstructed repetitive sequences where existing assemblers fail.
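The idea of using relative read frequency can be illustrated with a small sketch. This is not the paper's algorithm. It is a simplified stand-in, assuming error-free reads sampled uniformly in silico, that estimates how many times a repeat unit occurs by comparing the count of one of its k-mers with the median k-mer count, which is taken as the single-copy baseline. All function names, the toy genome, and the parameter values (read length 50, k = 15, 8000 reads) are hypothetical choices made for illustration.

```python
import random
from collections import Counter
from statistics import median


def random_dna(length, rng):
    """Return a random DNA string (hypothetical helper for the toy genome)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def sample_reads(genome, read_len, n_reads, rng):
    """Draw error-free reads at uniformly random start positions."""
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_copy_number(counts, kmer):
    """Relative frequency of `kmer` versus the median (assumed single-copy) k-mer."""
    baseline = median(counts.values())
    return counts[kmer] / baseline


# Toy genome: two unique flanks around a tandem repeat that is longer than a read.
rng = random.Random(0)
unit = random_dna(40, rng)
true_copies = 5
genome = random_dna(1000, rng) + unit * true_copies + random_dna(1000, rng)

reads = sample_reads(genome, read_len=50, n_reads=8000, rng=rng)
counts = kmer_counts(reads, k=15)
estimate = estimate_copy_number(counts, unit[:15])
print(f"true copies: {true_copies}, estimated: {estimate:.2f}")
```

With this much coverage the estimate should land close to the true copy number. Lowering `n_reads` makes it noisier, which is in line with the abstract's statement that the estimation error shrinks as sequencing coverage grows.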
