Assembly of repetitive regions using next-generation sequencing data

Robert M. Nowak

arXiv:1411.0395·q-bio.GN·January 8, 2015

TL;DR

This paper introduces a DNA assembly algorithm that uses the relative frequency of reads to reconstruct repetitive regions longer than a read. Given high sequencing coverage, it succeeds on in silico tests where existing assemblers fail.

Contribution

A novel algorithm that uses the relative frequency of reads to assemble repetitive DNA regions, together with a mathematical model giving upper limits on accuracy as a function of read coverage.

Findings

01 Successfully reconstructed long repeats where existing assemblers failed

02 Mathematical model relates accuracy to read coverage and repeat length

03 Effective with high read depth from next-generation sequencing

Abstract

High read depth can be used to assemble short sequence repeats. Existing genome assemblers fail in repetitive regions longer than the average read. I propose a new DNA assembly algorithm that uses the relative frequency of reads to properly reconstruct repetitive sequences. The mathematical model gives upper limits on the accuracy of the results as a function of read coverage. For high coverage, the estimation error depends linearly on the length of the repetitive sequence and is inversely proportional to the sequencing coverage. The algorithm requires high read depth, which next-generation sequencers provide, and could be applied to existing data. Tests on errorless reads, generated in silico from several model genomes, yielded properly reconstructed repetitive sequences where existing assemblers fail.
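
The idea of using relative read frequency can be illustrated with a toy example. The sketch below is not the algorithm from the paper. It only shows, under idealized assumptions (errorless reads, perfectly uniform coverage, and a baseline taken as the median read count), how the frequency of a read relative to single-copy sequence hints at how many times it occurs in the genome. The names `sample_reads` and `estimate_copy_numbers`, the toy genome, and the parameters `read_len` and `depth` are hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import median


def sample_reads(genome, read_len, depth):
    """Return errorless reads: every start position is sampled `depth` times.

    This is an idealized stand-in for uniform sequencing coverage.
    """
    starts = range(len(genome) - read_len + 1)
    return [genome[i:i + read_len] for i in starts for _ in range(depth)]


def estimate_copy_numbers(reads):
    """Estimate how often each distinct read occurs in the genome.

    Assumption: the median read count represents single-copy sequence,
    so count / median approximates the copy number.
    """
    counts = Counter(reads)
    baseline = median(counts.values())
    return {read: round(count / baseline) for read, count in counts.items()}


# Toy genome: unique flank, three tandem copies of a repeat unit, unique flank.
# The repeat region (24 bases) is longer than a read (6 bases).
unit = "ACGTTGCA"
genome = "GGATCCTA" + unit * 3 + "TTAGGCAT"

reads = sample_reads(genome, read_len=6, depth=10)
copies = estimate_copy_numbers(reads)

for read in ("GGATCC", "TTGCAA", "ACGTTG"):
    print(read, copies[read])
```

Reads that lie inside a repeat unit are seen about three times as often as reads from the flanks, while reads that span the junction between two units are seen about twice as often, because three tandem copies contain only two such junctions. With real data the counts fluctuate around these ratios, which is consistent with the abstract's point that accuracy depends on coverage.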
