Safe and complete contig assembly via omnitigs

Alexandru I. Tomescu; Paul Medvedev

arXiv:1601.02932·q-bio.QM·August 18, 2016

TL;DR

This paper introduces omnitigs, a new class of contigs that are guaranteed to appear in any genome consistent with the assembly graph. In the authors' experiments, omnitigs are 66% to 82% longer on average than the widely used unitigs.

Contribution

The paper characterizes all strings that can be safely reported as contigs from a genome graph (e.g. a de Bruijn graph or a string graph) and gives a polynomial-time algorithm to find them.

Findings

01

Omnitigs are 66% to 82% longer than unitigs on average.

02

29% of dbSNP locations have more neighbors in omnitigs than in unitigs.

03

The method is safe (every reported string appears in any genome that could have generated the reads) and complete (it reports all such strings).

Abstract

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph $G$ (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from $G$ as contigs? In this paper we finally answer this question, and also give a polynomial-time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.
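For context on the baseline the abstract compares against, here is a minimal Python sketch of unitig extraction from a de Bruijn graph. It assumes the standard definition of unitigs as maximal non-branching paths, which this page does not spell out, and it does not implement the paper's omnitig algorithm. The function names `build_graph` and `unitigs` are hypothetical, and reverse complements and isolated cycles are ignored for brevity.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a node-centric de Bruijn graph: nodes are (k-1)-mers,
    and each distinct k-mer in the reads is one edge."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in sorted(kmers):  # sorted for deterministic output
        prefix, suffix = kmer[:-1], kmer[1:]
        out_edges[prefix].append(suffix)
        in_degree[suffix] += 1
    return dict(out_edges), dict(in_degree)


def unitigs(reads, k):
    """Return the strings spelled by maximal non-branching paths.

    Assumption: a node is 'internal' to a unitig when it has exactly
    one incoming and one outgoing edge. Isolated cycles (where every
    node is internal) are skipped in this sketch.
    """
    out_edges, in_degree = build_graph(reads, k)
    nodes = set(out_edges) | set(in_degree)

    def is_internal(node):
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    result = []
    for start in sorted(nodes):
        if is_internal(start):
            continue  # paths only begin at branching or end nodes
        for nxt in out_edges.get(start, []):
            path = start + nxt[-1]
            # Extend while the path cannot branch in either direction.
            while is_internal(nxt):
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            result.append(path)
    return result


if __name__ == "__main__":
    for contig in unitigs(["ATGGCGTGCA"], 3):
        print(contig)
```

Every unitig is safe in the abstract's sense, since any genome walk that covers the graph must traverse a non-branching path in full. The paper's question is which additional, longer strings share that guarantee.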
