Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

Giulia Guidi; Oguz Selvitopi; Marquita Ellis; Leonid Oliker; Katherine Yelick; Aydin Buluc

arXiv:2010.10055 · cs.DC · October 21, 2020

TL;DR

This paper introduces distributed-memory parallel algorithms for overlap detection and transitive reduction in de novo genome assembly with long reads, reaching near-linear scaling with over 80% efficiency on the human genome.

Contribution

The authors develop distributed-memory algorithms for overlap detection and layout simplification, built on linear-algebra operations over semirings with 2D distributed sparse matrices, and implement them in the diBELLA 2D pipeline.

Findings

1. Achieves near-linear scaling with over 80% efficiency on the human genome.
2. Reduces the runtime of overlap detection by 1.2-1.3x for the human genome.
3. Transitive reduction outperforms existing methods by up to 29x.

Abstract

One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string…
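
As a rough illustration of the two steps the abstract names, the Python sketch below counts shared k-mers between reads (the entries of a reads-by-k-mers matrix multiplied by its transpose) and then removes transitive edges using a min-plus "squaring" of the overlap graph. It is not the diBELLA 2D implementation: it runs on a single machine, uses plain dictionaries in place of 2D distributed sparse matrices, and applies a simplified transitivity rule. The function names `candidate_overlaps` and `transitive_reduction`, the `fuzz` parameter, and the toy reads and edge weights are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_overlaps(reads, k):
    """Count shared k-mers per read pair.

    Conceptually this is the upper triangle of A * A^T over the ordinary
    (+, *) semiring, where A[i][m] = 1 if read i contains k-mer m.
    """
    # Store A column-wise: k-mer -> ids of reads containing it.
    columns = defaultdict(set)
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            columns[km].add(rid)

    shared = defaultdict(int)
    for rids in columns.values():
        for i, j in combinations(sorted(rids), 2):
            shared[(i, j)] += 1
    return dict(shared)


def transitive_reduction(edges, fuzz=0):
    """Drop edges that are explained by a two-hop path.

    edges maps (u, v) -> weight (for example, an overhang length).
    An edge is treated as transitive when the cheapest two-hop path
    u -> m -> v, computed as a min-plus product of the graph with
    itself, has weight <= w(u, v) + fuzz. This is a simplified rule
    chosen for the sketch.
    """
    out = defaultdict(dict)
    for (u, v), w in edges.items():
        out[u][v] = w

    # Min-plus "square": two_hop[(u, v)] = min over m of w(u, m) + w(m, v).
    two_hop = {}
    for u, nbrs in out.items():
        for m, w1 in nbrs.items():
            for v, w2 in out.get(m, {}).items():
                cand = w1 + w2
                if cand < two_hop.get((u, v), float("inf")):
                    two_hop[(u, v)] = cand

    return {
        e: w for e, w in edges.items()
        if two_hop.get(e, float("inf")) > w + fuzz
    }


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # toy data
    print(candidate_overlaps(reads, 3))

    overlap_graph = {(0, 1): 2, (1, 2): 2, (0, 2): 4}  # toy weights
    print(transitive_reduction(overlap_graph))
```

In the toy graph, the edge from read 0 to read 2 is removed because the path through read 1 accounts for it, leaving only the two direct neighbor edges. A distributed version would replace the dictionary loops with sparse matrix products over the corresponding semirings.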
