Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

Giulia Guidi; Gabriel Raulet; Daniel Rokhsar; Leonid Oliker; Katherine Yelick; Aydin Buluc

arXiv:2207.04350·cs.DC·July 12, 2022

TL;DR

This paper introduces a scalable distributed-memory algorithm for contig generation in de novo long-read genome assembly, improving efficiency and scalability for large genomes using matrix-based graph processing.

Contribution

It presents a novel matrix-based distributed-memory approach for contig generation, enhancing scalability and efficiency in long-read genome assembly.

Findings

01. Achieves up to 80% parallel efficiency on 128 nodes.

02. Produces uniform genome coverage and maintains assembly quality.

03. Reduces computational load by localizing the assembly process.

Abstract

De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using matrix abstraction, we mask branches in the string graph and compute the connected component to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize the load imbalance in local assembly, i.e., concatenation of sequences from a given…
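
The three steps named in the abstract (masking branches, grouping reads by connected component, and multiway number partitioning) can be illustrated on a toy graph. The following is a minimal single-process sketch, not the paper's distributed sparse-matrix implementation. It assumes that a branch is any vertex of degree three or more, that masking removes every edge touching such a vertex, and that partitioning uses the greedy longest-processing-time heuristic; all function names are hypothetical.

```python
import heapq
from collections import defaultdict


def mask_branches(edges, n):
    """Drop every edge that touches a vertex of degree >= 3 (assumed branch rule)."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [(u, v) for u, v in edges if degree[u] <= 2 and degree[v] <= 2]


def connected_components(edges, n):
    """Group vertices into components with a simple union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    groups = defaultdict(list)
    for v in range(n):
        groups[find(v)].append(v)
    return sorted(groups.values())


def partition_lpt(sizes, k):
    """Greedy multiway number partitioning: largest item goes to the lightest bin."""
    heap = [(0, p) for p in range(k)]  # (current load, processor id)
    assignment = {}
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, p = heapq.heappop(heap)
        assignment[idx] = p
        heapq.heappush(heap, (load + sizes[idx], p))
    loads = [0] * k
    for idx, p in assignment.items():
        loads[p] += sizes[idx]
    return assignment, loads


# Toy string graph on 8 reads; read 2 has degree 3 and is treated as a branch.
toy_edges = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (6, 7)]
linear_edges = mask_branches(toy_edges, 8)
components = connected_components(linear_edges, 8)
# Keep only chains with at least two reads as candidate contigs (an assumption).
contigs = [c for c in components if len(c) >= 2]
assignment, loads = partition_lpt([len(c) for c in contigs], k=2)
print(contigs, loads)
```

In this sketch each surviving component is a linear chain, so a processor that receives all reads of a component can concatenate them locally. Using the read count as the contig size is a simplification; any per-contig cost estimate could be passed to `partition_lpt` instead.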
