TL;DR
This paper introduces a distributed-memory algorithm for contig generation in de novo long-read genome assembly. It expresses the string-graph processing as sparse-matrix operations so that the assembly of large genomes scales across many nodes.
Contribution
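The paper presents a novel distributed-memory approach to contig generation built on sparse matrices: it masks branches in the string graph, computes connected components to group sequences that belong to the same linear chain, and uses multiway number partitioning to balance the load of local assembly.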
Findings
Achieves up to 80% parallel efficiency on 128 nodes.
Produces uniform genome coverage and maintains assembly quality.
Reduces computational load by localizing the assembly process.
Abstract
De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using matrix abstraction, we mask branches in the string graph and compute the connected components to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize the load imbalance in local assembly, i.e., concatenation of sequences from a given…
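The pipeline the abstract outlines (mask branches, find connected components, then partition the resulting groups across processes) can be illustrated with a small single-process sketch. This is not the authors' distributed implementation. It assumes that a branch is a read with three or more overlap neighbours, that masking means dropping every edge touching such a read, and that the number of reads in a group is a reasonable stand-in for its local-assembly cost. The function names `contig_components` and `partition_lpt` are hypothetical, and the partitioning step uses the standard greedy longest-processing-time heuristic, which may differ from the heuristic used in the paper.

```python
from collections import defaultdict
import heapq


def contig_components(num_reads, overlaps):
    """Group reads into linear chains after masking branch vertices.

    overlaps: iterable of undirected (u, v) read pairs, i.e. the nonzeros
    of a symmetric sparse adjacency matrix of the string graph.
    """
    # Row-wise sparse adjacency: rows[u] holds the column indices of row u.
    rows = defaultdict(set)
    for u, v in overlaps:
        rows[u].add(v)
        rows[v].add(u)

    # Assumption: a read with >= 3 neighbours is a branch point.
    branch = {u for u, nbrs in rows.items() if len(nbrs) >= 3}

    # Union-find over the masked graph (edges touching a branch are dropped).
    parent = list(range(num_reads))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, nbrs in rows.items():
        if u in branch:
            continue
        for v in nbrs:
            if v not in branch:
                parent[find(u)] = find(v)

    groups = defaultdict(list)
    for r in range(num_reads):
        if r not in branch:
            groups[find(r)].append(r)
    return sorted(sorted(g) for g in groups.values())


def partition_lpt(sizes, num_procs):
    """Greedy multiway number partitioning: largest item to least-loaded process."""
    heap = [(0, p) for p in range(num_procs)]  # (current load, process id)
    heapq.heapify(heap)
    owner = [None] * len(sizes)
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, p = heapq.heappop(heap)
        owner[i] = p
        heapq.heappush(heap, (load + sizes[i], p))
    return owner


# Toy string graph on 8 reads: read 3 has neighbours 2, 4 and 5, so it is a branch.
overlaps = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6), (6, 7)]
groups = contig_components(8, overlaps)
owner = partition_lpt([len(g) for g in groups], num_procs=2)
```

In this toy example, masking read 3 splits the graph into three chains, and the two larger chains land on different processes. A distributed version would keep the adjacency matrix partitioned across processes and exchange reads so that each group is assembled where it is assigned.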
