Scalable De Novo Genome Assembly Using Pregel
Da Yan, Hongzhi Chen, James Cheng, Zhenkun Cai, Bin Shao

TL;DR
This paper introduces PPA-assembler, a distributed toolkit for de novo genome assembly built on Pregel. On large real and simulated datasets it is much more efficient than existing assemblers while providing good sequencing quality.
Contribution
It presents a scalable, Pregel-based toolkit for de novo genome assembly that offers strong performance guarantees and flexibility for various sequencing strategies.
Findings
Outperforms existing methods in efficiency on large datasets
Maintains high sequencing quality
Demonstrates scalability and flexibility in a distributed environment
Abstract
De novo genome assembly is the process of stitching short DNA sequences to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-assembler, for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees, and can be assembled to implement various sequencing strategies. PPA-assembler adopts the popular de Bruijn graph based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework for big graph processing. Experiments on large real and simulated datasets demonstrate that PPA-assembler is much more efficient than state-of-the-art assemblers and provides good sequencing quality.
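Example (illustrative only)
To make the de Bruijn graph approach and the vertex-centric style of Pregel more concrete, here is a minimal single-process sketch in Python. It is not PPA-assembler's code and does not reproduce any of its operations. The function names (build_de_bruijn, in_degrees_by_supersteps, spell_contig) are hypothetical, and the sketch assumes error-free reads and ignores reverse complements, which real assemblers must handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return adjacency lists mapping each (k-1)-mer to its successors.

    Every distinct k-mer found in the reads contributes one edge from its
    (k-1)-length prefix to its (k-1)-length suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
        graph.setdefault(kmer[1:], [])  # make sure sink vertices exist too
    return dict(graph)


def in_degrees_by_supersteps(graph):
    """Toy vertex-centric computation in the style of Pregel.

    Superstep 0: every vertex sends the message 1 along each out-edge.
    Superstep 1: every vertex sums the messages it received.
    A real Pregel system would run the vertices in parallel across machines.
    """
    inbox = defaultdict(list)
    for vertex, successors in graph.items():  # superstep 0
        for succ in successors:
            inbox[succ].append(1)
    return {vertex: sum(inbox[vertex]) for vertex in graph}  # superstep 1


def spell_contig(graph, start):
    """Extend a sequence from `start` along the unambiguous path.

    The walk continues while the current vertex has exactly one successor
    and that successor has exactly one predecessor.
    """
    indeg = in_degrees_by_supersteps(graph)
    contig, current, seen = start, start, {start}
    while len(graph[current]) == 1:
        nxt = graph[current][0]
        if indeg[nxt] != 1 or nxt in seen:  # branch point or cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        current = nxt
    return contig


if __name__ == "__main__":
    reads = ["ATGGC", "GGCGT"]  # two made-up reads that overlap on "GGC"
    graph = build_de_bruijn(reads, k=3)
    print(spell_contig(graph, "AT"))
```

With these two made-up reads and k = 3, the graph is a single unbranched path, so the walk from "AT" spells out one sequence that covers both reads. In a distributed setting the same idea is expressed per vertex, with messages exchanged between supersteps instead of a sequential loop.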
Taxonomy
Topics: Genomics and Phylogenetic Studies · Graph Theory and Algorithms · Algorithms and Data Compression
