Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
Takeshi Ogasawara, Yinhe Cheng, Tzy-Hwa Kathy Tzeng

TL;DR
sam2bam is a high-performance framework that accelerates NGS data pre-processing, cutting duplicate-marking runtime by more than 150x on a single node by exploiting multiple cores, large memory, and hardware compression accelerators.
Contribution
It introduces a parallel software framework for NGS data pre-processing, extensible through plug-in tools, that reduces duplicate-marking runtime by 156-186x compared with de facto standard tools.
Findings
Reduced duplicate marking runtime by 156-186x
Processed whole-exome data in about one minute
Processed whole-genome data in about nine minutes
Abstract
This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing data. sam2bam is especially efficient on single-node, multi-core, large-memory systems. It can reduce the runtime of marking duplicate reads on a single-node system by 156-186x compared with de facto standard tools. sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and, if available, hardware compression accelerators. As a basic feature, sam2bam converts between well-known genome file formats, from SAM to BAM. Additional features such as analyzing, filtering, and converting the input data are provided by plug-in tools, e.g., duplicate marking, which can be…
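Duplicate marking, which the abstract names as an example plug-in, can be illustrated with a minimal Python sketch. This is not sam2bam's implementation. It assumes single-end reads, groups them by reference name, leftmost position, and strand (ignoring the clipping adjustments that production tools apply), keeps the read with the highest base-quality sum in each group, and sets the SAM duplicate flag bit (0x400) on the rest. The function names `mark_duplicates` and `base_quality_sum` are hypothetical.

```python
from collections import defaultdict

# Flag bits defined by the SAM format.
UNMAPPED_FLAG = 0x4    # read is unmapped
REVERSE_FLAG = 0x10    # read is mapped to the reverse strand
DUP_FLAG = 0x400       # read is a PCR or optical duplicate


def base_quality_sum(qual):
    """Sum Phred+33 base qualities; '*' means qualities are absent."""
    if qual == "*":
        return 0
    return sum(ord(c) - 33 for c in qual)


def mark_duplicates(sam_lines):
    """Return SAM lines with the duplicate flag set on redundant reads.

    Simplifying assumption: reads are single-end and are duplicates when
    they share (RNAME, POS, strand). Clipping and mate pairs are ignored.
    """
    records = []                 # header strings or lists of fields
    groups = defaultdict(list)   # (rname, pos, is_reverse) -> record indices

    for idx, line in enumerate(sam_lines):
        line = line.rstrip("\n")
        if line.startswith("@"):
            records.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        records.append(fields)
        flag = int(fields[1])
        if flag & UNMAPPED_FLAG:
            continue              # unmapped reads are never marked
        key = (fields[2], int(fields[3]), bool(flag & REVERSE_FLAG))
        groups[key].append(idx)

    for indices in groups.values():
        # Keep the read with the highest quality sum (first one on ties).
        best = max(indices, key=lambda i: base_quality_sum(records[i][10]))
        for i in indices:
            if i != best:
                records[i][1] = str(int(records[i][1]) | DUP_FLAG)

    return [r if isinstance(r, str) else "\t".join(r) for r in records]
```

Each position group is independent of the others, so in principle the per-group loop could be distributed across cores. Whether sam2bam partitions the work this way is not stated in the text above.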
