TL;DR
HySortK is a distributed-memory k-mer counter that uses sorting instead of hash tables and supports flexible hybrid parallelism, improving both speed and memory efficiency for large-scale DNA data analysis.
Contribution
This work introduces HySortK, a sorting-based distributed-memory k-mer counter with a flexible parallelism model that outperforms existing hash-based methods in both speed and memory usage.
Findings
Achieves 2-10x speedup over GPU baseline on multiple nodes.
Reduces peak memory usage by 30% compared to state-of-the-art CPU software.
Provides up to 1.8x speedup when integrated into genome assembly pipelines.
Abstract
High-throughput sequencing technologies generate large quantities of DNA data and therefore require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, the scaling of the counting process is critical. In the literature, distributed-memory software relies on hash tables, which exhibit poor cache friendliness and consume excessive memory. These tools also often lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed-memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication…
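To make the sorting-based idea concrete, here is a minimal single-process sketch in Python. It is not HySortK's implementation: the function name `count_kmers_by_sorting` and the toy reads are assumptions chosen for illustration, and the sketch omits everything distributed (communication, task scheduling, load balancing) as well as refinements such as canonical k-mers. It shows only the core contrast with a hash table: extract all k-mers, sort them so that equal k-mers become adjacent, then count each run in one linear scan.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting (illustrative sketch, not HySortK's code).

    reads: iterable of DNA strings; k: k-mer length.
    Returns a list of (kmer, count) pairs in sorted k-mer order.
    """
    # Step 1: extract every length-k window from every read.
    kmers = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.append(read[i:i + k])

    # Step 2: sort so that identical k-mers sit next to each other.
    kmers.sort()

    # Step 3: one sequential pass over the sorted list counts each run.
    return [(kmer, sum(1 for _ in run)) for kmer, run in groupby(kmers)]


# Toy input (hypothetical data, for illustration only).
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers_by_sorting(reads, 3)
print(counts)
```

The counting pass reads memory sequentially, which is the kind of access pattern that tends to be friendlier to caches than the scattered lookups of a hash table. A read shorter than k simply contributes no k-mers.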
