High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

Yifan Li; Giulia Guidi

arXiv:2407.07718·cs.DC·July 11, 2024

TL;DR

HySortK is a distributed-memory k-mer counter that uses sorting instead of hash tables and supports flexible hybrid parallelism, improving speed and memory efficiency for large-scale DNA data analysis.

Contribution

This work introduces HySortK, a sorting-based distributed k-mer counter with a flexible parallelism model, outperforming existing hash-based methods in speed and memory usage.

Findings

01

Achieves a 2-10x speedup over the GPU baseline on multiple nodes.

02

Reduces peak memory usage by 30% compared to state-of-the-art CPU software.

03

Provides up to 1.8x speedup when integrated into genome assembly pipelines.

Abstract

In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of fixed-length k DNA subsequences, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, the scaling of the counting process is critical. In the literature, distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. They often also lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication…
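To make the contrast with hash tables concrete, here is a minimal single-process sketch of sorting-based k-mer counting: extract every k-mer, sort the list, then count runs of equal neighbors. It is an illustration only and not HySortK's implementation; the function name `count_kmers_by_sorting` is hypothetical, and the sketch leaves out distribution across nodes, communication, and the canonical (reverse-complement) k-mer handling that production counters typically apply.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting them and measuring runs of equal entries.

    Hypothetical helper for illustration; not part of HySortK.
    """
    # Extract every length-k substring from every read.
    kmers = [
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    ]
    # After sorting, identical k-mers are adjacent, so a single linear
    # scan over contiguous memory yields the counts without a hash table.
    kmers.sort()
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(kmers)}


if __name__ == "__main__":
    for kmer, count in count_kmers_by_sorting(["ACGTACG", "CGTA"], 3).items():
        print(kmer, count)
```

The counting pass reads the sorted array sequentially, which is the cache-friendliness argument the abstract makes against hash tables, where each update touches an effectively random memory location.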

Code & Models

Repositories

CornellHPC/HySortK (official)
