MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

Yang Li; Xifeng Yan

arXiv:1505.06550·q-bio.GN·May 26, 2015

TL;DR

MSPKmerCounter introduces a disk-based, memory-efficient method for large-scale k-mer counting in genome sequencing, utilizing a novel partitioning technique to reduce I/O and memory usage while outperforming existing tools.

Contribution

The paper presents MSPKmerCounter, a novel disk-based approach employing Minimum Substring Partitioning to significantly improve speed and memory efficiency in k-mer counting for large genomes.

Findings

01. Outperforms state-of-the-art k-mer counters in speed and memory usage.
02. Achieves high compression ratios, reducing I/O costs.
03. Effective on large real-life sequencing datasets.

Abstract

A major challenge in next-generation genome sequencing (NGS) is to assemble massive numbers of overlapping short reads that are randomly sampled from DNA fragments. Many leading assembly algorithms depend on one fundamental task: counting the number of occurrences of k-mers (length-k substrings in sequences). The counting results are critical for many components of assembly (e.g. variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach, to efficiently perform k-mer counting for large genomes using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple…
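To make the counting task concrete, here is a minimal Python sketch. It is not the authors' implementation. The first function counts k-mers directly in memory, which is the approach that becomes too costly for large genomes. The second shows one plausible way to partition by minimum substring: every k-mer is routed to a group keyed by its lexicographically smallest length-p substring, so identical k-mers always land in the same group and each group can be counted on its own. The function names, the parameter `p`, and the choice to keep groups in memory rather than on disk are assumptions made for illustration; reverse complements are ignored.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences across all reads, entirely in memory."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def min_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def count_kmers_partitioned(reads, k, p):
    """Group k-mers by minimum p-substring, then count each group separately.

    Hypothetical illustration: a disk-based tool would write each group to
    its own file; here the groups are plain in-memory lists.
    """
    partitions = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            partitions[min_substring(kmer, p)].append(kmer)

    counts = Counter()
    for group in partitions.values():
        # Each group holds every copy of the k-mers assigned to it,
        # so it can be counted without looking at any other group.
        counts.update(group)
    return counts


reads = ["ACGTACGT", "CGTACGTT"]
direct = count_kmers(reads, k=4)
partitioned = count_kmers_partitioned(reads, k=4, p=2)
```

Both functions return the same counts for the toy reads. The difference is that the partitioned version only ever needs one group in memory at a time during the counting step.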

Code & Models

Repositories

10XGenomics/rust-pseudoaligner