TL;DR
MSPKmerCounter is a disk-based, memory-efficient k-mer counter for large-scale genome sequencing data. It uses Minimum Substring Partitioning (MSP) to reduce I/O and memory usage, and it outperforms existing tools.
Contribution
The paper presents MSPKmerCounter, a novel disk-based approach employing Minimum Substring Partitioning to significantly improve speed and memory efficiency in k-mer counting for large genomes.
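To make the partitioning idea concrete, here is a minimal in-memory sketch of minimum substring partitioning as it is commonly described: adjacent k-mers in a read that share the same lexicographically smallest length-p substring are merged into one longer chunk, and chunks are routed to partitions by that substring. The function names, the `crc32` bucketing, and the parameter `num_partitions` are assumptions made for illustration. This is not the authors' implementation, which writes partitions to disk.

```python
import zlib
from collections import defaultdict

def min_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

def partition_read(read, k, p):
    """Yield (key, chunk) pairs for one read.

    Each chunk covers a run of adjacent k-mers that share the same
    minimum p-substring (the key).
    """
    n_kmers = len(read) - k + 1
    if n_kmers <= 0:
        return
    start = 0
    current = min_substring(read[0:k], p)
    for i in range(1, n_kmers):
        m = min_substring(read[i:i + k], p)
        if m != current:
            # k-mers start..i-1 share `current`; together they span read[start:i-1+k].
            yield current, read[start:i - 1 + k]
            start, current = i, m
    yield current, read[start:n_kmers - 1 + k]

def partition_reads(reads, k, p, num_partitions):
    """Route chunks to partitions by a deterministic hash of their key (illustrative)."""
    partitions = defaultdict(list)
    for read in reads:
        for key, chunk in partition_read(read, k, p):
            bucket = zlib.crc32(key.encode()) % num_partitions
            partitions[bucket].append(chunk)
    return partitions

chunks = list(partition_read("ACGTACGT", k=4, p=2))
```

Identical k-mers always have the same minimum p-substring, so they land in the same partition, and each partition can then be counted independently with a small memory footprint. Storing merged chunks instead of individual k-mers is also what keeps the partitions compact.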
Findings
Outperforms state-of-the-art k-mer counters in speed and memory usage.
Achieves high compression ratios, which reduces I/O costs.
Effective on large real-life sequencing datasets.
Abstract
A major challenge in next-generation genome sequencing (NGS) is to assemble massive overlapping short reads that are randomly sampled from DNA fragments. To complete the assembly, one needs to finish a fundamental task in many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings in sequences). The counting results are critical for many components in assembly (e.g., variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach, to efficiently perform k-mer counting for large genomes using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple…
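As a point of reference for the counting task itself, the following is a minimal sketch of naive in-memory k-mer counting. The function name `count_kmers` and the sample reads are hypothetical, and the sketch ignores reverse complements, which practical counters usually fold together. It shows why memory becomes the bottleneck: every distinct k-mer needs its own table entry.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

With DNA's four-letter alphabet there are up to 4^k possible k-mers, so the table for a large genome can grow far beyond the memory of a commodity server, which is the problem a disk-based approach addresses.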
