TL;DR
This paper introduces algorithms for extracting spaced k-mers from nucleotide sequences that rely on CPU-level bit manipulation instructions, making extraction simpler to implement and up to an order of magnitude faster than existing methods.
Contribution
The authors develop optimized, hardware-aware algorithms for spaced k-mer extraction that outperform existing methods by up to an order of magnitude.
Findings
Algorithms achieve a throughput of up to 750 MB/s per core.
Implementation is simple, fast, and publicly available.
Addresses the extra extraction work that has slowed down existing spaced k-mer methods.
Abstract
Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative genomics. Spaced k-mers generalize this concept by selecting only a subset of positions within a k-mer, improving robustness to mismatches and sequencing errors. While k-mers are computationally highly efficient, spaced k-mers require additional work to be extracted from a sequence, which has slowed down existing methods. Results: We present a collection of efficient algorithms for extracting spaced k-mers from nucleotide sequences, optimized for different hardware architectures. They are based on bit manipulation instructions at CPU level, making them both simpler to implement and up to an order of magnitude faster than existing methods. We further…
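To make the idea of a spaced k-mer concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3) and a mask string such as `1101`, where `1` keeps a position and `0` skips it. The helper names `expand_mask`, `pext`, and `spaced_kmers` are hypothetical. The `pext` function emulates in software a parallel bit extract of the kind offered by some CPUs (for example the x86 BMI2 `PEXT` instruction). Whether the paper uses that exact instruction is an assumption here, since the abstract only mentions bit manipulation instructions at CPU level.

```python
# Assumed 2-bit nucleotide encoding; the paper may use a different one.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def expand_mask(mask: str) -> int:
    """Turn a per-position mask like '1101' into a 2-bit-per-base bit mask."""
    bits = 0
    for ch in mask:
        bits = (bits << 2) | (0b11 if ch == "1" else 0b00)
    return bits


def pext(value: int, mask: int) -> int:
    """Software emulation of a parallel bit extract: gather the bits of
    `value` selected by `mask` into the low bits of the result."""
    result = 0
    out_pos = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1           # clear that bit and continue
    return result


def spaced_kmers(seq: str, mask: str) -> list[int]:
    """Return the encoded spaced k-mer for every window of len(mask) bases."""
    k = len(mask)
    bit_mask = expand_mask(mask)
    window_mask = (1 << (2 * k)) - 1   # keeps only the last k bases
    kmer = 0
    out = []
    for i, base in enumerate(seq):
        # Rolling update: shift in the new base, drop the oldest one.
        kmer = ((kmer << 2) | ENCODE[base]) & window_mask
        if i >= k - 1:
            out.append(pext(kmer, bit_mask))
    return out


if __name__ == "__main__":
    # Windows ACGT, CGTA, GTAC with mask 1101 keep the bases ACT, CGA, GTC.
    print(spaced_kmers("ACGTAC", "1101"))
```

The regular k-mer is maintained with a cheap rolling shift, and the spaced k-mer is derived from it in a separate extraction step. That extraction step is the additional work the abstract refers to, and on hardware with a suitable instruction the `pext` loop would collapse to a single operation.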
