Advances in practical k-mer sets: essentials for the curious
Camille Marchet

TL;DR
This survey reviews various data structures for representing k-mer sets in sequencing, focusing on methods using hashing or lexicographic properties, and discusses recent improvements in efficiency and capabilities.
Contribution
It categorizes and compares existing data structures for k-mer sets, highlighting recent advancements and providing a comprehensive overview for researchers and practitioners.
Findings
Hashing and lexicographic methods are the main strategies.
Recent advancements improve memory efficiency and query speed.
The surveyed structures support key operations such as membership queries and dynamic updates.
Abstract
This paper provides a comprehensive survey of data structures for representing k-mer sets, which are fundamental in high-throughput sequencing analysis. It categorizes the methods into two main strategies: those using fingerprinting and hashing for compact storage, and those leveraging lexicographic properties for efficient representation. The paper reviews key operations supported by these structures, such as membership queries and dynamic updates, and highlights recent advancements in memory efficiency and query speed. A companion paper explores colored k-mer sets, which extend these concepts to integrate multiple datasets or genomes.
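To make the two strategies concrete, here is a minimal Python sketch of each. It is a toy illustration written for this summary, not a structure taken from the survey: the class names `HashedKmerSet` and `SortedKmerSet`, the helper functions, and the example sequence are all hypothetical. It also ignores reverse complements (canonical k-mers), which practical tools usually handle.

```python
from bisect import bisect_left

# 2-bit nucleotide codes, chosen so that integer order matches
# lexicographic order on the strings (A < C < G < T).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def kmers(sequence: str, k: int):
    """Yield every k-mer of a sequence, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class HashedKmerSet:
    """Hashing strategy (toy): a hash set of encoded k-mers, updatable."""

    def __init__(self, k: int):
        self.k = k
        self._codes = set()

    def add(self, kmer: str) -> None:
        self._codes.add(encode_kmer(kmer))

    def __contains__(self, kmer: str) -> bool:
        return len(kmer) == self.k and encode_kmer(kmer) in self._codes


class SortedKmerSet:
    """Lexicographic strategy (toy): static sorted list, binary search."""

    def __init__(self, k: int, kmer_iter):
        self.k = k
        # Deduplicate, then sort once; the list is not modified afterwards.
        self._codes = sorted({encode_kmer(x) for x in kmer_iter})

    def __len__(self) -> int:
        return len(self._codes)

    def __contains__(self, kmer: str) -> bool:
        if len(kmer) != self.k:
            return False
        code = encode_kmer(kmer)
        i = bisect_left(self._codes, code)
        return i < len(self._codes) and self._codes[i] == code


# Example: index the 4-mers of a short, made-up sequence.
sequence = "ACGTACGTGA"
k = 4

hashed = HashedKmerSet(k)
for kmer in kmers(sequence, k):
    hashed.add(kmer)

sorted_set = SortedKmerSet(k, kmers(sequence, k))
```

Both objects answer membership queries with `"ACGT" in hashed` or `"ACGT" in sorted_set`. Only the hashed version accepts new k-mers after construction, which mirrors the distinction between dynamic and static structures; the sorted version trades updates for a layout in which lexicographically close k-mers sit next to each other.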
Taxonomy
Topics: Optimization and Search Problems · Optimization and Packing Problems
