Advances in practical k-mer sets: essentials for the curious

Camille Marchet

arXiv:2409.05210 · q-bio.GN · September 13, 2024

TL;DR

This survey reviews data structures for representing k-mer sets in high-throughput sequencing analysis, focusing on methods based on fingerprinting and hashing or on lexicographic properties, and discusses recent improvements in memory efficiency and query speed.

Contribution

It categorizes and compares existing data structures for k-mer sets, highlighting recent advancements and providing a comprehensive overview for researchers and practitioners.

Findings

01

The two main strategies are fingerprinting and hashing for compact storage, and lexicographic properties for efficient representation.

02

Recent advancements improve memory efficiency and query speed.

03

The surveyed structures support key operations such as membership queries and dynamic updates.

Abstract

This paper provides a comprehensive survey of data structures for representing k-mer sets, which are fundamental in high-throughput sequencing analysis. It categorizes the methods into two main strategies: those using fingerprinting and hashing for compact storage, and those leveraging lexicographic properties for efficient representation. The paper reviews key operations supported by these structures, such as membership queries and dynamic updates, and highlights recent advancements in memory efficiency and query speed. A companion paper explores colored k-mer sets, which extend these concepts to integrate multiple datasets or genomes.
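
To make the two strategies concrete, here is a minimal Python sketch, not taken from the paper. It stores the same k-mer set twice: once in a hash set (the hashing strategy) and once as a sorted list searched by binary search (a very simple stand-in for structures that exploit lexicographic order). The helper names `canonical`, `kmers`, and `contains_sorted`, the toy sequence, and the choice to store canonical k-mers (the smaller of a k-mer and its reverse complement) are assumptions made for illustration; the structures surveyed in the paper are far more compact.

```python
from bisect import bisect_left

# Translation table for complementing DNA bases (assumes an ACGT-only alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int):
    """Yield the canonical form of every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def contains_sorted(sorted_kmers: list, kmer: str) -> bool:
    """Membership query on a lexicographically sorted k-mer list via binary search."""
    query = canonical(kmer)
    i = bisect_left(sorted_kmers, query)
    return i < len(sorted_kmers) and sorted_kmers[i] == query


# Hashing strategy: Python's built-in set is a hash table.
hashed = set(kmers("ACGTTGCA", 3))

# Lexicographic strategy: keep the k-mers sorted and search them by order.
ordered = sorted(hashed)

# Membership queries give the same answer in both representations.
print(canonical("TTG") in hashed)       # hash lookup
print(contains_sorted(ordered, "TTG"))  # binary search

# Dynamic update: inserting into the hash set is cheap, whereas the sorted
# list would have to be re-sorted or rebuilt to stay searchable.
hashed.add(canonical("AAA"))
```

The sketch also hints at the trade-off the survey discusses: the hash set supports updates directly, while the sorted representation is static but ordered, which is what makes it amenable to compression.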

Taxonomy

Topics: Optimization and Search Problems · Optimization and Packing Problems