IDentity with Locality: An ideal hash for gene sequence search
Aditya Desai, Gaurav Gupta, Tianyi Zhang, Anshumali Shrivastava

TL;DR
This paper introduces the IDL hash family, improving gene sequence search efficiency by enhancing cache locality and reducing cache misses, leading to faster query and indexing times in hashing-based genomic search systems.
Contribution
The paper proposes the IDL hash family as a drop-in replacement for random hash functions, significantly boosting cache efficiency and system performance in gene sequence search.
Findings
Replacing Random Hash (RH) functions with IDL reduces cache misses by a factor of 5.
Query and indexing times improve by up to 2x.
Theoretical analysis confirms that false positive rates are maintained.
Abstract
Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). The state-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems testing if each subsequent gene substring (called kmer) of Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and the computational advantage of BF, are also detrimental to the system performance of gene-search systems. While subsequent kmers being queried are likely very similar, RH, oblivious to any…
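The standard recipe described in the abstract (split the query Q into kmers and test each for membership in a Bloom Filter built over the database's kmers) can be sketched as follows. This is a minimal illustration of the RH baseline, not the IDL hash and not the COBS or RAMBO implementation. The class and function names, the filter size, and the use of salted BLAKE2b as a stand-in for a random hash are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom Filter; salted BLAKE2b stands in for a random hash (assumption)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One bit position per hash function; each salt gives an independent hash.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(sequence, k):
    """Return every length-k substring of the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def index_sequence(bf, sequence, k):
    """Insert all kmers of a database sequence into the filter."""
    for kmer in kmers(sequence, k):
        bf.add(kmer)


def query_sequence(bf, query, k):
    """Report a match only if every kmer of the query passes the membership test."""
    return all(kmer in bf for kmer in kmers(query, k))


# Hypothetical parameters chosen only for the illustration.
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
index_sequence(bf, "ACGTACGTGGA", k=4)
found = query_sequence(bf, "GTACGT", k=4)  # every kmer of the query was indexed
```

Consecutive kmers of a query overlap in all but one character, yet a random hash sends them to unrelated bit positions, so each membership test tends to touch a different region of memory. That access pattern is the source of the cache misses the paper targets.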
Taxonomy
Topics: Gene expression and cancer classification · Algorithms and Data Compression · Genomics and Phylogenetic Studies
Methods: Sparse Evolutionary Training · BLOOM
