Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN
Mohammad Saleh Refahi, Gavin Hearne, Harrison Muller, Kieran Lynch, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen

TL;DR
This paper compares FAISS and ScaNN for large-scale similarity search over gene embeddings, showing that both are efficient and effective for bioinformatics applications relative to traditional methods.
Contribution
It provides a systematic evaluation of FAISS and ScaNN for gene embedding search, highlighting their advantages in speed, memory use, and retrieval quality for bioinformatics.
Findings
FAISS and ScaNN outperform traditional methods in speed and memory efficiency.
Embedding-based search improves detection of novel and divergent sequences.
ScaNN shows superior performance in certain bioinformatics benchmarks.
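To make the idea of embedding-based search concrete, the following is a minimal sketch of exact top-k cosine similarity search in pure Python. It is not the paper's pipeline and does not use the FAISS or ScaNN APIs. It only illustrates the brute-force operation that such libraries are designed to accelerate or approximate. The gene names, the 4-dimensional vectors, and the helper names `normalize` and `top_k_cosine` are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_cosine(query, database, k):
    """Exact top-k search: score the query against every database vector."""
    q = normalize(query)
    scored = []
    for gene_id, vec in database.items():
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, gene_id))
    # heapq.nlargest returns the k highest-scoring pairs, best first.
    return [(gene_id, score) for score, gene_id in heapq.nlargest(k, scored)]


# Hypothetical toy embeddings (assumption); real gene embeddings
# would come from a trained model and have far more dimensions.
toy_embeddings = {
    "geneA": [1.0, 0.0, 0.0, 0.0],
    "geneB": [0.9, 0.1, 0.0, 0.0],
    "geneC": [0.0, 1.0, 0.0, 0.0],
    "geneD": [0.0, 0.0, 1.0, 0.0],
}

results = top_k_cosine([1.0, 0.05, 0.0, 0.0], toy_embeddings, k=2)
for gene_id, score in results:
    print(f"{gene_id}: {score:.4f}")
```

This exhaustive scan costs time proportional to the number of database vectors for every query, which is why indexed or approximate search becomes attractive once the collection grows large.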
Abstract
The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences. In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries, FAISS and ScaNN, on biologically meaningful gene embeddings. Unlike prior…
Taxonomy
Topics: Gene expression and cancer classification · Machine Learning in Bioinformatics
