Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search
Kathryn E. Kirchoff, James Wellnitz, Joshua E. Hochuli, Travis Maxfield, Konstantin I. Popov, Shawn Gomez, Alexander Tropsha

TL;DR
This paper explores using low-dimensional chemical embeddings combined with k-d trees to enable rapid, scalable chemical similarity searches on billion-scale databases, achieving significant speedups while maintaining accuracy.
Contribution
It introduces a novel framework combining low-dimensional embeddings and k-d trees for fast chemical similarity search, demonstrating efficiency and competitive accuracy on large datasets.
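To make the idea concrete, the following is a minimal pure-Python sketch of exact nearest-neighbor search with a k-d tree over low-dimensional vectors. It is an illustration only, not the authors' implementation: the 8-dimensional random vectors stand in for learned chemical embeddings, and the names `build_kdtree`, `nearest`, `brute_force_nearest`, and `DIM` are hypothetical.

```python
import random

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree as nested tuples: (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # median split keeps the tree balanced
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, depth=0, best=None):
    """Exact nearest neighbor; returns (squared_distance, point)."""
    if node is None:
        return best
    point, left, right = node
    d = sq_dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

def brute_force_nearest(points, query):
    """Reference answer: compare the query against every point."""
    return min((sq_dist(p, query), p) for p in points)

random.seed(0)
DIM = 8  # assumed embedding dimensionality, chosen only for illustration
embeddings = [tuple(random.random() for _ in range(DIM)) for _ in range(2000)]
tree = build_kdtree(embeddings)
query = tuple(random.random() for _ in range(DIM))

kd_hit = nearest(tree, query)
bf_hit = brute_force_nearest(embeddings, query)
print(kd_hit[0] == bf_hit[0])  # both searches should agree on the distance
```

The pruning test in `nearest` is what makes low dimensionality matter: as the number of dimensions grows, the splitting plane is almost always closer than the current best hit, so fewer subtrees can be skipped and the search degrades toward a full scan.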
Findings
Searches on over one billion chemicals take less than a second on a single CPU core.
The approach is five orders of magnitude faster than brute-force methods.
The SmallSA embedding achieves competitive performance on standard benchmarks.
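A rough back-of-the-envelope calculation suggests why a tree-based search can scale so differently from a linear scan. The sketch below assumes an idealized balanced binary tree and counts only a single root-to-leaf descent; real k-d tree queries also backtrack into neighboring branches, so this is a lower bound on the work, and the speedup reported above is an empirical measurement rather than something this arithmetic derives.

```python
import math

n_chemicals = 1_000_000_000                      # "over one billion" chemicals
brute_force_comparisons = n_chemicals            # a linear scan touches every entry
tree_depth = math.ceil(math.log2(n_chemicals))   # levels in a balanced binary tree

print(brute_force_comparisons)  # 1000000000
print(tree_depth)               # 30
```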
Abstract
Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining…
Taxonomy
Topics: Computational Drug Discovery Methods · Various Chemistry Research Topics · Analytical Chemistry and Chromatography
