Clustered Hierarchical Entropy-Scaling Search of Astronomical and Biological Data
Najib Ishaq, George Student, Noah M. Daniels

TL;DR
The paper introduces CHESS, a hierarchical search algorithm leveraging geometric properties of astronomical and biological data for efficient approximate nearest neighbors search, outperforming existing methods in speed and flexibility.
Contribution
CHESS is a novel hierarchical search method that exploits metric entropy and fractal dimensionality, offering significant speedups and flexibility over existing approximate search techniques.
Findings
13.6x speedup over linear search on APOGEE data
68x speedup on GreenGenes data
Asymptotically fewer distance comparisons than FALCONN on APOGEE
Abstract
Both astronomy and biology are experiencing explosive growth of data, resulting in a "big data" problem that stands in the way of a "big data" opportunity for discovery. One common question asked of such data is that of approximate search (nearest neighbors search). We present a hierarchical search algorithm for such data sets that takes advantage of particular geometric properties apparent in both astronomical and biological data sets, namely the metric entropy and fractal dimensionality of the data. We present CHESS (Clustered Hierarchical Entropy-Scaling Search), a search tool with virtually no loss in specificity or sensitivity, demonstrating a 13.6x speedup over linear search on the Sloan Digital Sky Survey's APOGEE data set and a 68x speedup on the GreenGenes 16S metagenomic data set, as well as asymptotically fewer distance comparisons on APOGEE when compared…
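The abstract describes the approach only at a high level, so the following is a minimal Python sketch of the general idea behind hierarchical clustered search, not the CHESS implementation. It assumes Euclidean distance, a simple two-pole binary split, and exact range search; the names `Cluster`, `range_search`, `max_depth`, and `min_size` are hypothetical and chosen for illustration. The key step is the triangle-inequality test, which lets the search skip any cluster whose ball cannot intersect the query ball.

```python
import math
import random


def dist(a, b):
    """Euclidean distance (an assumption; any metric would work here)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Cluster:
    """A node in a binary cluster tree: a center, a radius, and two children."""

    def __init__(self, points, depth=0, max_depth=8, min_size=4):
        self.points = points
        self.center = points[0]  # arbitrary representative point (illustrative)
        self.radius = max(dist(self.center, p) for p in points)
        self.children = []
        if depth < max_depth and len(points) > min_size and self.radius > 0:
            # Pick two far-apart "poles" and assign each point to the nearer one.
            left_pole = max(points, key=lambda p: dist(self.center, p))
            right_pole = max(points, key=lambda p: dist(left_pole, p))
            left, right = [], []
            for p in points:
                if dist(p, left_pole) <= dist(p, right_pole):
                    left.append(p)
                else:
                    right.append(p)
            self.children = [
                Cluster(left, depth + 1, max_depth, min_size),
                Cluster(right, depth + 1, max_depth, min_size),
            ]


def range_search(cluster, query, r, counter):
    """Return all points within distance r of query, pruning whole clusters."""
    counter[0] += 1  # one distance computation to the cluster center
    d = dist(query, cluster.center)
    if d > cluster.radius + r:
        # Triangle inequality: every point p in this cluster satisfies
        # dist(query, p) >= d - radius > r, so nothing here can match.
        return []
    if not cluster.children:
        counter[0] += len(cluster.points)
        return [p for p in cluster.points if dist(query, p) <= r]
    hits = []
    for child in cluster.children:
        hits.extend(range_search(child, query, r, counter))
    return hits


# Usage: compare against a linear scan on synthetic 2-D data.
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(500)]
root = Cluster(data)
query, r = (0.5, 0.5), 0.1
counter = [0]
hits = range_search(root, query, r, counter)
linear = [p for p in data if dist(query, p) <= r]
```

In this sketch the pruned search returns the same points as the linear scan, and `counter[0]` records how many distance computations were made. How much work pruning saves depends on how tightly the data clusters, which is the geometric property the abstract refers to.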
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Genomics and Phylogenetic Studies
