Nearest Keyword Set Search in Multi-dimensional Datasets
Vishwakarma Singh, Ambuj K. Singh

TL;DR
This paper introduces ProMiSH, a novel hashing-based method for keyword-based search in multi-dimensional datasets, significantly improving speed and scalability over existing techniques.
Contribution
ProMiSH is a new projection and hashing method that enables fast, scalable exact and approximate keyword group searches in high-dimensional datasets.
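To make "projection and hashing" concrete, here is a minimal sketch of the general idea: project each point onto a random unit vector and bucket it by the projected value. This is not the paper's index structure. The function name `build_buckets`, the `width` parameter, and the single-projection layout are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def build_buckets(points, width, seed=0):
    """Hash points into buckets along one random unit direction.

    Hypothetical helper for illustration only; not ProMiSH's actual index.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Draw a random direction and normalize it to unit length.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    direction = [x / norm for x in direction]

    buckets = defaultdict(list)
    for index, point in enumerate(points):
        # Scalar projection of the point onto the unit direction.
        projection = sum(a * b for a, b in zip(point, direction))
        buckets[math.floor(projection / width)].append(index)
    return buckets


points = [(0.0, 0.0), (0.5, 0.0), (100.0, 100.0)]
buckets = build_buckets(points, width=1.0)
```

Because the direction has unit length, two points at distance r project to values at most r apart, so with `width >= r` they land in the same bucket or in adjacent ones. That property is what lets a bucket lookup stand in for a spatial search over nearby points.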
Findings
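In the authors' experiments on real and synthetic datasets, ProMiSH is more than four orders of magnitude (over 10,000x) faster than state-of-the-art tree-based techniques.
It scales linearly with dataset size, dimension, query size, and result size.
Scalability was tested on datasets of up to 10 million points and up to 100 dimensions, with queries of up to 9 keywords.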
Abstract
Keyword-based search in text-rich multi-dimensional datasets facilitates many novel applications and tools. In this paper, we consider objects that are tagged with keywords and are embedded in a vector space. For these datasets, we study queries that ask for the tightest groups of points satisfying a given set of keywords. We propose a novel method called ProMiSH (Projection and Multi Scale Hashing) that uses random projection and hash-based index structures, and achieves high scalability and speedup. We present an exact and an approximate version of the algorithm. Our empirical studies, both on real and synthetic datasets, show that ProMiSH has a speedup of more than four orders of magnitude over state-of-the-art tree-based techniques. Our scalability tests on datasets of sizes up to 10 million and dimensions up to 100 for queries having up to 9 keywords show that ProMiSH scales linearly with the…
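The query described in the abstract can be illustrated with a brute-force baseline. This is a sketch under stated assumptions, not the paper's algorithm: it assumes "tightest" means the smallest diameter (the largest pairwise Euclidean distance within the group), and the function name `nearest_keyword_set` and the data layout are hypothetical.

```python
from itertools import product
from math import dist


def nearest_keyword_set(points, keywords, query):
    """Return (group, diameter) for the tightest group covering every query keyword.

    Brute force for illustration: try one candidate point per query keyword.
    """
    # For each query keyword, list the indices of points tagged with it.
    candidates = [
        [i for i, tags in enumerate(keywords) if q in tags]
        for q in query
    ]
    if any(not c for c in candidates):
        return None, float("inf")  # some keyword matches no point

    best_group, best_diameter = None, float("inf")
    for combo in product(*candidates):
        group = sorted(set(combo))  # one point may cover several keywords
        # Assumed tightness measure: largest pairwise distance in the group.
        diameter = max(
            (dist(points[a], points[b]) for a in group for b in group),
            default=0.0,
        )
        if diameter < best_diameter:
            best_group, best_diameter = group, diameter
    return best_group, best_diameter


points = [(0, 0), (1, 0), (10, 10), (10, 12), (0.5, 0)]
keywords = [{"a"}, {"b"}, {"a"}, {"b", "c"}, {"c"}]
group, diameter = nearest_keyword_set(points, keywords, ["a", "b", "c"])
```

The number of combinations grows as the product of the candidate-list sizes, which is why exhaustive enumeration does not scale and an index-based method is needed for large datasets.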
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Data Management and Algorithms · Algorithms and Data Compression
