Semantic Indexes for Machine Learning-based Queries over Unstructured Data
Daniel Kang, John Guibas, Peter Bailis, Tatsunori Hashimoto, Matei Zaharia

TL;DR
This paper introduces TASTI, a trainable semantic index that uses dataset-wide embeddings to generate proxy scores for unstructured data, eliminating per-query proxies and significantly speeding up query processing.
Contribution
TASTI is a novel index that leverages semantic similarity to produce proxy scores without per-query training, reducing construction costs and improving query speed.
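To make the idea of similarity-based proxy scores concrete, here is a minimal sketch. It is not the authors' implementation. It assumes every record already has an embedding, that the expensive target labeler is run on only a small set of representative records, and that each remaining record takes a distance-weighted average of the scores of its nearest representatives. The furthest-point-first selection, the inverse-distance weighting, the random embeddings, and all names (`select_representatives`, `propagate_scores`, `expensive_labeler`) are illustrative assumptions, not details taken from this summary.

```python
import numpy as np


def select_representatives(embeddings, num_reps, seed=0):
    """Greedy furthest-point-first selection (an assumed selection rule)."""
    rng = np.random.default_rng(seed)
    first = int(rng.integers(len(embeddings)))
    chosen = [first]
    # Distance from every record to its nearest chosen representative so far.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(chosen) < num_reps:
        nxt = int(np.argmax(min_dist))  # record furthest from all chosen ones
        chosen.append(nxt)
        dist_to_nxt = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_nxt)
    return np.array(chosen)


def propagate_scores(embeddings, rep_idx, rep_scores, k=3, eps=1e-8):
    """Give each record the inverse-distance-weighted mean score of its k nearest representatives."""
    reps = embeddings[rep_idx]
    # Pairwise distances, shape (num_records, num_reps).
    dists = np.linalg.norm(embeddings[:, None, :] - reps[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    near_d = np.take_along_axis(dists, nearest, axis=1)
    weights = 1.0 / (near_d + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * rep_scores[nearest]).sum(axis=1)


# Synthetic stand-in data: random vectors instead of learned embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 8))


def expensive_labeler(i):
    """Hypothetical stand-in for a costly DNN or human labeler."""
    return float(embeddings[i, 0] > 0)


reps = select_representatives(embeddings, num_reps=20)
rep_scores = np.array([expensive_labeler(i) for i in reps])  # only 20 labeler calls
proxy_scores = propagate_scores(embeddings, reps, rep_scores, k=3)
```

In this sketch the labeler is called 20 times for 200 records, and the resulting `proxy_scores` could be reused by any downstream query over the same labeler output, which is the sense in which no per-query proxy model is trained.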
Findings
TASTI reduces index construction costs by up to 10x.
TASTI accelerates query processing by up to 24x.
A theoretical analysis ties guarantees on query accuracy to the embedding error.
Abstract
Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, SUPG) train a query-specific proxy model to approximate large target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity…
Taxonomy
Topics: Advanced Graph Neural Networks · Data Quality and Management · Data Management and Algorithms
