Featurized-Decomposition Join: Low-Cost Semantic Joins with Guarantees
Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran

TL;DR
This paper introduces Featurized-Decomposition Join (FDJ), a method that reduces the cost of semantic joins over large datasets by automatically extracting features from text records and composing them into logical expressions. It preserves the quality guarantees of existing approaches while cutting cost by up to 10x.
Contribution
FDJ automatically extracts features from text records and composes them into logical expressions, significantly reducing semantic join costs while maintaining quality guarantees.
Findings
Up to 10x cost reduction compared to state-of-the-art methods
Maintains the same quality guarantees as existing approaches
Effective feature extraction and logical composition for semantic joins
Abstract
Large Language Models (LLMs) are increasingly used within data systems to process large datasets with text fields. A broad class of such tasks involves a semantic join: joining two tables based on a natural language predicate per pair of tuples, evaluated using an LLM. Semantic joins generalize tasks such as entity matching and record categorization, as well as more complex text understanding tasks. A naive implementation is expensive, as it requires invoking an LLM for every pair of rows in the cross product. Existing approaches mitigate this cost by first applying embedding-based semantic similarity to filter candidate pairs, deferring to an LLM only when similarity scores are deemed inconclusive. However, these methods yield limited gains in practice, since semantic similarity may not reliably predict the join outcome. We propose Featurized-Decomposition Join (FDJ for short), a…
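To make the two baselines in the abstract concrete, here is a minimal Python sketch of a naive cross-product join and of an embedding-filter cascade. It does not show FDJ itself. All names (`naive_join`, `cascade_join`, `embed`, `llm_predicate`) and the `low`/`high` thresholds are hypothetical, and accepting or rejecting pairs outright outside the threshold band is an assumption made for illustration, not a detail taken from the paper.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def naive_join(
    left: List[str],
    right: List[str],
    llm_predicate: Callable[[str, str], bool],  # hypothetical stand-in for an LLM call
) -> List[Tuple[int, int]]:
    """Evaluate the predicate on every pair: len(left) * len(right) LLM calls."""
    return [
        (i, j)
        for i, l in enumerate(left)
        for j, r in enumerate(right)
        if llm_predicate(l, r)
    ]


def cascade_join(
    left: List[str],
    right: List[str],
    embed: Callable[[str], Vector],  # hypothetical embedding function
    llm_predicate: Callable[[str, str], bool],
    low: float,   # assumed threshold: at or below this, reject without an LLM call
    high: float,  # assumed threshold: at or above this, accept without an LLM call
) -> Tuple[List[Tuple[int, int]], int]:
    """Use similarity to decide clear cases; call the LLM only in between."""
    left_vecs = [embed(x) for x in left]
    right_vecs = [embed(x) for x in right]
    matches: List[Tuple[int, int]] = []
    llm_calls = 0
    for i, l in enumerate(left):
        for j, r in enumerate(right):
            score = cosine(left_vecs[i], right_vecs[j])
            if score >= high:
                matches.append((i, j))  # confident match
            elif score > low:
                llm_calls += 1  # inconclusive: defer to the LLM
                if llm_predicate(l, r):
                    matches.append((i, j))
            # score <= low: confident non-match, skipped
    return matches, llm_calls
```

The cascade saves LLM calls only to the extent that similarity scores fall outside the `low` to `high` band, which is the limitation the abstract points to: when similarity does not predict the join outcome, most pairs remain inconclusive.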
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Advanced Text Analysis Techniques
