Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée Miller

TL;DR
This paper introduces Starmie, a novel framework for dataset discovery in data lakes that uses contrastive learning of contextualized column representations, significantly improving search effectiveness and efficiency.
Contribution
Starmie employs a contrastive learning approach with pre-trained language models to capture semantic information for dataset discovery, and introduces a scalable indexing method for fast search.
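To make the contrastive idea concrete, the following is a minimal sketch of a generic InfoNCE-style loss over column embeddings. It is an illustration under stated assumptions, not the paper's exact training objective: the function names `cosine` and `info_nce_loss`, the temperature value, and the toy vectors are all hypothetical, and a real column encoder would produce the embeddings from a pre-trained language model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    positives[i] is assumed to be an augmented view of the same column as
    anchors[i]; every other positives[j] serves as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch: two columns, each with a matching augmented view.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_views = [[0.9, 0.1], [0.1, 0.9]]
swapped_views = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned_views)
loss_swapped = info_nce_loss(anchors, swapped_views)
```

The loss is small when each column's embedding sits close to its own augmented view and far from the other columns in the batch, which is the property a contrastively trained column encoder is meant to have.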
Findings
Outperforms existing solutions in table union search effectiveness by 6.8 in MAP and recall.
Uses HNSW index for 3,000X faster query processing compared to linear scan.
The HNSW index also achieves a 400X performance gain over an LSH index, the previous state-of-the-art solution for data lake indexing.
Abstract
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical evaluation results on real table benchmark datasets show that Starmie outperforms the…
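The scoring idea in the abstract can be sketched as follows. Column unionability is the cosine similarity of two column embeddings, as the abstract states. The table-level aggregation shown here is only one possible design choice: a greedy one-to-one column matching, used as a simple stand-in and not claimed to be the method the paper evaluates. The names `cosine` and `table_unionability`, the `threshold` parameter, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Column unionability score: cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def table_unionability(query_cols, candidate_cols, threshold=0.0):
    """Sum of similarities under a greedy one-to-one column matching.

    Assumption: greedy matching is a simplification chosen for brevity;
    other aggregations of column scores are equally possible.
    """
    # All column pairs, highest similarity first.
    pairs = sorted(
        (
            (cosine(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(candidate_cols)
        ),
        reverse=True,
    )
    used_query, used_candidate = set(), set()
    score = 0.0
    for sim, i, j in pairs:
        if sim <= threshold:
            break  # remaining pairs are too dissimilar to count
        if i in used_query or j in used_candidate:
            continue  # each column may be matched at most once
        used_query.add(i)
        used_candidate.add(j)
        score += sim
    return score


# Toy column embeddings for a query table and two candidate tables.
query = [[1.0, 0.0], [0.0, 1.0]]
candidate_a = [[0.0, 1.0], [1.0, 0.0]]  # same columns, different order
candidate_b = [[1.0, 1.0]]              # one loosely related column

score_a = table_unionability(query, candidate_a)
score_b = table_unionability(query, candidate_b)
```

In this toy case `candidate_a` scores higher than `candidate_b` because both of its columns find a close match in the query table, regardless of column order.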
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Biomedical Text Mining and Ontologies
