Pylon: Semantic Table Union Search in Data Lakes
Tianji Cong, Fatemeh Nargesian, H. V. Jagadish

TL;DR
This paper introduces Pylon, a data-driven, unsupervised learning approach for discovering union-able tables in data lakes by embedding columns based on their semantic similarity, significantly improving retrieval accuracy and speed.
Contribution
Pylon presents a novel self-supervised contrastive learning method to identify semantically similar columns across heterogeneous datasets in data lakes.
Findings
Achieves 16% higher precision in table union search than existing methods
Improves recall by 17% over existing methods
Reduces query response time by a factor of 7
Abstract
The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem has become challenging as (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually a lack of a unified data model to capture the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding union-able tables. The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize union-able columns even if they are represented differently. In this paper, we propose a data-driven learning approach: specifically, an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning…
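To make the two ingredients named in the abstract concrete (contrastive representation learning and embedding retrieval), here is a minimal sketch in plain Python. It is not Pylon's implementation: the abstract excerpt does not specify the encoder, the loss, or the index, so this uses a generic InfoNCE-style loss and brute-force cosine ranking as stand-ins. The function names, the temperature value, and the toy 2-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def info_nce_loss(views_a, views_b, temperature=0.1):
    """Generic InfoNCE-style contrastive loss (an assumed stand-in).

    views_a[i] and views_b[i] are two embeddings of the same column
    (a positive pair); every views_b[j] with j != i acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the positive among all candidates.
        total += log_denominator - logits[i]
    return total / len(views_a)


def top_k_columns(query, indexed, k=2):
    """Rank indexed column embeddings by cosine similarity to the query."""
    scored = [(name, cosine(query, emb)) for name, emb in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: hypothetical embeddings standing in for a learned column encoder.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])

lake = {"city": [0.7, 0.7], "price": [0.0, 1.0], "town": [1.0, 0.0]}
ranked = top_k_columns([1.0, 0.1], lake, k=2)
```

The loss is small when the two views of each column land close together and large when they are mismatched, which is the signal a contrastive encoder would be trained on. At query time only the ranking step is needed. A real system would likely replace the brute-force scan with an approximate nearest-neighbor index, though that choice is an assumption here as well.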
Taxonomy
Topics: Data Quality and Management · Time Series Analysis and Forecasting · Data-Driven Disease Surveillance
Methods: Contrastive Learning
