Pylon: Semantic Table Union Search in Data Lakes

Tianji Cong; Fatemeh Nargesian; H. V. Jagadish

arXiv:2301.04901·cs.DB·January 16, 2023

TL;DR

This paper introduces Pylon, a data-driven, unsupervised learning approach for discovering union-able tables in data lakes. It embeds columns so that semantically similar columns lie close together, which improves both retrieval accuracy and query speed over existing methods.

Contribution

Pylon presents a novel self-supervised contrastive learning method to identify semantically similar columns across heterogeneous datasets in data lakes.

Findings

01. Achieves 16% higher precision in table union discovery

02. Improves recall by 17% over existing methods

03. Reduces query response time by a factor of 7

Abstract

The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem has become challenging as (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually a lack of a unified data model to capture the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding union-able tables. The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize union-able columns even if they are represented differently. In this paper, we propose a data-driven learning approach: specifically, an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning…
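The two ideas named in the abstract, contrastive learning over column representations and retrieval by embedding similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `info_nce`, `rank_columns`), the InfoNCE-style loss, the temperature value, and the toy two-dimensional vectors are all assumptions chosen for illustration. In practice the embeddings would come from a trained column encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor column embedding.

    Hypothetical helper: the loss is small when the anchor is more
    similar to its positive than to any of the negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive.
    return log_sum - logits[0]


def rank_columns(query, lake, k=2):
    """Return the names of the k lake columns most similar to the query."""
    scored = sorted(
        lake.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy embeddings (assumed values, not learned from any data).
lake = {"city": [1.0, 0.1], "price": [0.0, 1.0], "town": [0.9, 0.2]}
query = [1.0, 0.0]
top_two = rank_columns(query, lake, k=2)
```

With these toy vectors, the query column sits closest to `city` and `town` and far from `price`, which is the behavior a contrastively trained encoder is meant to produce for columns that can be unioned despite differing surface representations.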

Code & Models

Repositories

superctj/pylon (PyTorch, official)

Taxonomy

Topics: Data Quality and Management · Time Series Analysis and Forecasting · Data-Driven Disease Surveillance

Methods: Contrastive Learning