Valentine: Evaluating Matching Techniques for Dataset Discovery

Christos Koutras; George Siachamis; Andra Ionescu; Kyriakos Psarakis; Jerry Brons; Marios Fragkoulis; Christoph Lofi; Angela Bonifati; Asterios Katsifodimos

arXiv:2010.07386·cs.DB·February 16, 2021

TL;DR

This paper introduces Valentine, an open-source suite for evaluating schema matching techniques tailored for dataset discovery, providing a standardized framework, datasets, and comprehensive analysis to improve data lake exploration.

Contribution

Valentine offers a standardized, extensible platform with new datasets and evaluation metrics, enabling systematic comparison of schema matching methods for dataset discovery.

Findings

01. Schema matching quality varies across techniques.

02. Certain methods excel in specific discovery scenarios.

03. The evaluation highlights strengths and weaknesses of existing algorithms.

Abstract

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and…
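To make the notion of "matching pairs of columns between a source and a target schema" concrete, the following is a minimal sketch of one very simple instance-based matcher: it scores every source/target column pair by the Jaccard overlap of their values and ranks the pairs. It is not Valentine's API and not one of the methods the paper evaluates; the function names (`jaccard`, `rank_column_matches`) and the toy tables are assumptions made up for illustration.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_column_matches(source, target):
    """Score every (source column, target column) pair by value overlap.

    `source` and `target` map column names to lists of values.
    Returns (source_col, target_col, score) tuples, best match first.
    """
    scored = [
        (s, t, jaccard(source[s], target[t]))
        for s, t in product(source, target)
    ]
    return sorted(scored, key=lambda m: m[2], reverse=True)


# Hypothetical toy tables, invented for illustration only.
customers = {"name": ["Ann", "Bob", "Cy"], "city": ["Delft", "Lyon", "Oslo"]}
clients = {"full_name": ["Ann", "Bob", "Dee"], "town": ["Delft", "Lyon", "Rome"]}

ranked = rank_column_matches(customers, clients)
for s, t, score in ranked:
    print(f"{s} -> {t}: {score:.2f}")
```

A ranked list of scored pairs, rather than a single yes/no answer per pair, fits the abstract's point that matching now serves to indicate and rank inter-dataset relationships. Realistic matchers would also draw on column names, data types, and value distributions, which this sketch ignores.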

Code & Models

Repositories

delftdata/valentine (Official)
