TL;DR
This paper introduces Valentine, an open-source suite for evaluating schema matching techniques tailored for dataset discovery, providing a standardized framework, datasets, and comprehensive analysis to improve data lake exploration.
Contribution
Valentine offers a standardized, extensible platform with new datasets and evaluation metrics, enabling systematic comparison of schema matching methods for dataset discovery.
Findings
Schema matching quality varies across techniques.
Certain methods excel in specific discovery scenarios.
The evaluation highlights strengths and weaknesses of existing algorithms.
Abstract
Data scientists today search large data lakes to discover and integrate datasets. To bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays, schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad hoc fashion due to the lack of openly available datasets with ground truth, reference method implementations, and…
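To make the idea of "matching pairs of columns" and "ranking" relationships concrete, here is a minimal sketch of an instance-based matcher that scores every source/target column pair by the Jaccard overlap of their values. It is an illustrative baseline only: the function names, the toy tables, and the choice of Jaccard similarity are assumptions made for this example, and it is not Valentine's API or any specific method from the paper.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_column_matches(source, target):
    """Score every (source column, target column) pair by value overlap
    and return the pairs sorted from most to least similar."""
    scores = [
        (s_col, t_col, jaccard(s_vals, t_vals))
        for (s_col, s_vals), (t_col, t_vals) in product(source.items(), target.items())
    ]
    return sorted(scores, key=lambda row: row[2], reverse=True)


# Hypothetical toy tables, each a dict of column name -> list of values.
customers = {"cust_id": [1, 2, 3, 4], "city": ["Delft", "Paris", "Rome", "Oslo"]}
orders = {"customer": [1, 2, 3, 4, 5], "ship_city": ["Paris", "Rome", "Lima", "Oslo"]}

ranked = rank_column_matches(customers, orders)
for s_col, t_col, score in ranked:
    print(f"{s_col} ~ {t_col}: {score:.2f}")
```

A ranked list like this, rather than a single best match per column, is closer to how the abstract describes the role of matching in dataset discovery: the scores can be used to rank candidate relationships between datasets. Judging such a ranking requires ground truth, which is the gap the abstract points to.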
