LakeBench: Benchmarks for Data Discovery over Data Lakes
Kavitha Srinivas, Julian Dolby, Ibrahim Abdelaziz, Oktie Hassanzadeh, Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Subhajit Chaudhury, Horst Samulowitz

TL;DR
This paper introduces LakeBench, a set of benchmarks for data discovery tasks in data lakes. It evaluates four publicly available tabular foundational models on these tasks and highlights the need for models specialized for enterprise data management.
Contribution
The paper develops and releases multiple benchmarks for data discovery in data lakes using diverse real-world datasets, and evaluates existing models' performance on these tasks.
Findings
Existing models perform poorly on data discovery benchmarks
Benchmarks reveal significant room for improvement in tabular data models
Establishes a new standard for evaluating data discovery capabilities
Abstract
Within enterprises, there is a growing need to navigate data lakes intelligently, with a particular focus on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private datasets. In LakeBench, we develop multiple benchmarks for these tasks using tables drawn from a diverse set of data sources, such as government data from CKAN, Socrata, and the European Central Bank. We compare the performance of four publicly available tabular foundational models on these tasks. None of the existing models had been trained on the data discovery tasks that we developed for this benchmark; not surprisingly, their performance shows significant room for improvement. The…
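To make the three table relationships named in the abstract concrete, here is a minimal Python sketch. It assumes simple set-based definitions: a pair of columns is a join candidate when their distinct values overlap enough, two tables are unionable when they share the same column names, and one table is a subset of another when all its rows appear in the other. These definitions, the function names, the 0.5 threshold, and the toy tables are illustrative assumptions, not the labelling procedure used to build LakeBench.

```python
from typing import Dict, List, Set, Tuple

# Assumed toy representation: column name -> list of hashable cell values.
Table = Dict[str, List[object]]


def containment(a: List[object], b: List[object]) -> float:
    """Fraction of distinct values in `a` that also appear in `b`."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def joinable_columns(t1: Table, t2: Table, threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Column pairs whose value overlap suggests a candidate join key."""
    return [
        (c1, c2)
        for c1, v1 in t1.items()
        for c2, v2 in t2.items()
        if containment(v1, v2) >= threshold
    ]


def is_unionable(t1: Table, t2: Table) -> bool:
    """Toy criterion: both tables have exactly the same column names."""
    return set(t1) == set(t2)


def rows(t: Table) -> Set[tuple]:
    """Set of row tuples, with columns taken in sorted-name order."""
    return set(zip(*(t[c] for c in sorted(t))))


def is_subset(t1: Table, t2: Table) -> bool:
    """True if the schemas match and every row of t1 also occurs in t2."""
    return is_unionable(t1, t2) and rows(t1) <= rows(t2)


# Hypothetical example tables.
cities = {"city": ["Paris", "Rome", "Oslo"], "country": ["FR", "IT", "NO"]}
more_cities = {"city": ["Paris", "Rome", "Oslo", "Bern"], "country": ["FR", "IT", "NO", "CH"]}
population = {"name": ["Paris", "Rome", "Bern"], "pop_m": [2.1, 2.8, 0.1]}

print(joinable_columns(cities, population))  # [('city', 'name')]
print(is_unionable(cities, more_cities))     # True
print(is_subset(cities, more_cities))        # True
```

Real data lakes rarely have matching column names or clean values, which is one reason exact set-based checks like these fall short and learned table representations are of interest for these tasks.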
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Big Data and Business Intelligence
Methods: None
