LAKEGEN: A LLM-based Tabular Corpus Generator for Evaluating Dataset Discovery in Data Lakes
Zhenwei Dai, Chuan Lei, Asterios Katsifodimos, Xiao Qin, Christos Faloutsos, Huzefa Rangwala

TL;DR
This paper introduces LAKEGEN, a novel LLM-based tool for generating realistic, domain-specific tabular datasets with joinability annotations, to improve the evaluation of dataset discovery methods in data lakes.
Contribution
LAKEGEN leverages large language models to create diverse, high-quality tabular benchmarks with annotated joinability, addressing limitations of existing datasets for dataset discovery evaluation.
Findings
LAKEGEN produces realistic, domain-specific tables with annotated joinability.
Generated datasets improve the evaluation of dataset discovery methods.
The approach addresses limitations of current open data corpora.
Abstract
How can we generate a large, realistic set of tables, along with their joinability relationships, to stress-test dataset discovery methods? Dataset discovery methods aim to automatically identify related data assets in a data lake. The development and evaluation of such solutions for customers from a wide range of business domains relies on diverse, high-quality, domain-specific tabular benchmarks. Large language models (LLMs) are trained on a wide variety of text data, which can provide a strong foundation of general and domain-specific knowledge. In this paper, we ask: can we leverage LLMs to generate a tabular benchmark adequate for evaluating dataset discovery solutions? In particular, we focus on the task of finding joinable tables, which is the cornerstone of virtually every dataset discovery method. Current corpora for evaluating dataset discovery methods are…
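To make the joinable-table task concrete, here is a minimal Python sketch of one common way to score whether two columns could be joined: set containment of their distinct values. This is an illustrative assumption, not LAKEGEN's method; the excerpt above does not say how the paper defines or annotates joinability. The function names, the 0.5 threshold, and the toy tables are all hypothetical.

```python
def containment(query_values, candidate_values):
    """Fraction of distinct query values that also occur in the candidate column."""
    q = set(query_values)
    c = set(candidate_values)
    if not q:
        return 0.0  # an empty query column cannot be joined on
    return len(q & c) / len(q)


def joinable_columns(query_table, candidate_table, threshold=0.5):
    """Return (query_col, candidate_col, score) triples whose containment
    meets the threshold, best first. Tables are dicts of column -> values.
    The threshold is an assumed value chosen for illustration only."""
    pairs = []
    for qname, qvals in query_table.items():
        for cname, cvals in candidate_table.items():
            score = containment(qvals, cvals)
            if score >= threshold:
                pairs.append((qname, cname, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy tables (made up for this sketch).
orders = {
    "customer_id": [1, 2, 3, 4],
    "amount": [10.0, 25.5, 7.0, 12.0],
}
customers = {
    "id": [1, 2, 3, 5, 6],
    "city": ["Delft", "Seattle", "Delft", "Auburn", "Boston"],
}

print(joinable_columns(orders, customers))
```

A benchmark with joinability annotations supplies the ground-truth column pairs against which a scorer like this would be checked.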
Taxonomy
Topics: Data Quality and Management · Research Data Management Practices · Semantic Web and Ontologies
