DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
Faisal Chowdhury, Sola Shirai, Sarthak Dash, Nandana Mihindukulasooriya, Horst Samulowitz

TL;DR
DP-Bench is the first benchmark for evaluating automatic data product creation systems. It builds on existing ELT (Extract-Load-Transform) and Text-to-SQL benchmarks and comes with LLM-based baseline approaches.
Contribution
It presents the first benchmark for automatic data product creation, enabling standardized evaluation and comparison of different approaches.
Findings
The benchmark facilitates systematic evaluation of data product creation systems.
Baseline LLM approaches demonstrate the potential of automation in data product generation.
DP-Bench is publicly available for research and development.
Abstract
A data product is created with the intention of solving a specific problem, addressing a specific business use case, or meeting a particular need, going beyond just serving data as a raw asset. Data products enable end users to gain greater insights about their data. Since the concept was first introduced over a decade ago, there has been considerable work, especially in industry, to create data products manually or semi-automatically. However, hardly any benchmark exists for evaluating automatic data product creation. In this work, we present DP-Bench, a first-of-its-kind benchmark for this task. We describe how this benchmark was created by taking advantage of existing work in ELT (Extract-Load-Transform) and Text-to-SQL benchmarks. We also propose a number of LLM-based approaches that can be considered as baselines for generating data products automatically. We make the DP-Bench…
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Research Data Management Practices
