DPDisc: From Factoid Questions to Data Product Requests for Open-World Data Product Discovery over Tables and Text
Liangliang Zhang, Nandana Mihindukulasooriya, Niharika S. D'Souza, Sola Shirai, Sarthak Dash, Yao Ma, Horst Samulowitz

TL;DR
DPDisc introduces the first large-scale benchmark for discovering coherent data products from hybrid table-text corpora, enabling automated retrieval of related data assets for business use cases.
Contribution
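The paper presents DPDisc, a benchmark for data product discovery over tables and text, and DPForge, an automated pipeline that builds it by repurposing table-text QA datasets. Existing datasets target single factoid questions over individual tables, so they do not cover this task.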
Findings
Baseline retrieval methods show performance gaps across domains.
DPDisc enables evaluation of structure-aware data discovery methods.
The benchmark facilitates future research in automated data product retrieval.
Abstract
Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery is of great industry interest, as it enables efficient data access in large data lakes and supports analytical workflows. However, no benchmark currently exists for data product discovery over hybrid table-text corpora. Existing datasets focus on answering single factoid questions over individual tables rather than assembling multiple related data assets into coherent products. To address this gap, we present DPDisc, the first large-scale benchmark for data product discovery, where systems must retrieve coherent collections of tables and passages to satisfy high-level Data Product Requests (DPRs). We introduce DPForge, an automated pipeline that systematically repurposes table-text QA datasets by clustering related tables and passages into coherent data products,…
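To make the task format concrete, here is a minimal sketch of how a retrieved collection of tables and passages could be scored against a ground-truth data product for one DPR. The metrics used in the paper are not stated in this excerpt, so set-based precision, recall, and F1 are an assumption, and the `Asset` type, the `set_scores` function, and all identifiers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A retrievable unit in the corpus (hypothetical type for illustration)."""
    kind: str      # "table" or "passage"
    asset_id: str  # illustrative identifier, not taken from the benchmark


def set_scores(retrieved, gold):
    """Return set-based (precision, recall, F1) for one request.

    Assumption: a data product is treated as an unordered set of assets,
    and a retrieved asset counts as correct only on an exact match.
    """
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Made-up example: the gold data product has two tables and one passage.
gold = {Asset("table", "t1"), Asset("table", "t2"), Asset("passage", "p1")}
retrieved = [
    Asset("table", "t1"),
    Asset("passage", "p1"),
    Asset("passage", "p9"),
    Asset("table", "t7"),
]

precision, recall, f1 = set_scores(retrieved, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Scoring the whole retrieved set, rather than a single answer, reflects the difference the abstract draws between factoid QA and assembling multiple related assets into one product.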
