DPDisc: From Factoid Questions to Data Product Requests for Open-World Data Product Discovery over Tables and Text

Liangliang Zhang; Nandana Mihindukulasooriya; Niharika S. D'Souza; Sola Shirai; Sarthak Dash; Yao Ma; Horst Samulowitz

arXiv:2510.21737·cs.IR·March 19, 2026

TL;DR

DPDisc introduces the first large-scale benchmark for discovering coherent data products from hybrid table-text corpora, enabling automated retrieval of related data assets for business use cases.

Contribution

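The paper presents DPDisc, a benchmark for data product discovery over tables and text, and DPForge, an automated pipeline that builds it by repurposing table-text QA datasets. Together they fill a gap left by existing datasets, which target single factoid questions over individual tables.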

Findings

01 Baseline retrieval methods show performance gaps across domains.

02 DPDisc enables evaluation of structure-aware data discovery methods.

03 The benchmark facilitates future research in automated data product retrieval.

Abstract

Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery is of great industry interest, as it enables efficient data access in large data lakes and supports analytical workflows. However, no benchmark currently exists for data product discovery over hybrid table-text corpora. Existing datasets focus on answering single factoid questions over individual tables rather than assembling multiple related data assets into coherent products. To address this gap, we present DPDisc, the first large-scale benchmark for data product discovery, where systems must retrieve coherent collections of tables and passages to satisfy high-level Data Product Requests (DPRs). We introduce DPForge, an automated pipeline that systematically repurposes table-text QA datasets by clustering related tables and passages into coherent data products,…
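To make the retrieval task concrete, here is a minimal sketch of how a retrieved collection of tables and passages could be compared with a ground-truth data product. It assumes simple set-based precision, recall, and F1 over asset identifiers; the metrics DPDisc actually reports are not stated in this excerpt, and the function name, class name, and asset IDs below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetScores:
    precision: float
    recall: float
    f1: float


def score_retrieved_product(retrieved: set[str], gold: set[str]) -> SetScores:
    """Compare retrieved asset IDs with the gold data product's asset IDs.

    Assumption: a data product is treated as an unordered set of table and
    passage identifiers, and a retrieved asset counts only on an exact ID match.
    """
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return SetScores(precision, recall, f1)


# Hypothetical asset IDs, invented for illustration only.
gold = {"table:orders", "table:customers", "passage:returns_policy"}
retrieved = {"table:orders", "table:customers", "table:weather"}

scores = score_retrieved_product(retrieved, gold)
print(scores)
```

Scoring the whole collection at once, rather than each asset independently, reflects the framing in the abstract that a system must assemble a coherent set of related assets instead of answering a single factoid question.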

Code & Models

Datasets

ibm-research/data-product-benchmark
dataset · 194 downloads