Generative Benchmark Creation for Table Union Search
Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller

TL;DR
This paper introduces a novel AI-based method to generate structured benchmarks for semantic table union search, creating more challenging and scalable datasets to evaluate and improve data management techniques.
Contribution
The paper proposes a generative approach to creating structured data benchmarks, enabling scalable, robust, and semantically meaningful evaluation of table union search methods.
Findings
The new benchmark is more challenging than existing ones.
Recent search methods perform significantly worse on the new benchmark.
The benchmark allows detailed analysis of false positives and negatives.
Abstract
Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use…
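The distinction the abstract draws between tables that can be unioned and tables whose union is semantically coherent can be illustrated with a minimal sketch. It reads "same cardinality" as "same number of columns", which is an assumption, and the toy tables and the helper names `can_union` and `union` are hypothetical; none of them come from the paper.

```python
# Hypothetical toy tables as (header, rows) pairs. Not taken from the paper.
parks = (["name", "city"], [("Riverside Park", "Chicago"), ("Hyde Park", "London")])
more_parks = (["park", "location"], [("Stanley Park", "Vancouver")])
paintings = (["title", "artist"], [("Starry Night", "Vincent van Gogh")])


def can_union(a, b):
    """Syntactic check only: both tables have the same number of columns."""
    return len(a[0]) == len(b[0])


def union(a, b):
    """Positional union: keep the header of `a` and append the rows of `b`."""
    if not can_union(a, b):
        raise ValueError("tables have different numbers of columns")
    return (a[0], a[1] + b[1])


# Both unions succeed syntactically...
coherent = union(parks, more_parks)
incoherent = union(parks, paintings)

# ...but only `coherent` describes a single kind of entity (parks and
# their cities). `incoherent` mixes parks with paintings, and nothing in
# the syntactic check detects that. Deciding which candidate tables fall
# into the first group is the search problem a benchmark has to label.
```

The syntactic check accepts both pairs, which is why labeled ground truth about semantic coherence is needed to evaluate a table union search method.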
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Time Series Analysis and Forecasting
