Efficient and Effective Table-Centric Table Union Search in Data Lakes
Yongkang Sun, Zhihao Ding, Huiqiang Wang, Reynold Cheng, Jieming Shi

TL;DR
The paper introduces TACTUS, a novel table-centric approach for union search in data lakes that improves accuracy and efficiency by focusing on holistic table-level semantics and adaptive candidate retrieval.
Contribution
It proposes a table-first search method with new table embedding techniques, enhancing unionability scoring and search efficiency over prior column-centric approaches.
Findings
Significantly improves search result quality over prior column-centric approaches.
Runs about an order of magnitude faster.
Remains effective in real-world data lake scenarios.
Abstract
In data lakes, information on the same subject is often fragmented across multiple tables. Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or ground-truth labels. Existing methods are mainly column-centric: they focus on modeling column unionability scores using column embeddings, which are then used throughout the search process for column matching, filtering, and aggregation. However, this overlooks holistic table-level semantics, which may result in suboptimal rankings and inefficiencies. We introduce TACTUS, a novel table-centric method for table union search. Unlike prior work that searches from columns to tables, we search in a table-first way and examine columns only in the final step. During offline processing, we directly generate table embeddings for holistic, table-level…
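The table-first flow described in the abstract can be illustrated with a minimal Python sketch. It is not the TACTUS implementation: the abstract as shown here is truncated and does not say how embeddings are built or how columns are scored. Cosine similarity, the greedy one-to-one column matching, and all names (`table_first_search`, `table_emb`, `col_embs`, `n_candidates`) are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def column_score(query_cols, cand_cols):
    """Hypothetical column-level score: greedy one-to-one matching of columns
    by cosine similarity, averaged over the number of query columns."""
    pairs = sorted(
        ((cosine(q, c), i, j)
         for i, q in enumerate(query_cols)
         for j, c in enumerate(cand_cols)),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for sim, i, j in pairs:
        if i not in used_q and j not in used_c:
            used_q.add(i)
            used_c.add(j)
            total += sim
    return total / len(query_cols) if query_cols else 0.0


def table_first_search(query, lake, k, n_candidates):
    """Stage 1: shortlist tables by table-level embedding similarity.
    Stage 2: look at columns only for the shortlisted tables, then return top-k."""
    shortlist = sorted(
        lake,
        key=lambda name: cosine(query["table_emb"], lake[name]["table_emb"]),
        reverse=True,
    )[:n_candidates]
    reranked = sorted(
        shortlist,
        key=lambda name: column_score(query["col_embs"], lake[name]["col_embs"]),
        reverse=True,
    )
    return reranked[:k]


# Toy data: the embeddings below are made up, not produced by any real model.
query = {"table_emb": [1.0, 0.0], "col_embs": [[1.0, 0.0], [0.0, 1.0]]}
lake = {
    "A": {"table_emb": [0.9, 0.1], "col_embs": [[1.0, 0.1], [0.1, 1.0]]},
    "B": {"table_emb": [0.8, 0.2], "col_embs": [[1.0, 0.0], [0.0, 1.0]]},
    "C": {"table_emb": [0.0, 1.0], "col_embs": [[1.0, 0.0], [0.0, 1.0]]},
}
print(table_first_search(query, lake, k=2, n_candidates=2))
```

In this toy lake, table C has columns that match the query perfectly but a dissimilar table-level embedding, so it never reaches the column stage. That is the contrast with a column-first pipeline, which would score C's columns before considering the table as a whole.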
Taxonomy
Topics: Data Quality and Management · Time Series Analysis and Forecasting · Data Visualization and Analytics
