EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes
Tim Otto

TL;DR
EasyTUS is an LLM-based framework that finds tables in a data lake that can be unioned with a given query table, with faster data preparation, faster query processing, and higher MAP than prior state-of-the-art methods.
Contribution
This work introduces EasyTUS, a novel LLM-based framework for scalable Table Union Search, and TUSBench, a standardized benchmarking environment for systematic evaluation.
Findings
Up to 34.3% improvement in MAP over state-of-the-art methods.
Up to 79.2x faster data preparation.
Up to 7.7x faster query processing.
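For readers unfamiliar with the MAP metric cited above, the following is a minimal sketch of mean average precision using its standard information-retrieval definition. The function names and toy data are illustrative assumptions; this page does not state the exact variant the paper uses (for example, a rank cutoff k).

```python
def average_precision(ranked, relevant):
    """Average precision for one query.

    ranked   -- list of retrieved table names, best first
    relevant -- set of table names that are truly unionable (ground truth)
    """
    hits = 0
    precision_sum = 0.0
    for rank, table in enumerate(ranked, start=1):
        if table in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(results):
    """Mean of average precision over (ranked, relevant) pairs, one per query."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in results]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical table names.
toy_results = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),            # AP = (1/2) / 1
]
toy_map = mean_average_precision(toy_results)
```

A higher MAP means that unionable tables tend to appear nearer the top of the ranked result list, averaged over all query tables.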
Abstract
Data lakes enable easy maintenance of heterogeneous data in its native form. While this flexibility can accelerate data ingestion, it shifts the complexity of data preparation and query processing to data discovery tasks. One such task is Table Union Search (TUS), which identifies tables that can be unioned with a given input table. In this work, we present EasyTUS, a comprehensive framework that leverages Large Language Models (LLMs) to perform efficient and scalable Table Union Search across data lakes. EasyTUS implements the search pipeline as three modular steps: Table Serialization for consistent formatting and sampling, Table Representation that utilizes LLMs to generate embeddings, and Vector Search that leverages approximate nearest neighbor indexing for semantic matching. To enable reproducible and systematic evaluation, in this paper, we also introduce TUSBench, a novel…
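The three steps named in the abstract can be sketched as a toy pipeline. This is an illustration under stated assumptions, not the EasyTUS implementation: the function names and example tables are hypothetical, a bag-of-tokens vector stands in for the LLM-generated embedding, and an exact cosine scan stands in for the approximate nearest neighbor index.

```python
import math
from collections import Counter


def serialize_table(columns, rows, sample_size=3):
    """Table Serialization: render the header plus a sample of rows as text."""
    lines = [" | ".join(columns)]
    for row in rows[:sample_size]:  # simple head sampling (an assumption)
        lines.append(" | ".join(str(value) for value in row))
    return "\n".join(lines)


def embed(text):
    """Table Representation stand-in: L2-normalised bag-of-tokens vector.

    A real system would call an LLM embedding model here instead.
    """
    counts = Counter(text.lower().replace("|", " ").split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two normalised sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def search(query_vec, index, k=2):
    """Vector Search stand-in: exact scan instead of an ANN index."""
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in index.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Hypothetical toy data lake: table name -> (columns, rows).
lake = {
    "cities_eu": (["city", "country", "population"],
                  [["Berlin", "Germany", 3600000], ["Paris", "France", 2100000]]),
    "cities_us": (["city", "country", "population"],
                  [["Austin", "USA", 960000], ["Boston", "USA", 650000]]),
    "products": (["sku", "price", "stock"],
                 [["A1", 9.99, 12], ["B2", 19.5, 3]]),
}

# Offline step: serialize and embed every table once.
index = {name: embed(serialize_table(cols, rows)) for name, (cols, rows) in lake.items()}

# Online step: embed the query table and retrieve the closest candidates.
query_vec = embed(serialize_table(["city", "country", "population"],
                                  [["Rome", "Italy", 2800000]]))
candidates = search(query_vec, index, k=2)
```

Keeping the steps as separate functions mirrors the modular design described in the abstract: the embedding model or the index can be swapped without touching the other stages.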
Taxonomy
Topics: Data Quality and Management · Machine Learning in Healthcare · Time Series Analysis and Forecasting
