CHORUS: Foundation Models for Unified Data Discovery and Exploration
Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu

TL;DR
This paper shows that foundation models, especially large language models, perform strongly on data discovery and exploration tasks, outperforming task-specific models and often human experts, which suggests a unified approach to data management.
Contribution
The paper introduces a foundation-model-based approach for data discovery tasks, showing its superiority over task-specific models and establishing a new direction for unified data management.
Findings
Foundation models outperform task-specific models on key data discovery tasks.
The approach often surpasses human-expert performance.
The method generalizes across multiple foundation models and handles non-determinism.
Abstract
We apply foundation models to data discovery and exploration tasks. Foundation models include large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and so the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach including generalizability to several foundation models and the impact of non-determinism on the outputs. All in all, this suggests a future direction in which…
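As an illustration of one of the three tasks, column-type annotation, the sketch below shows one way a column could be serialized into a prompt, with repeated model calls aggregated by majority vote to dampen non-determinism. This is not the paper's implementation: the prompt layout, the function names (`build_column_type_prompt`, `majority_vote`, `annotate_column`), the `n_samples` parameter, and the voting strategy are all assumptions made for illustration, and `model` stands for any hypothetical callable that maps a prompt string to a reply string.

```python
from collections import Counter


def build_column_type_prompt(column_name, sample_values, candidate_types):
    """Serialize one column into a plain-text prompt (hypothetical layout)."""
    values = ", ".join(str(v) for v in sample_values)
    options = ", ".join(candidate_types)
    return (
        f"Column name: {column_name}\n"
        f"Sample values: {values}\n"
        f"Choose exactly one type from: {options}\n"
        "Type:"
    )


def majority_vote(answers):
    """Return the most common normalized answer and its share of the votes."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None, 0.0
    label, count = Counter(normalized).most_common(1)[0]
    return label, count / len(normalized)


def annotate_column(model, column_name, sample_values, candidate_types, n_samples=5):
    """Query the model n_samples times and aggregate the replies.

    `model` is an assumed callable: prompt string in, reply string out.
    """
    prompt = build_column_type_prompt(column_name, sample_values, candidate_types)
    answers = [model(prompt) for _ in range(n_samples)]
    return majority_vote(answers)
```

The vote share returned alongside the label gives a rough indication of how stable the answer is across samples; a low share could be used to flag columns for manual review.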
Taxonomy
Topics: Advanced Database Systems and Queries · Scientific Computing and Data Management · Distributed and Parallel Computing Systems
