Empowering Tabular Data Preparation with Language Models: Why and How?
Mengshi Chen, Yuxiang Sun, Tengchao Li, Jianwei Wang, Kai Wang, Xuemin Lin, Ying Zhang, Wenjie Zhang

TL;DR
This paper systematically explores how Large Language Models can be effectively utilized across all phases of tabular data preparation, addressing current challenges and proposing integrated approaches.
Contribution
It provides a comprehensive analysis of the role of LMs in data acquisition, integration, cleaning, and transformation for tabular data preparation.
Findings
LMs can automate complex data cleaning tasks
Integrated pipelines enhance data preparation efficiency
Key advancements in LM applications for data tasks
Abstract
Data preparation is a critical step in enhancing the usability of tabular data and thus boosting downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to the tasks involved. Recent advances in Language Models (LMs), especially Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases remain to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation processes, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other…
Taxonomy
Topics: Data Quality and Management · Computational and Text Analysis Methods · Topic Modeling
