Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?
Matteo Spreafico, Ludovica Tassini, Camilla Sancricca, Cinzia Cappiello

TL;DR
This paper evaluates the effectiveness of large language models in supporting data preparation tasks like profiling and cleaning, comparing them to traditional tools, and assessing their practical utility through user studies.
Contribution
It introduces a comprehensive evaluation of large language models' capabilities in data preparation, including a custom quality model validated by user feedback.
Findings
Large language models can assist in data profiling and cleaning tasks.
Support from LLMs is comparable to that offered by traditional data preparation tools.
User study validates the practical usefulness of LLM support in data tasks.
Abstract
Large language models have recently demonstrated exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring to test these capabilities, we considered data preparation, a critical yet often labor-intensive step in data-driven processes. This paper investigates whether large language models can effectively support users in selecting and automating data preparation tasks. To this end, we considered both general-purpose and fine-tuned tabular large language models. We prompted these models with poor-quality datasets and measured their ability to perform tasks such as data profiling and cleaning. We also compared the support provided by large language models with that offered by traditional data preparation tools. To evaluate the capabilities of large language models, we developed a custom-designed quality model that has been…
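To make the profiling and cleaning tasks named in the abstract concrete, the following is a minimal sketch of what such tasks might compute on a small poor-quality table. It is not the paper's method or evaluation setup: the toy rows, the `MISSING_TOKENS` set, and the `profile` and `clean` helpers are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy dataset with typical quality problems (not from the paper).
ROWS = [
    {"id": "1", "city": "Milan", "age": "34"},
    {"id": "2", "city": "", "age": "29"},        # missing city
    {"id": "3", "city": "Rome", "age": "n/a"},   # missing age, encoded as text
    {"id": "1", "city": "Milan", "age": "34"},   # exact duplicate of the first row
]

# Assumed set of strings that should be treated as missing values.
MISSING_TOKENS = {"", "n/a", "na", "null", "none"}


def is_missing(value):
    """Return True if the value is None or one of the assumed missing markers."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def profile(rows):
    """Profiling step: count missing values per column and exact duplicate rows."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if is_missing(value):
                missing[column] += 1
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicate_rows": duplicates}


def clean(rows):
    """Cleaning step: drop exact duplicates and normalize missing markers to None."""
    cleaned, seen = [], set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip rows already emitted
        seen.add(key)
        cleaned.append({c: (None if is_missing(v) else v) for c, v in row.items()})
    return cleaned


if __name__ == "__main__":
    print(profile(ROWS))
    print(clean(ROWS))
```

A profiling report of this kind (missing counts, duplicate counts) is one plausible baseline against which a model's description of a dataset's quality problems could be checked; the paper's own quality model may differ.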
Taxonomy
Topics: Data Quality and Management · Artificial Intelligence in Healthcare and Education · Topic Modeling
