PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?
Jingzhe Xu, Rui Wang, Jiannan Wang, and Guoliang Li

TL;DR
PrepBench is a new benchmark designed to evaluate how well current large language models can support natural language-driven data preparation tasks, highlighting existing challenges and gaps.
Contribution
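The paper introduces PrepBench, a benchmark for assessing LLM capabilities in NL-driven data preparation. It targets characteristics that existing code generation benchmarks do not capture: ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation.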
Findings
Current LLMs struggle with complex, multi-step data preparation tasks.
Nearly half of the tasks require over 100 lines of Python code.
Realizing NL-driven data preparation remains a significant challenge for state-of-the-art models.
Abstract
Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven…
