Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines

Yuhang Ge; Yachuan Liu; Zhangyan Ye; Yuren Mao; Yunjun Gao

arXiv:2505.15874 · cs.IR · November 11, 2025

TL;DR

Text-to-Pipeline is a new task that translates natural language data preparation instructions into executable pipelines. The paper supports the task with a large-scale benchmark, uses it to expose limitations of current LLMs, and proposes an iterative agent baseline.

Contribution

The paper presents three contributions aimed at automating data preparation from natural language instructions: the Text-to-Pipeline task, the large-scale PARROT benchmark, and an execution-aware agent baseline.

Findings

01 LLMs struggle with multi-step compositional logic.

02 Semantic parameter grounding remains a challenge for LLMs.

03 Pipeline-Agent achieves state-of-the-art performance but still has significant gaps.

Abstract

Data preparation (DP) transforms raw data into a form suitable for downstream applications, typically by composing operations into executable pipelines. Building such pipelines is time-consuming and requires sophisticated programming skills, which poses a significant barrier for non-experts. To lower this barrier, we introduce Text-to-Pipeline, a new task that translates natural language (NL) data preparation instructions into DP pipelines, and PARROT, a large-scale benchmark to support systematic evaluation. To ensure realistic DP scenarios, PARROT is built by mining transformation patterns from production pipelines and instantiating them on 23,009 real-world tables, resulting in ~18,000 tasks spanning 16 core operators. Our empirical evaluation on PARROT reveals a critical failure mode in cutting-edge LLMs: they struggle not only with multi-step compositional logic but also with semantic parameter grounding.…
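To make the task concrete, the following is a minimal sketch of what "translating an NL instruction into a DP pipeline" could look like. Everything in it is an illustrative assumption rather than the paper's method or PARROT's format: the three operators (`filter`, `sort`, `select`), their parameter names, the `run_pipeline` helper, and the sample table are all hypothetical, and the pipeline is written by hand where a model would be expected to generate it.

```python
from typing import Any, Callable

Row = dict[str, Any]
Table = list[Row]

# Hypothetical operators; these are NOT PARROT's 16 core operators.
def op_filter(table: Table, column: str, value_gt: float) -> Table:
    """Keep rows whose `column` value is strictly greater than `value_gt`."""
    return [row for row in table if row[column] > value_gt]

def op_sort(table: Table, by: str, descending: bool = False) -> Table:
    """Return the rows ordered by the `by` column."""
    return sorted(table, key=lambda row: row[by], reverse=descending)

def op_select(table: Table, columns: list[str]) -> Table:
    """Keep only the listed columns in each row."""
    return [{c: row[c] for c in columns} for row in table]

OPERATORS: dict[str, Callable[..., Table]] = {
    "filter": op_filter,
    "sort": op_sort,
    "select": op_select,
}

def run_pipeline(table: Table, pipeline: list[tuple[str, dict[str, Any]]]) -> Table:
    """Apply each (operator name, parameters) step to the table in order."""
    for name, params in pipeline:
        table = OPERATORS[name](table, **params)
    return table

# The NL instruction a user might write.
instruction = (
    "Keep orders above 100, sort by amount descending, "
    "and show only id and amount."
)

# A hand-written target pipeline for that instruction (assumed encoding).
pipeline = [
    ("filter", {"column": "amount", "value_gt": 100}),
    ("sort", {"by": "amount", "descending": True}),
    ("select", {"columns": ["id", "amount"]}),
]

orders = [
    {"id": 1, "amount": 250, "region": "EU"},
    {"id": 2, "amount": 80, "region": "US"},
    {"id": 3, "amount": 400, "region": "US"},
]

result = run_pipeline(orders, pipeline)
print(result)
```

In this toy setting, the two difficulties named in the abstract map onto two separate decisions: choosing and ordering the three steps corresponds to multi-step compositional logic, while mapping "above 100" to `value_gt=100` on the `amount` column corresponds to parameter grounding.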
