DeepPrep: An LLM-Powered Agentic System for Autonomous Data Preparation
Meihao Fan, Ju Fan, Yuxin Zhang, Shaolei Zhang, Xiaoyong Du, Jie Song, Peng Li, Fuxin Jiang, Tieying Zhang, Jianjun Chen

TL;DR
DeepPrep is an LLM-powered agentic system that autonomously constructs data preparation pipelines through iterative, execution-grounded interaction. It supports structured revision of earlier decisions and reaches accuracy comparable to GPT-5 at 15x lower inference cost.
Contribution
DeepPrep introduces a tree-based agentic reasoning framework and a progressive training method for autonomous, efficient, and accurate data preparation using LLMs.
Findings
Achieves data preparation accuracy comparable to GPT-5
Incurs 15x lower inference cost than GPT-5
Outperforms existing open-source baselines across datasets
Abstract
Data preparation, which aims to transform heterogeneous and noisy raw tables into analysis-ready data, remains a major bottleneck in data science. Recent approaches leverage large language models (LLMs) to automate data preparation from natural language specifications. However, existing LLM-powered methods either make decisions without grounding in intermediate execution results, or rely on linear interaction processes that offer limited support for revising earlier decisions. To address these limitations, we propose DeepPrep, an LLM-powered agentic system for autonomous data preparation. DeepPrep constructs data preparation pipelines through iterative, execution-grounded interaction with an environment that materializes intermediate table states and returns runtime feedback. To overcome the limitations of linear interaction, DeepPrep organizes pipeline construction with tree-based…
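The idea of execution-grounded, tree-structured pipeline construction can be illustrated with a small sketch. This is not DeepPrep's implementation: the abstract is truncated above and the paper's algorithm is not reproduced here. All names (`Node`, `execute`, `build_pipeline`, `propose`, `is_done`) are hypothetical, and the fixed `propose` function merely stands in for an LLM that would suggest the next operator after inspecting the current table state. The sketch shows only the general pattern: each operator is actually executed, runtime errors are returned as feedback, and a failed branch is abandoned in favor of a sibling instead of forcing a linear retry.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Table = list[dict]              # a table as a list of row dictionaries
Op = Callable[[Table], Table]   # an operator maps one table state to the next


@dataclass
class Node:
    """One materialized table state in the search tree (hypothetical structure)."""
    table: Table
    op_name: Optional[str] = None
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)

    def pipeline(self) -> list[str]:
        """Operator names on the path from the root to this node."""
        names, node = [], self
        while node is not None and node.op_name is not None:
            names.append(node.op_name)
            node = node.parent
        return names[::-1]


def execute(table: Table, op: Op):
    """Run op on a copy of the table; return (new_table, None) or (None, error)."""
    try:
        return op([dict(row) for row in table]), None
    except Exception as exc:  # runtime feedback instead of a crash
        return None, f"{type(exc).__name__}: {exc}"


def build_pipeline(raw: Table, propose, is_done, max_depth: int = 4):
    """Depth-first tree search; failed operators are logged and skipped."""
    log = []

    def expand(node: Node, depth: int) -> Optional[Node]:
        if is_done(node.table):
            return node
        if depth == max_depth:
            return None
        for name, op in propose(node.table):
            new_table, err = execute(node.table, op)
            if err is not None:
                log.append((name, err))   # feedback a real agent could read
                continue                  # try a sibling branch instead
            child = Node(new_table, name, node)
            node.children.append(child)
            found = expand(child, depth + 1)
            if found is not None:
                return found
        return None

    return expand(Node(raw), 0), log


# Toy operators and a fixed proposal order (an LLM would choose these).
def cast_price(t: Table) -> Table:
    return [{**r, "price": float(r["price"])} for r in t]


def drop_missing(t: Table) -> Table:
    return [r for r in t if r["price"] != "n/a"]


def propose(table: Table):
    return [("cast_price", cast_price), ("drop_missing", drop_missing)]


def is_done(table: Table) -> bool:
    return all(isinstance(r["price"], float) for r in table)


raw = [
    {"item": "a", "price": "1.5"},
    {"item": "b", "price": "n/a"},
    {"item": "c", "price": "2"},
]
leaf, log = build_pipeline(raw, propose, is_done)
print(leaf.pipeline())
print(log)
```

In this toy run, casting the raw column fails on the `n/a` value, the error is recorded, and the search moves to the sibling branch that drops the bad row first. A learned policy would replace the fixed `propose` ordering and could use the logged errors when choosing the next operator.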
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Machine Learning and Data Classification
