Dataforge: Agentic Platform for Autonomous Data Engineering
Xinyuan Wang, Hongyu Cao, Kunpeng Liu, Yanjie Fu

TL;DR
Dataforge is an LLM-powered agentic platform that automates data cleaning and feature optimization for tabular data, achieving the best overall downstream performance across the tabular benchmarks evaluated while reducing the need for expert intervention.
Contribution
The paper introduces Dataforge, an LLM-based agentic system for autonomous data engineering that aims to make data preparation more efficient and accurate across diverse datasets.
Findings
Achieves the best overall downstream performance across tabular benchmarks
Ablations confirm that routing and iterative refinement contribute to accuracy and reliability
Ablations confirm that grounding contributes to accuracy and reliability
Abstract
The growing demand for artificial intelligence (AI) applications in materials discovery, molecular modeling, and climate science has made data preparation a critical but labor-intensive bottleneck. Raw data from diverse sources must be cleaned, normalized, and transformed to become AI-ready, where effective feature transformation and selection are essential for robust learning. We present Dataforge, an LLM-powered agentic data engineering platform for tabular data that is automatic, safe, and non-expert friendly. It autonomously performs data cleaning and iteratively optimizes feature operations under a budgeted feedback loop with automatic stopping. Across tabular benchmarks, it achieves the best overall downstream performance; ablations further confirm the roles of routing/iterative refinement and grounding in accuracy and reliability. Dataforge demonstrates a practical path toward…
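The "budgeted feedback loop with automatic stopping" mentioned in the abstract can be illustrated with a minimal sketch. This is not Dataforge's implementation: the abstract does not specify the stopping rule or the search strategy, so the greedy search, the `patience` and `min_gain` parameters, and every name below (`budgeted_refine`, `propose`, `evaluate`, the toy feature pool) are assumptions chosen for illustration only. In a real system `propose` would presumably be driven by an LLM agent and `evaluate` by a downstream model score.

```python
from typing import Callable, Iterable, Tuple

State = Tuple[str, ...]  # hypothetical: the current set of feature names


def budgeted_refine(
    initial: State,
    propose: Callable[[State], Iterable[State]],
    evaluate: Callable[[State], float],
    budget: int = 20,
    patience: int = 2,
    min_gain: float = 1e-3,
) -> Tuple[State, float, int]:
    """Greedy refinement that stops when the evaluation budget is spent
    or when `patience` consecutive rounds bring no gain above `min_gain`."""
    best, best_score = initial, evaluate(initial)
    used = 1     # evaluations spent so far (the initial one counts)
    stalled = 0  # consecutive rounds without a meaningful gain
    while used < budget and stalled < patience:
        round_best, round_score = best, best_score
        for candidate in propose(best):
            if used >= budget:
                break  # budget exhausted in the middle of a round
            score = evaluate(candidate)
            used += 1
            if score > round_score:
                round_best, round_score = candidate, score
        if round_score - best_score > min_gain:
            best, best_score = round_best, round_score
            stalled = 0
        else:
            stalled += 1  # automatic stopping once this reaches `patience`
    return best, best_score, used


# Toy stand-ins (assumptions): a fixed pool of candidate features and a
# deterministic score in place of a real downstream model evaluation.
POOL = ("a", "b", "a*b", "log_a")
GAIN = {"a": 0.30, "b": 0.20, "a*b": 0.05, "log_a": 0.0}


def propose(state: State) -> Iterable[State]:
    # Each candidate adds one unused feature to the current set.
    return [state + (f,) for f in POOL if f not in state]


def evaluate(state: State) -> float:
    return 0.5 + sum(GAIN[f] for f in state)


best, score, used = budgeted_refine((), propose, evaluate, budget=50, patience=1)
print(best, round(score, 3), used)
```

Counting every call to `evaluate` against the budget keeps the cost of the loop bounded even when `propose` returns many candidates, and the patience counter ends the search early once additional feature operations stop paying off.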
Taxonomy
Topics: Machine Learning and Data Classification · Data Quality and Management · Machine Learning in Materials Science
