DataMaster: Data-Centric Autonomous AI Research
Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, Siheng Chen

TL;DR
DataMaster introduces an autonomous data engineering framework that optimizes data selection, discovery, and transformation to enhance machine learning performance without changing the underlying algorithms.
Contribution
It presents a novel data-agent framework with tree-structured search, shared data pools, and memory components for autonomous data engineering in ML systems.
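As a rough illustration of how tree-structured search, a shared data pool, and a memory of past attempts could fit together, here is a minimal sketch. It is not the authors' implementation: the `Node` class, the `tree_search` function, the greedy best-first expansion rule, and the toy scoring function are all assumptions made for this example. The toy score stands in for "train the fixed learner and measure validation performance".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One candidate data recipe in the search tree (hypothetical structure)."""
    recipe: tuple                      # names of the data sources in this recipe
    score: float                       # downstream score obtained with this recipe
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)


def tree_search(sources, evaluate, budget):
    """Greedy best-first search: repeatedly refine the best unexpanded recipe."""
    pool = {}      # shared data pool: any branch can reuse a source once fetched
    memory = []    # (recipe, score) for every recipe that has been validated
    root = Node(recipe=(), score=evaluate([]))
    memory.append((root.recipe, root.score))
    frontier, best = [root], root
    for _ in range(budget):
        if not frontier:
            break
        node = max(frontier, key=lambda n: n.score)
        frontier.remove(node)
        for name, data in sources.items():
            if name in node.recipe:
                continue
            pool.setdefault(name, data)
            recipe = tuple(sorted(node.recipe + (name,)))
            # Memory check: never re-validate a recipe another branch already tried.
            if any(seen == recipe for seen, _ in memory):
                continue
            combined = [x for n in recipe for x in pool[n]]
            child = Node(recipe, evaluate(combined), parent=node)
            node.children.append(child)
            memory.append((child.recipe, child.score))
            frontier.append(child)
            if child.score > best.score:
                best = child
    return best, memory


def toy_evaluate(data):
    """Toy stand-in for training a fixed learner: positives help, negatives hurt."""
    return sum(1 for x in data if x > 0) - sum(1 for x in data if x < 0)


# Hypothetical sources: "a" is clean, "b" is mostly clean, "c" is pure noise.
SOURCES = {"a": [1, 2, 3], "b": [4, 5, -1], "c": [-5, -6]}
best, memory = tree_search(SOURCES, toy_evaluate, budget=10)
print(best.recipe, best.score, len(memory))
```

In this toy setting the learner (`toy_evaluate`) never changes; only the composition of the data does, which mirrors the constraint that the learning algorithm stays fixed while the data side is optimized.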
Findings
Improves medal rate by 32.27% on MLE-Bench Lite.
Surpasses the instruct model on GPQA with 31.02% accuracy.
Demonstrates effective autonomous data optimization in benchmarks.
Abstract
As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose…
