DataMaster: Data-Centric Autonomous AI Research

Yaxin Du; Xiyuan Yang; Zhifan Zhou; Wanxu Liu; Zixing Lei; Zimeng Chen; Fenyi Liu; Haotian Wu; Yuzhu Cai; Zexi Liu; Xinyu Zhu; WenHao Wang; Linfeng Zhang; Chen Qian; Siheng Chen

arXiv:2605.10906 · cs.LG · May 14, 2026

TL;DR

DataMaster introduces an autonomous data engineering framework that optimizes data selection, discovery, and transformation to enhance machine learning performance without changing the underlying algorithms.

Contribution

The paper presents a novel data-agent framework for autonomous data engineering in ML systems, combining tree-structured search, shared data pools, and memory components.
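To make the idea of searching over data-side changes with a fixed learner concrete, here is a minimal toy sketch. It is not DataMaster's algorithm: the operations (`drop_outliers`, `add_external`, `dedupe`), the scoring function, the `TARGET` constant, and the best-first strategy are all illustrative assumptions, chosen only to show how a search tree, a shared pool of external data, and a memory of already-evaluated datasets could fit together.

```python
import heapq
from statistics import mean, median

TARGET = 10.0  # hypothetical ground truth the fixed learner should recover


def fixed_learner_score(data):
    """Stand-in for 'train the unchanged algorithm, then validate' (higher is better)."""
    return -abs(mean(data) - TARGET)


# Hypothetical data-side operations; each maps (data, pool) -> new dataset.
def drop_outliers(data, pool):
    m = median(data)
    kept = [x for x in data if abs(x - m) <= 5.0]
    return kept or data  # never return an empty dataset


def add_external(data, pool):
    return data + pool  # pull in everything from the shared pool


def dedupe(data, pool):
    return sorted(set(data))


OPS = {"drop_outliers": drop_outliers, "add_external": add_external, "dedupe": dedupe}


def search(initial, pool, budget=20, max_depth=3):
    """Best-first search over sequences of data operations."""
    root_score = fixed_learner_score(initial)
    memory = {tuple(initial): root_score}  # dataset state -> score already measured
    frontier = [(-root_score, (), initial)]  # min-heap on negated score
    best = (root_score, (), initial)
    evaluations = 0
    while frontier and evaluations < budget:
        _, path, data = heapq.heappop(frontier)
        if len(path) >= max_depth:
            continue
        for name, op in OPS.items():
            if evaluations >= budget:
                break
            child = op(data, pool)
            state = tuple(child)
            if state in memory:  # reached via another branch; skip re-evaluation
                continue
            score = fixed_learner_score(child)
            evaluations += 1
            memory[state] = score
            if score > best[0]:
                best = (score, path + (name,), child)
            heapq.heappush(frontier, (-score, path + (name,), child))
    return best, memory


initial = [9.0, 10.0, 11.0, 10.0, 60.0]
pool = [9.5, 10.5]
best, memory = search(initial, pool)
print(best[1], best[0])
```

In this toy setting the learner (a plain mean) never changes; only the dataset does. The memory keeps the search from paying for the same dataset twice when different branches converge on it, which stands in for the costly downstream training that validates each candidate.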

Findings

01. Improves medal rate by 32.27% on MLE-Bench Lite.

02. Surpasses instruct model on GPQA with 31.02% accuracy.

03. Demonstrates effective autonomous data optimization in benchmarks.

Abstract

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose…
