R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science

Xu Yang; Xiao Yang; Shikai Fang; Yifei Zhang; Jian Wang; Bowen Xian; Qizheng Li; Jingyuan Li; Minrui Xu; Yuante Li; Haoran Pan; Yuge Zhang; Weiqing Liu; Yelong Shen; Weizhu Chen; and Jiang Bian

arXiv:2505.14738·cs.AI·October 2, 2025

TL;DR

R&D-Agent is a new framework that formalizes the machine learning engineering process into a structured, testable workflow, enabling the development of agents that outperform existing solutions in data science tasks.

Contribution

The paper introduces R&D-Agent, a decoupled, extensible framework that formalizes machine learning engineering (MLE) into two phases and six components, facilitating the design of high-performance autonomous data science agents.

Findings

01 · R&D-Agent-based agents achieve state-of-the-art performance on MLE-Bench.

02 · The top-performing agent has a 35.1% medal rate, outperforming existing solutions.

03 · The framework accelerates innovation and enhances accuracy in data science applications.
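
For readers unfamiliar with the metric in finding 02: a medal rate on MLE-Bench is the share of competitions in which an agent's submission earns at least a bronze medal. A minimal Python sketch of that arithmetic follows. The function name `medal_rate`, the result mapping, and the competition names are hypothetical, and the outcomes are made up for illustration rather than taken from the paper.

```python
from typing import Mapping, Optional

# Medal labels that count toward the "any medal" rate.
MEDALS = {"gold", "silver", "bronze"}


def medal_rate(results: Mapping[str, Optional[str]]) -> float:
    """Return the fraction of competitions in which any medal was earned.

    `results` maps a competition name to "gold", "silver", "bronze",
    or None when no medal was earned (hypothetical data layout).
    """
    if not results:
        raise ValueError("results must contain at least one competition")
    medalled = sum(1 for medal in results.values() if medal in MEDALS)
    return medalled / len(results)


# Illustrative, made-up outcomes (not the paper's data).
example = {
    "competition-a": "gold",
    "competition-b": None,
    "competition-c": "bronze",
    "competition-d": None,
}
print(f"{medal_rate(example):.1%}")  # 50.0%
```

Reported rates are typically averaged over several independent runs, so a published figure need not correspond to a whole number of competitions in any single run.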

Abstract

Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. Although crowd-sourcing platforms alleviate some challenges, high-level machine learning engineering (MLE) tasks remain labor-intensive and iterative. We introduce R&D-Agent, a comprehensive, decoupled, and extensible framework that formalizes the MLE process. R&D-Agent defines the MLE workflow into two phases and six components, turning agent design for MLE from ad-hoc craftsmanship into a principled, testable process. Although several existing agents report promising gains on their chosen components, they can mostly be summarized as a partial optimization from our framework's simple baseline. Inspired by human experts, we designed efficient and effective agents within this framework that achieve state-of-the-art performance. Evaluated on…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper shows leading “Any Medal” performance on MLE-Bench when using the most recent GPT-5 model.

Weaknesses

1. On the “Gold” metric, the agent still does not perform as well as ML-Master, which uses DeepSeek-R1, a model released approximately half a year earlier than GPT-5. It is recommended to test the proposed agent system with DeepSeek-R1 for a fair comparison.
2. The paper claims “All the existing methods can be summarized as a partial optimization from our framework’s simple baseline”. This is obviously an over-claim, since many existing methods were published/released earlier than this work,

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Originality: Interesting six-component multi-agent system, though mainly an integration of existing approaches rather than new ones.
- Quality: Very comprehensive and rigorous evaluations.
- Clarity: Well-written paper: well exposed, well documented in the appendix, and clearly positioned.
- Significance: Shows that it outperforms other methods on ML research tasks.

Weaknesses

- Limited novelty: Individual components use established techniques (MCTS, tree search, memory systems, iterative debugging). This paper is also not the first to integrate many of these into agents for ML engineering and data science; e.g., the MCTS figures even look similar to those of Climb-DC, and the path structure mechanism resembles DataInterpreter.
- Many missing baselines: DataInterpreter, Climb-DC, MLCopilot, LAMBDA, AutoML-Agent. Please outline the differences from these works.
- Only clo

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The core idea of separating the research and development phases is interesting and conceptually aligns with how human ML practitioners operate.
2. The experimental results on MLE-Bench are promising and suggest potential for structured LLM-driven engineering pipelines.

Weaknesses

Deeper analysis is needed:
1. While the paper includes baseline comparisons and ablation studies, it is unclear what limits the overall system performance. Is the bottleneck primarily in the research phase (e.g., limited exploration despite the proposed reasoning techniques), or in the development phase (e.g., LLMs struggling to reliably use external libraries or tools)? A more detailed investigation could clarify this.
2. The benchmark used in the paper, although containing 40 competitions, app

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- A large number of experiments are conducted on the whole set of MLE-Bench.
- The writing is of good quality.
- Ablation studies on a subset of MLE-Bench are conducted to demonstrate the effectiveness of the proposed framework.

Weaknesses

- The novelty of this paper is limited. All the contributions are more like engineering efforts and small tricks. There is no fundamental technical contribution in this paper. Also, all the techniques have already been investigated by the community. I believe this paper would not bring novel technical insights for the community.
- From the perspective of empirical findings, although this paper performs large-scale experiments on MLE-Bench, the experiment design is not convincing enough for ensuring f

Code & Models

Repositories

microsoft/rd-agent · pytorch · Official

Taxonomy

Topics: Big Data and Business Intelligence · Data Mining Algorithms and Applications