R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science
Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian

TL;DR
R&D-Agent is a new framework that formalizes the machine learning engineering process into a structured, testable workflow, enabling the development of agents that outperform existing solutions in data science tasks.
Contribution
The paper introduces R&D-Agent, a decoupled, extensible framework that formalizes machine learning engineering (MLE) into two phases and six components, facilitating the design of high-performance autonomous data science agents.
Findings
R&D-Agent-based agents achieve state-of-the-art performance on MLE-Bench.
The top-performing agent has a 35.1% medal rate, outperforming existing solutions.
The framework accelerates innovation and enhances accuracy in data science applications.
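The medal rate quoted above can be made concrete with a small sketch. It assumes the medal rate is the fraction of benchmark competitions in which an agent's submission earns any medal (gold, silver, or bronze), which is how the reviews below distinguish "Any Medal" from "Gold". The function name `medal_rates` and the competition outcomes are hypothetical and are not taken from the paper.

```python
from collections import Counter

MEDALS = ("gold", "silver", "bronze")


def medal_rates(results):
    """Compute per-medal and any-medal rates.

    results maps a competition name to "gold", "silver", "bronze",
    or None when no medal was earned (assumed input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    total = len(results)
    # Count only recognized medals; None and other values count as no medal.
    counts = Counter(m for m in results.values() if m in MEDALS)
    rates = {medal: counts[medal] / total for medal in MEDALS}
    rates["any"] = sum(counts.values()) / total
    return rates


# Hypothetical outcomes for illustration only, not data from the paper.
example = {"comp_a": "gold", "comp_b": None, "comp_c": "bronze", "comp_d": None}
rates = medal_rates(example)
print(f"any medal: {rates['any']:.1%}, gold: {rates['gold']:.1%}")
```

Under this definition an agent can lead on the any-medal rate while trailing on the gold rate, because the two figures count different subsets of the same competitions.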
Abstract
Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. Although crowd-sourcing platforms alleviate some challenges, high-level machine learning engineering (MLE) tasks remain labor-intensive and iterative. We introduce R&D-Agent, a comprehensive, decoupled, and extensible framework that formalizes the MLE process. R&D-Agent divides the MLE workflow into two phases and six components, turning agent design for MLE from ad-hoc craftsmanship into a principled, testable process. Although several existing agents report promising gains on their chosen components, they can mostly be summarized as a partial optimization from our framework's simple baseline. Inspired by human experts, we designed efficient and effective agents within this framework that achieve state-of-the-art performance. Evaluated on…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper shows leading “Any Medal” performance on MLE-Bench when using the most recent GPT-5 model.
1. Regarding “Gold” performance, it still does not perform as well as ML-Master, which uses DeepSeek-R1, released approximately half a year earlier than GPT-5. It is recommended to test the performance of the proposed agent system using DeepSeek-R1 for fair comparison.
2. The paper claims “All the existing methods can be summarized as a partial optimization from our framework’s simple baseline”. It is obviously an over-claim since many existing methods were published/released earlier than this work,
Originality - Interesting 6-component multi-agent system. Mainly an integration of existing approaches rather than new approaches.
Quality - Very comprehensive and rigorous evaluations conducted.
Clarity - Well-written paper: well exposed, well documented in the appendix and clearly positioned.
Significance - Shows it outperforms other methods on ML research tasks.
- Limited novelty: Individual components use established techniques (MCTS, tree search, memory systems, iterative debugging). This paper is also not the first to integrate many of these into agents for ML engineering and data science, e.g. the MCTS figures even look similar to Climb-DC, and the path structure mechanism is like that of DataInterpreter.
- Many missing baselines: DataInterpreter, Climb-DC, MLCopilot, LAMBDA, AutoML-Agent. Please outline the differences to these works.
- Only clo
1. The core idea of separating the research and development phases is interesting and conceptually aligns with how human ML practitioners operate.
2. The experimental results on MLE-bench are promising and suggest potential for structured LLM-driven engineering pipelines.
Deeper analysis is needed:
1. While the paper includes baseline comparisons and ablation studies, it is unclear what limits the overall system performance. Is the bottleneck primarily in the research phase (e.g., limited exploration despite the proposed reasoning techniques), or in the development phase (e.g., LLMs struggling to reliably use external libraries or tools)? A more detailed investigation could clarify this.
2. The benchmark used in the paper, although containing 40 competitions, app
- A large number of experiments on the whole set of MLE-Bench are conducted.
- The writing is of good quality.
- Ablation studies on a subset of MLE-Bench are conducted to demonstrate the effectiveness of the proposed framework.
- The novelty of this paper is limited. All the contributions are more like engineering efforts and small tricks. There is no fundamental technical contribution in this paper. Also, all the techniques are already investigated by the community. I believe this paper would not bring novel technical insights for the community.
- From the perspective of empirical findings, although this paper performs large-scale experiments on MLE-Bench, the experiment design is not convincing enough for ensuring f
Taxonomy
Topics: Big Data and Business Intelligence · Data Mining Algorithms and Applications
