From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking

Gyeongwon James Kim; Alex Wilf; Louis-Philippe Morency; Daniel Fried

arXiv:2506.19724·cs.AI·June 25, 2025

TL;DR

AutoExperiment is a new benchmark that assesses AI agents' ability to implement, run, and reproduce machine learning experiments from research papers. Agent performance drops as more of the code is masked out, which highlights key challenges in automating scientific work.

Contribution

We introduce AutoExperiment, a scalable benchmark for evaluating AI agents' ability to perform scientific code reproduction and replication from natural language descriptions.

Findings

01. Performance drops as the number of missing functions increases

02. Environment-interacting agents outperform fixed-harness agents

03. Significant gap between single-shot and multi-trial success rates
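
The gap between single-shot and multi-trial success in the third finding can be made concrete with a small calculation. The sketch below uses the widely used unbiased pass@k estimator. Whether the paper computes its success rates this way is an assumption, and the function name and the trial counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k trials,
    drawn without replacement from n recorded trials with c successes, succeeds.

    Assumption: this is a standard estimator from the code-generation
    literature, not necessarily the metric used in the paper.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 5 recorded trials, exactly 1 of which reproduced the result.
single_shot = pass_at_k(n=5, c=1, k=1)   # 0.2
multi_trial = pass_at_k(n=5, c=1, k=5)   # 1.0
print(f"pass@1 = {single_shot:.2f}, pass@5 = {multi_trial:.2f}")
```

A task that an agent solves only occasionally scores low on a single attempt but high when any of several attempts counts, which is one way such a gap can arise.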

Abstract

Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can implement scientific ideas when given varied amounts of code as a starting point, interpolating between reproduction (running code) and from-scratch replication (fully re-implementing and running code). We introduce AutoExperiment, a benchmark that evaluates AI agents' ability to implement and run machine learning experiments based on natural language descriptions in research papers. In each task, agents are given a research paper, a codebase with key functions masked out, and a command to run the experiment. The goal is to generate the missing code, execute the experiment in a sandboxed environment, and reproduce the results. AutoExperiment scales in…
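
To illustrate the task setup described in the abstract (a codebase with key functions masked out), here is a minimal sketch of function masking using Python's standard `ast` module. It is an illustration under stated assumptions, not the benchmark's actual masking code: the helper name `mask_functions`, the choice to keep docstrings, and the `NotImplementedError` stub are all hypothetical.

```python
import ast


def mask_functions(source: str, names: set[str]) -> str:
    """Return `source` with the body of each named function replaced by a stub.

    Hypothetical helper: the signature and any docstring are kept so an agent
    can still see what the function is meant to do; everything else is removed.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name in names:
            stub = ast.parse("raise NotImplementedError('masked')").body
            if ast.get_docstring(node) is not None:
                # The docstring is the first statement of the body; keep it.
                node.body = [node.body[0]] + stub
            else:
                node.body = stub
    # ast.unparse requires Python 3.9 or later; comments and formatting are lost.
    return ast.unparse(tree)


example = '''
def loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def scale(x):
    return 2 * x
'''

# Masking one function corresponds to n = 1; passing more names raises n.
print(mask_functions(example, {"loss"}))
```

Increasing the number of names passed in moves a task from near-reproduction (little code missing) toward from-scratch replication (most code missing).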

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper proposes a new benchmark for evaluating the performance of agents on machine learning code replication tasks.
2. It evaluates the performance of different models and agents under varying task settings (i.e., partial code replication and full replication).

Weaknesses

1. Novelty of the Benchmark: The benchmark is presented in two settings: partial code replication and full code replication. For full replication, a significant body of related research already exists. The partial replication setting closely resembles many existing code-filling tasks studied in the context of code models.
2. Rigorousness of Experimental Setup: The paper uses the number of masked functions (n) as its primary analysis target and difficulty metric, concluding that a larger 'n' decr

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Appropriate and opportune benchmark: scientific coding is rapidly becoming a major research area, and we still lack good benchmarks
- Clear and well written
- Builds on prior work (MLRC) nicely
- Nicely scalable levels of difficulty in the benchmark

Weaknesses

- Most significantly, this paper seems very similar to "ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code" (https://arxiv.org/abs/2506.02314), significantly weakening the paper's claims of novelty. Please describe the key differences and benefits of your work relative to this prior paper.
- The stated findings from the experiments seem obvious (e.g., performance degrades with more masking; debugging helps). What surprising discoveries did you make in your experiments?

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- ***Solid compute setup***: The computational setup for running the benchmark is decent, and well beyond what many other papers on scientific reproduction using AI agents use.
- ***Focus on reproducible results (such as by using Docker)***: I appreciate the authors' efforts to make their study reproducible, including using Docker, and making all of their code and analysis openly available.
- ***Qualitative analysis***: Many points in the paper include some qualitative analysis, which is quite

Weaknesses

- In the introduction, there's a typo ("agentive", should this be "agentic"?)
- No train-test split: As far as I understand, there's no "train" split in the data. How should researchers using the benchmark optimize and develop their agents without a train set? This raises concerns about leakage (even in the results reported in the paper, since there are many ablations, for example, in the ReAct agents)
- No reporting of cost, time taken by agents (which can be important for real-world use)
- Why

Code & Models

Repositories

j1mk1m/autoexperiment · PyTorch · Official

Taxonomy

Topics: Semantic Web and Ontologies · Software Engineering Research · Model-Driven Software Engineering Techniques