EXP-Bench: Can AI Conduct AI Research Experiments?

Patrick Tser Jern Kon; Jiachen Liu; Xinyi Zhu; Qiuyi Ding; Jingjia Peng; Jiarong Xing; Yibo Huang; Yiming Qiu; Jayanth Srinivasa; Myungjin Lee; Mosharaf Chowdhury; Matei Zaharia; Ang Chen

arXiv:2505.24785·cs.AI·June 3, 2025

TL;DR

EXP-Bench is a new benchmark that evaluates AI agents on their ability to conduct complete AI research experiments, highlighting current limitations and guiding future improvements.

Contribution

It introduces a semi-autonomous pipeline to extract and structure experimental tasks from AI papers, creating a comprehensive benchmark for AI research automation.

Findings

01

Leading LLM agents score only 20-35% on individual experimental aspects such as design or implementation correctness

02

Complete experiment success rate is only 0.5%

03

EXP-Bench enables systematic evaluation of AI research capabilities
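
The gap between the first two findings follows from how a conjunctive metric behaves: a task counts as fully successful only if every aspect is correct at once. A minimal sketch of that calculation is below. The four aspect names are an assumption based on the stages listed in the abstract, and the toy scores are invented for illustration; they are not benchmark data.

```python
from dataclasses import dataclass

# Aspect names are an assumption for illustration, loosely following the
# stages named in the abstract (design, implement, execute, analyze).
ASPECTS = ("design", "implementation", "execution", "conclusion")


@dataclass
class TaskScore:
    """Hypothetical pass/fail record for one task, one flag per aspect."""
    design: bool
    implementation: bool
    execution: bool
    conclusion: bool


def aspect_rates(scores):
    """Fraction of tasks that pass each aspect, judged independently."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return {a: sum(getattr(s, a) for s in scores) / n for a in ASPECTS}


def complete_rate(scores):
    """Fraction of tasks that pass every aspect simultaneously."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(all(getattr(s, a) for a in ASPECTS) for s in scores)
    return passed / len(scores)


# Invented toy data: eight tasks, not drawn from EXP-Bench.
toy = [
    TaskScore(True, False, False, False),
    TaskScore(False, True, False, False),
    TaskScore(False, False, True, False),
    TaskScore(False, False, False, True),
    TaskScore(True, True, False, False),
    TaskScore(False, False, True, True),
    TaskScore(True, True, True, True),
    TaskScore(False, False, False, False),
]

print(aspect_rates(toy))   # each aspect passes on 3 of 8 tasks
print(complete_rate(toy))  # only 1 of 8 tasks passes all aspects
```

In this toy set each aspect passes on 37.5% of tasks, yet only 12.5% of tasks pass all four, which shows how per-aspect scores can sit far above the complete-experiment rate.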

Abstract

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper builds well on previous work on evaluating AI agents on coding tasks in ML research. While other papers have focused on reproducing scientific papers, the tasks in EXP-Bench go one level further and involve implementing experiments given a high-level outline.

Weaknesses

- Missing citations to directly relevant outstanding work [1].
- The number of source papers from which tasks were generated is rather low.
- The set of chosen models is outdated and does not include the latest agentic coding models (GPT-5, Claude Sonnet 4.5, etc.). Even if the contamination problem exists, an analysis of how well models memorize the papers that are part of the benchmark could be a good sanity check.
- No human error analysis.

[1] https://arxiv.org/abs/2409.11363

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Originality: The paper introduces a benchmark, EXP-Bench, targeting a rarely studied but crucial problem: evaluating AI agents’ ability to perform complete research experiments. This “end-to-end scientific experimentation” framing goes beyond existing reasoning or coding benchmarks, representing a clear conceptual advancement.

Quality: The proposed semi-automated curation pipeline is technically well-motivated and methodologically sound. It combines multimodal extraction, code analysis, and exe

Weaknesses

1. Reliability of LLM-as-a-Judge evaluation: The benchmark relies exclusively on an LLM-based judge to assess design and conclusion correctness, without any reported human calibration. This raises concerns about evaluation reliability and potential self-consistency bias, since the same modeling paradigm being evaluated also defines the scoring criteria. Including a limited human cross-check or reporting human–LLM agreement statistics would make the results more credible.
2. High computational and

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- Building on existing work instead of reinventing everything from scratch. Using existing agent implementations and building on Inspect are doubt-reducing choices.
- The use of a structured workflow and the multi-pass retrieval for generating the tasks, instead of just trying to one-shot LLMs
- Including some amount of human review in the process, validating with a final human validation at the end of the task creation process
- The use of multiple metrics and looking at the distribution of scor

Weaknesses

I think the main weakness is that by splitting the paper's attention across the task-generation pipeline and the results of the agents on the tasks, there isn't enough space to deeply explore either. Perhaps my strongest recommendation is to reduce the scope of the paper and focus on either the results of the agents on the tasks--with detailed analysis of agent transcripts, error modes, possible false positives or false negatives--or on the task creation pipeline and validating that these tasks

Code & Models

Repositories

just-curieous/curie
none · Official

Datasets

Just-Curieous/EXP-Bench
dataset · 49 dl
