EXP-Bench: Can AI Conduct AI Research Experiments?
Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, Ang Chen

TL;DR
EXP-Bench is a new benchmark that evaluates AI agents on their ability to conduct complete AI research experiments, highlighting current limitations and guiding future improvements.
Contribution
It introduces a semi-autonomous pipeline to extract and structure experimental tasks from AI papers, creating a comprehensive benchmark for AI research automation.
Findings
Leading LLM agents score only 20-35% on individual experimental aspects, such as design or implementation correctness
Complete experiment success rate is only 0.5%
EXP-Bench enables systematic evaluation of AI research capabilities
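The gap between the partial scores and the complete-experiment success rate is what an all-or-nothing metric produces when the partial successes rarely fall on the same task. The sketch below is a minimal illustration on made-up data, not EXP-Bench's scoring code: the `TaskResult` fields, the helper names, and the four sample results are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-aspect verdicts for one task (names are illustrative).
    design_ok: bool
    implementation_ok: bool
    execution_ok: bool
    conclusion_ok: bool


ASPECTS = ("design_ok", "implementation_ok", "execution_ok", "conclusion_ok")


def aspect_rate(results, aspect):
    """Fraction of tasks on which a single aspect was judged correct."""
    return sum(getattr(r, aspect) for r in results) / len(results)


def all_correct_rate(results):
    """Fraction of tasks on which every aspect was judged correct."""
    return sum(all(getattr(r, a) for a in ASPECTS) for r in results) / len(results)


# Made-up results: each aspect succeeds on a different task.
results = [
    TaskResult(True, False, False, False),
    TaskResult(False, True, False, False),
    TaskResult(False, False, True, False),
    TaskResult(False, False, False, True),
]

for aspect in ASPECTS:
    print(aspect, aspect_rate(results, aspect))
print("all aspects correct", all_correct_rate(results))
```

In this toy data every aspect is correct on a quarter of the tasks, yet no task passes all four checks, so the conjunctive rate is zero.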
Abstract
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper builds well on previous work evaluating AI agents on coding tasks in ML research. While other papers have focused on reproducing scientific papers, the tasks in EXP-Bench go one level further and involve implementing experiments from a high-level outline.
- Missing citations to directly relevant outstanding work [1].
- The number of source papers from which tasks were generated is rather low.
- The set of chosen models is outdated and does not include the latest agentic coding models (GPT-5, Claude Sonnet 4.5, etc.). Even if the contamination problem exists, an analysis of how well models memorize the papers that are part of the benchmark could be a good sanity check.
- No human error analysis.
[1] https://arxiv.org/abs/2409.11363
Originality: The paper introduces a benchmark, EXP-Bench, targeting a rarely studied but crucial problem: evaluating AI agents’ ability to perform complete research experiments. This “end-to-end scientific experimentation” framing goes beyond existing reasoning or coding benchmarks, representing a clear conceptual advancement.
Quality: The proposed semi-automated curation pipeline is technically well-motivated and methodologically sound. It combines multimodal extraction, code analysis, and exe
1. Reliability of LLM-as-a-Judge evaluation: The benchmark relies exclusively on an LLM-based judge to assess design and conclusion correctness, without any reported human calibration. This raises concerns about evaluation reliability and potential self-consistency bias, since the same modeling paradigm being evaluated also defines the scoring criteria. Including a limited human cross-check or reporting human–LLM agreement statistics would make the results more credible.
2. High computational and
- Building on existing work instead of reinventing everything from scratch. Using existing agent implementations and building on Inspect are doubt-reducing choices.
- The use of a structured workflow and multi-pass retrieval for generating the tasks, instead of just trying to one-shot LLMs.
- Including some amount of human review in the process, with a final human validation at the end of the task creation process.
- The use of multiple metrics and looking at the distribution of scor
I think the main weakness is that by splitting the paper's attention across the task-generation pipeline and the results of the agents on the tasks, there isn't enough space to deeply explore either. Perhaps my strongest recommendation is to reduce the scope of the paper and focus on either the results of the agents on the tasks (with detailed analysis of agent transcripts, error modes, and possible false positives or false negatives) or on the task creation pipeline and validating that these tasks
