RExBench: Can coding agents autonomously implement AI research extensions?
Nicholas Edwards, Yukyung Lee, Yujun Audrey Mao, Yulu Qin, Sebastian Schuster, Najoung Kim

TL;DR
RExBench is a new benchmark that evaluates whether large language model agents can autonomously implement extensions of existing research papers. It reveals current limitations in their ability to handle complex research tasks without human guidance.
Contribution
The paper introduces RExBench, a benchmark for assessing LLM agents' ability to autonomously implement research extensions, and evaluates existing agents, finding significant room for improvement.
Findings
All agents failed to autonomously implement most research extensions.
The best agent achieved a success rate of only about 33% without human hints.
With additional human guidance, the success rate improved but remained below 44%.
Abstract
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of realistic extensions of 12 research papers that aim to investigate novel research hypotheses. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We…
