RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Hjalmar Wijk; Tao Lin; Joel Becker; Sami Jawhar; Neev Parikh; Thomas Broadley; Lawrence Chan; Michael Chen; Josh Clymer; Jai Dhyani; Elena Ericheva; Katharyn Garcia; Brian Goodrich; Nikola Jurkovic; Holden Karnofsky; Megan Kinniment; Aron Lajko; Seraphina Nix; Lucas Sato; William Saunders; Maksym Taran; Ben West; Elizabeth Barnes

arXiv:2411.15114 · cs.LG · May 28, 2025 · 2 cites

TL;DR

RE-Bench introduces a realistic benchmark for evaluating AI R&D capabilities against human experts, showing that AI agents can outperform humans on speed and cost, while humans score higher as time budgets increase.

Contribution

The paper presents RE-Bench, a new benchmark with real-world environments and human data, enabling direct comparison of AI agents and human experts in ML research tasks.

Findings

01 AI agents outperform humans in speed and cost for R&D tasks.

02 Humans achieve higher scores with increased time budgets.

03 Modern AI agents demonstrate significant expertise in ML topics.

Abstract

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given…
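The best-of-k comparison mentioned in the abstract can be illustrated with a small sketch. This excerpt does not describe the paper's exact procedure, so the following is only an assumption-laden illustration: a fixed total time budget is split into k independent runs of equal length, and the reported score is the best of those k runs, estimated here by resampling from a pool of observed run scores. The function names `best_of_k` and `runs_per_budget`, and the example scores, are hypothetical and not taken from the paper.

```python
import random
from statistics import mean


def best_of_k(scores, k, trials=10_000, seed=0):
    """Estimate the expected best score among k runs.

    Runs are drawn with replacement from `scores`, a pool of observed
    per-run scores (illustrative values, not data from the paper).
    """
    if k < 1 or not scores:
        raise ValueError("need k >= 1 and at least one score")
    rng = random.Random(seed)
    return mean(max(rng.choices(scores, k=k)) for _ in range(trials))


def runs_per_budget(total_hours, run_lengths):
    """Map each run length to how many such runs fit in the total budget."""
    return {h: int(total_hours // h) for h in run_lengths if h <= total_hours}


if __name__ == "__main__":
    # Hypothetical pool of scores from short agent runs.
    pool = [0.0, 0.0, 0.1, 0.3, 0.9]
    for run_hours, k in runs_per_budget(8, [0.5, 2, 8]).items():
        print(f"{k} run(s) of {run_hours}h -> best-of-{k} ~ {best_of_k(pool, k):.2f}")
```

In practice each run length would have its own pool of scores; a single pool is used here only to keep the sketch short. Under this framing, many short runs and one long run consume the same total budget, which is what makes the comparison between agents and human experts at a given budget meaningful.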

Code & Models

Models

dexhunter/aideml (Hugging Face model)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)