CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Zachary S. Siegel; Sayash Kapoor; Nitya Nagdir; Benedikt Stroebl; Arvind Narayanan

arXiv:2409.11363·cs.CL·September 18, 2024·3 cites

Open Access · 2 Repos · 1 Dataset

TL;DR

CORE-Bench is a new benchmark designed to evaluate AI agents' ability to reproduce scientific research results across multiple disciplines, aiming to improve scientific credibility and automate routine research tasks.

Contribution

The paper introduces CORE-Bench, a benchmark of 270 tasks based on 90 scientific papers across computer science, social science, and medicine, together with an evaluation system that measures how accurately AI agents reproduce computational results.

Findings

01 Best agent achieved 21% accuracy on hardest tasks

02 Evaluation system significantly reduces testing time

03 Baseline agents show substantial room for improvement

Abstract

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both…
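
The task count in the abstract can be illustrated with a minimal sketch, assuming each of the 90 papers contributes one task per difficulty level (consistent with 270 = 90 × 3). The paper identifiers, the level labels, and the `build_tasks` and `accuracy` helpers are hypothetical placeholders, not names from CORE-Bench itself, and accuracy is taken here to be the plain fraction of tasks reproduced.

```python
from itertools import product

# Counts taken from the abstract; every identifier below is a placeholder.
NUM_PAPERS = 90
DIFFICULTY_LEVELS = ("level_1", "level_2", "level_3")  # hypothetical labels


def build_tasks(num_papers, levels):
    """Pair every paper with every difficulty level: one task per pair."""
    papers = [f"paper_{i:02d}" for i in range(num_papers)]
    return list(product(papers, levels))


def accuracy(results):
    """Fraction of tasks marked as successfully reproduced."""
    if not results:
        return 0.0
    return sum(1 for ok in results.values() if ok) / len(results)


tasks = build_tasks(NUM_PAPERS, DIFFICULTY_LEVELS)
print(len(tasks))  # 270

# Toy outcome: pretend an agent reproduced exactly the first 27 tasks.
toy_results = {task: index < 27 for index, task in enumerate(tasks)}
print(accuracy(toy_results))  # 0.1
```

The toy outcome is invented only to exercise the function; it is not a result reported for any agent.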

Code & Models

Repositories

Datasets

siegelz/core-bench
dataset · 373 downloads

Taxonomy

Topics: Scientific Computing and Data Management