REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

Chuxuan Hu; Liyun Zhang; Yeji Lim; Aum Wadhwani; Austin Peters; Daniel Kang

arXiv:2507.18901·cs.CL·July 28, 2025

TL;DR

This paper introduces REPRO-Bench, a challenging benchmark for AI agents to assess the reproducibility of social science research, revealing current AI limitations and proposing improvements.

Contribution

We created REPRO-Bench, a diverse, real-world benchmark for AI-based reproducibility assessment, and developed REPRO-Agent, which improves accuracy over existing agents by 71%.

Findings

01 The best existing AI agent achieved 21.4% accuracy

02 REPRO-Agent improved accuracy by 71% over existing agents

03 REPRO-Bench presents complex, real-world assessment tasks
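
The two headline numbers can be related with a short piece of arithmetic. The sketch below assumes the 71% figure is a relative improvement over the 21.4% baseline (this page does not say whether it is relative or in percentage points) and uses the 112 task instances mentioned in the abstract. The function and variable names are illustrative only and do not come from the paper.

```python
# Worked arithmetic only; the relative reading of "71%" is an assumption.
N_TASKS = 112            # task instances, from the abstract
BEST_BASELINE = 0.214    # best existing agent's accuracy, from the findings
RELATIVE_GAIN = 0.71     # ASSUMED to be relative, not percentage points


def implied_accuracy(baseline: float, relative_gain: float) -> float:
    """Accuracy implied by a relative improvement over a baseline."""
    return baseline * (1.0 + relative_gain)


def implied_correct(accuracy: float, n_tasks: int) -> int:
    """Nearest whole number of correctly assessed tasks."""
    return round(accuracy * n_tasks)


baseline_correct = implied_correct(BEST_BASELINE, N_TASKS)
agent_accuracy = implied_accuracy(BEST_BASELINE, RELATIVE_GAIN)
agent_correct = implied_correct(agent_accuracy, N_TASKS)

print(f"baseline: {BEST_BASELINE:.1%} (~{baseline_correct}/{N_TASKS} tasks)")
print(f"implied REPRO-Agent: {agent_accuracy:.1%} (~{agent_correct}/{N_TASKS} tasks)")
```

Under the relative reading, the implied accuracy is about 36.6%, or roughly 41 of 112 tasks, against roughly 24 of 112 for the best existing agent. If the 71% were percentage points instead, the implied accuracy would be 92.4%, so the two readings differ substantially.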

Abstract

Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench…

Code & Models

Datasets

chuxuan/REPRO-Bench (dataset, 4.2k downloads)

Taxonomy

Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Research Data Management Practices