REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang

TL;DR
This paper introduces REPRO-Bench, a challenging benchmark that tests whether AI agents can assess the reproducibility of social science research. It exposes the limitations of current agents and proposes an improved one.
Contribution
We created REPRO-Bench, a diverse, real-world benchmark for AI-based reproducibility assessment, and developed REPRO-Agent, which significantly improves accuracy over existing agents.
Findings
The best existing AI agent achieved only 21.4% accuracy
REPRO-Agent improved accuracy by 71% over the best existing agent
REPRO-Bench presents complex, real-world assessment tasks
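The two accuracy figures above can be connected with a little arithmetic. The sketch below assumes the 71% is a relative gain over the 21.4% baseline, not 71 percentage points, since an absolute reading would imply 92.4% accuracy. It also uses the 112 task instances stated in the Abstract. The function names are illustrative and do not come from the paper.

```python
def apply_relative_gain(baseline_pct: float, relative_gain_pct: float) -> float:
    """Accuracy (in percent) after a relative gain over a baseline accuracy."""
    return baseline_pct * (1 + relative_gain_pct / 100)


def correct_tasks(accuracy_pct: float, n_tasks: int) -> int:
    """Number of correctly assessed tasks implied by an accuracy, rounded."""
    return round(accuracy_pct / 100 * n_tasks)


N_TASKS = 112          # task instances, per the Abstract
BASELINE_PCT = 21.4    # best existing agent, per the Findings
RELATIVE_GAIN = 71.0   # assumed to be a relative improvement

improved_pct = apply_relative_gain(BASELINE_PCT, RELATIVE_GAIN)

print(f"baseline: {BASELINE_PCT:.1f}% ~ {correct_tasks(BASELINE_PCT, N_TASKS)} of {N_TASKS} tasks")
print(f"improved: {improved_pct:.1f}% ~ {correct_tasks(improved_pct, N_TASKS)} of {N_TASKS} tasks")
```

Under this reading, the baseline corresponds to about 24 of 112 papers assessed correctly and the improved agent to about 41 of 112, or roughly 36.6% accuracy.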
Abstract
Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench…
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Research Data Management Practices
