ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Bang Nguyen; Dominik Soós; Qian Ma; Rochana R. Obadage; Zack Ranjan; Sai Koneru; Anna Szabelska; Adam Gill; Timothy M. Errington; Shakhlo Nematova; Sarah Rajtmajer; Jian Wu; Meng Jiang

arXiv:2602.11354·cs.AI·April 13, 2026

TL;DR

ReplicatorBench is an end-to-end benchmark for evaluating AI agents' ability to replicate social and behavioral science research, covering data retrieval, experiment design, and result interpretation.

Contribution

The paper introduces a benchmark of human-verified replicable and non-replicable research claims, along with a framework for assessing AI agents in real-world research replication tasks.

Findings

01 LLM agents effectively design and execute experiments

02 Agents struggle with retrieving new data for replication

03 Code and data are publicly available for further research

Abstract

The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research…

Code & Models

Repositories

CenterForOpenScience/llm-benchmarking (GitHub)