ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

Christine Ye; Sihan Yuan; Suchetha Cooray; Steven Dillmann; Ian L. V. Roque; Dalya Baron; Philipp Frank; Sergio Martin-Alvarez; Nolan Koblischke; Frank J Qu; Diyi Yang; Risa Wechsler; Ioana Ciuca

arXiv:2510.24591·cs.CL·November 25, 2025

TL;DR

ReplicationBench is a new evaluation framework that tests whether AI agents can accurately replicate astrophysics research papers, revealing significant challenges and failure modes for current models in scientific tasks.

Contribution

It introduces the first benchmark for assessing AI agents' ability to faithfully replicate entire scientific papers in astrophysics, with expert validation and detailed analysis.

Findings

1. Current models score under 20% on replication tasks
2. Replicating scientific papers is extremely challenging for AI agents
3. The framework uncovers diverse failure modes in AI scientific research

Abstract

Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper uses peer-reviewed astrophysics papers with real datasets and collaborates with the original authors to generate tasks.
- Qualitative failure analysis with human experts.
- Thorough evaluation: the experimental section reports token and runtime efficiency, error bars, and results for five frontier LLMs.
- Focus on physics as a domain, going beyond ML.

Weaknesses

- Simple scaffold used for agent implementation: I would like to see a more task-specific scaffold with optimized prompt and tooling to better understand the absolute performance claims we can make from the benchmark.
- The benchmark only includes tasks from a small number of papers. Overall number of tasks is also low with only 107/165. This produces large error bars.
- Together with point 2: 20% of runs showed some signs of cheating/memorization by the agent of results in the paper. Adding mo
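The reviewer's point about task count and error bars can be illustrated with a rough calculation. The sketch below is not the paper's method: it assumes each task is an independent pass/fail trial, uses a hypothetical 20% pass rate (the "under 20%" figure from the findings, taken as an upper bound), and treats 107 and 165 as two candidate task counts. The function names are made up for this illustration.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate estimated from n independent pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def wilson_interval(pass_rate: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    denom = 1.0 + z * z / n_tasks
    center = (pass_rate + z * z / (2.0 * n_tasks)) / denom
    half_width = (
        z
        * math.sqrt(
            pass_rate * (1.0 - pass_rate) / n_tasks
            + z * z / (4.0 * n_tasks * n_tasks)
        )
        / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    assumed_pass_rate = 0.20  # hypothetical value, not a reported score
    for n_tasks in (107, 165):  # task counts quoted in the review
        se = binomial_standard_error(assumed_pass_rate, n_tasks)
        low, high = wilson_interval(assumed_pass_rate, n_tasks)
        print(f"n={n_tasks}: SE={se:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

Under these assumptions the standard error is several percentage points at either task count. Tasks drawn from the same paper are likely correlated, so the independence assumption probably understates the real uncertainty.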

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. Using actual astrophysics papers with real data and non-trivial tooling evaluates the natural scientific capabilities of an agent, unlike SWE-Bench or MLE-Bench.
2. Objectively gradable numeric and coding tasks make evaluation cheaper and reproducible.
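To make "objectively gradable numeric tasks" concrete, here is a minimal sketch of tolerance-based grading. It is an assumption about how such a check could work, not ReplicationBench's actual grader: the function name `grade_numeric`, the default tolerance, and the support for nested lists are all hypothetical.

```python
import math


def grade_numeric(predicted, expected, rel_tol: float = 0.05, abs_tol: float = 0.0) -> bool:
    """Return True if `predicted` matches `expected` within tolerance.

    Scalars are compared with math.isclose; lists and tuples are compared
    element by element and must have the same length. Anything else fails.
    """
    if isinstance(expected, (int, float)) and not isinstance(expected, bool):
        if not isinstance(predicted, (int, float)) or isinstance(predicted, bool):
            return False
        # math.isclose returns False when either value is NaN.
        return math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(expected, (list, tuple)):
        if not isinstance(predicted, (list, tuple)) or len(predicted) != len(expected):
            return False
        return all(
            grade_numeric(p, e, rel_tol=rel_tol, abs_tol=abs_tol)
            for p, e in zip(predicted, expected)
        )
    return False


if __name__ == "__main__":
    # Hypothetical task: the expected value is 0.70, graded at 1% relative tolerance.
    print(grade_numeric(0.702, 0.70, rel_tol=0.01))
    print(grade_numeric(0.75, 0.70, rel_tol=0.01))
```

A check like this is cheap and deterministic, which is the property the reviewer highlights. The per-task tolerance would have to be set by someone who knows how much numerical variation a correct replication can show.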

Weaknesses

1. The paper sort of positions ReplicationBench as the "first" at paper-scale replication, but works like PaperBench and DiscoveryBench already exist to evaluate agent capabilities across multiple domains. This paper is, in practice, an instance of that paradigm specialized to astrophysics.
2. A conference/journal for astrophysics could be more suitable for this paper. It is not clear what the insight is here beyond that longer horizon, real-domain scientific tasks are hard for agents. This result has bee

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper ensures the proposed benchmark's real-world usefulness by grounding the tasks on published papers and working with their expert authors to create the tasks and analyze LLM performance qualitatively.
2. The tasks in this benchmark take several important factors into consideration, such as coverage, objectivity, and guess-proofing.

Weaknesses

1. A substantive assessment of the weaknesses of the paper. Focus on constructive and actionable insights on how the work could improve towards its stated goals. Be specific, avoid generic remarks. For example, if you believe the contribution lacks novelty, provide references and an explanation as evidence; if you believe experiments are insufficient, explain why and exactly what is missing, etc.
2. This benchmark focuses solely on astrophysics tasks. However, the paper does not clearly position

Code & Models

Datasets

ChristineYe8/ReplicationBench
dataset· 21 dl
