ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca

TL;DR
ReplicationBench is a new evaluation framework that tests whether AI agents can accurately replicate astrophysics research papers, revealing significant challenges and failure modes for current models in scientific tasks.
Contribution
It introduces the first benchmark for assessing AI agents' ability to faithfully replicate entire scientific papers in astrophysics, with expert validation and detailed analysis.
Findings
Current models score under 20% on replication tasks
Replicating scientific papers is extremely challenging for AI agents
The framework uncovers diverse failure modes in AI scientific research
Abstract
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the…
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper uses peer-reviewed astrophysics papers with real datasets and collaborates with the authors to generate tasks - Qualitative failure analysis with human experts - Thorough evaluations in the experimental section report token runtime efficiency, error bars, and five frontier LLMs - Focus on physics as a domain, going beyond ML.
- Simple scaffold used for agent implementation: I would like to see a more task-specific scaffold with optimized prompt and tooling to better understand the absolute performance claims we can make from the benchmark. - The benchmark only includes tasks from a small number of papers. Overall number of tasks is also low with only 107/165. This produces large error bars. - Together with point 2: 20% of runs showed some signs of cheating/memorization by the agent of results in the paper. Adding mo
1. Using actual astrophysics papers with real data and non-trivial tooling evaluates the natural scientific capabilities of an agent, unlike SWE-Bench or MLE-Bench. 2. Objectively gradable numeric and coding tasks make evaluation cheaper and reproducible.
1. The paper sort of positions ReplicationBench as the "first" at paper-scale replication, but works like PaperBench and DiscoveryBench already exist to evaluate agent capabilities across multiple domains. This paper is, in practice, an instance of that paradigm specialized to astrophysics. 2. A conference/journal for astrophysics could be more suitable for this paper. It is not clear what the insight is here beyond that longer horizon, real-domain scientific tasks are hard for agents. This result has bee
1. The paper ensures the proposed benchmark’s real-world usefulness by grounding the tasks on published papers and working with their expert authors to create the tasks and analyze LLM performance qualitatively. 2. The tasks in this benchmark take several important factors into consideration, such as coverage, objectivity, and guess-proofing.
1. This benchmark focuses solely on astrophysics tasks. However, the paper does not clearly position
