ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

Kevin Riehl; Andres L. Marin; Nikofors Zacharof; Fan Wu; Patrick Langer; Robert Jakob; Anastasios Kouvelas; Georgios Fontaras; Michail A. Makridis

arXiv:2605.02651·cs.DL·May 18, 2026

TL;DR

ARA formalizes reproducibility evaluation as a structured reasoning task, enabling scalable peer review by reconstructing workflows from scientific papers using large language models.

Contribution

ARA introduces a framework that automates reproducibility assessment through workflow graph extraction and evaluation, demonstrating high accuracy across multiple benchmarks.

Findings

1. Achieves approximately 61% accuracy on three benchmarks.

2. Outperforms existing methods on ReproBench and GoldStandardDB.

3. Demonstrates generalizability across scientific domains and LLM configurations.

Abstract

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles - the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date - demonstrate ARA's generalizability and consistent workflow…
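
To make the workflow-graph idea in the abstract concrete, here is a minimal sketch of a directed graph whose nodes are typed as sources, methods, experiments, or outputs, together with one toy structural score. Everything in it is an assumption for illustration: the `WorkflowGraph` class, the `traceable_output_fraction` function, and the node names are hypothetical and are not taken from the paper or its repository, and the score shown is not ARA's actual metric.

```python
from dataclasses import dataclass, field

# The four node types named in the abstract.
NODE_KINDS = ("source", "method", "experiment", "output")


@dataclass
class WorkflowGraph:
    """Hypothetical container for a directed workflow graph."""
    kinds: dict = field(default_factory=dict)   # node name -> node kind
    edges: dict = field(default_factory=dict)   # node name -> set of successors

    def add_node(self, name: str, kind: str) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind
        self.edges.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].add(dst)

    def reachable_from_sources(self) -> set:
        """Return every node reachable from at least one source node."""
        stack = [n for n, k in self.kinds.items() if k == "source"]
        seen = set(stack)
        while stack:
            node = stack.pop()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


def traceable_output_fraction(graph: WorkflowGraph) -> float:
    """Toy structural score (an assumption, not ARA's metric):
    the fraction of output nodes that can be traced back to a source."""
    outputs = [n for n, k in graph.kinds.items() if k == "output"]
    if not outputs:
        return 0.0
    reachable = graph.reachable_from_sources()
    return sum(1 for o in outputs if o in reachable) / len(outputs)


# Hypothetical paper: one output is fully described, one is not.
graph = WorkflowGraph()
graph.add_node("dataset", "source")
graph.add_node("model", "method")
graph.add_node("training_run", "experiment")
graph.add_node("accuracy_table", "output")
graph.add_node("figure_3", "output")   # no described path leads here
graph.add_edge("dataset", "training_run")
graph.add_edge("model", "training_run")
graph.add_edge("training_run", "accuracy_table")

score = traceable_output_fraction(graph)
print(score)  # 0.5
```

In this toy example, `figure_3` has no incoming edge, so only one of the two outputs can be traced to a data source and the score is 0.5. A content-based score, as mentioned in the abstract, would additionally have to judge how completely each node is described, which this sketch does not attempt.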

Code & Models

Repositories

AndresLaverdeMarin/agentic_reproducibility_assessment (GitHub)