TL;DR
ARA formalizes reproducibility evaluation as a structured reasoning task, enabling scalable peer review by reconstructing workflows from scientific papers using large language models.
Contribution
ARA introduces a framework that automates reproducibility assessment by extracting a workflow graph from a paper and evaluating it, and it reports approximately 61% accuracy across three benchmarks.
Findings
Achieves approximately 61% accuracy on three benchmarks.
Outperforms existing methods on ReproBench and GoldStandardDB.
Demonstrates generalizability across scientific domains and LLM configurations.
Abstract
Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates the graph's reconstructability using structural and content-based scores. Experiments on 213 ReScience C articles, the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date, demonstrate ARA's generalizability and consistent workflow…
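The workflow-graph idea in the abstract can be illustrated with a small sketch. This is not ARA's implementation: the class `WorkflowGraph`, the function `structural_score`, and the scoring rule (the fraction of output nodes reachable from at least one source node) are assumptions made here for illustration only. The abstract does not say how the structural or content-based scores are computed, and the node names in the example are invented.

```python
from collections import deque
from dataclasses import dataclass, field

# Node kinds named in the abstract; everything else here is hypothetical.
KINDS = {"source", "method", "experiment", "output"}


@dataclass
class WorkflowGraph:
    """Directed graph of workflow steps (illustrative, not ARA's data model)."""
    kinds: dict = field(default_factory=dict)   # node name -> kind
    edges: dict = field(default_factory=dict)   # node name -> set of successors

    def add_node(self, name: str, kind: str) -> None:
        if kind not in KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind
        self.edges.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].add(dst)


def reachable_from_sources(graph: WorkflowGraph) -> set:
    """Breadth-first search starting from every source node."""
    seen = {n for n, k in graph.kinds.items() if k == "source"}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in graph.edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def structural_score(graph: WorkflowGraph) -> float:
    """Assumed scoring rule: share of outputs traceable back to a source."""
    outputs = [n for n, k in graph.kinds.items() if k == "output"]
    if not outputs:
        return 0.0
    seen = reachable_from_sources(graph)
    return sum(n in seen for n in outputs) / len(outputs)


# Invented example: one output has a documented path, the other does not.
example = WorkflowGraph()
example.add_node("dataset", "source")
example.add_node("training", "method")
example.add_node("ablation", "experiment")
example.add_node("table_1", "output")
example.add_node("figure_2", "output")  # no incoming edge in the paper
example.add_edge("dataset", "training")
example.add_edge("training", "ablation")
example.add_edge("ablation", "table_1")

print(structural_score(example))
```

In this toy graph only `table_1` can be traced back to a source, so the assumed score is 0.5. A content-based score would additionally need the text attached to each node, which this sketch leaves out.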
