MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
Sasi Kiran Gaddipati, Diyana Muhammed, Farhana Keya, Gollam Rabby, Sören Auer

TL;DR
MLReplicate is a comprehensive benchmark for evaluating autonomous research systems in machine learning, revealing significant gaps in scientific rigor and reproducibility despite advances in automation.
Contribution
This paper introduces MLReplicate, the first standardized, end-to-end benchmark for assessing the reproducibility and scientific validity of autonomous research systems.
Findings
Automated reviews accepted 10 out of 37 submissions.
Human reviewers identified flaws and unsupported claims in all systems.
Cost and token usage do not predict output quality.
Abstract
Autonomous research systems capable of generating complete scientific manuscripts have advanced rapidly, yet robust and realistic evaluation frameworks have failed to keep pace. To bridge this gap, we introduce MLReplicate, an end-to-end benchmark evaluating autonomous research systems on machine learning reproducibility. The benchmark was constructed from ICML 2025 outstanding papers reformulated into standardized input specifications and evaluated across 6 state-of-the-art research systems: AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, and TINY SCIENTIST, yielding 45 generated manuscripts, with 3 failed experiments. Outputs are assessed using a dual-protocol approach that combines automated conference-style review and structured expert human evaluation, while tracking computational cost, runtime, and the amount of required human intervention. The…
