Can Coding Agents Reproduce Findings in Computational Materials Science?

Ziyang Huang; Yi Cao; Ali K. Shargh; Jing Luo; Ruidong Mei; Mohd Zaki; Zhan Liu; Wyatt Bunstine; William Jurayj; Somdatta Goswami; Tyrel McQueen; Michael Shields; Jaafar El-Awady; Paulette Clancy; Benjamin Van Durme; Nicholas Andrews; William Walden; Daniel Khashabi

arXiv:2605.00803 · cs.SE · May 4, 2026

TL;DR

AutoMat is a benchmark that evaluates whether LLM-based coding agents can reproduce scientific claims in computational materials science. The best agent reaches a success rate of 54.1%, which points to significant limitations in current systems.

Contribution

The paper introduces AutoMat, a benchmark of claims curated with subject matter experts from real materials science papers, for assessing LLM-based agents' ability to reproduce those claims in computational materials science workflows.

Findings

01. Best agent success rate is 54.1% on AutoMat.

02. Agents struggle with incomplete procedures and methodological deviations.

03. AutoMat serves as both a benchmark and diagnostic tool for AI in scientific reproducibility.

Abstract

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers…
