AInsteinBench: Benchmarking Coding Agents on Scientific Repositories

Titouan Duston; Shuo Xin; Yang Sun; Daoguang Zan; Aoyan Li; Shulin Xin; Kai Shen; Yixiao Chen; Qiming Sun; Ge Zhang; Jiashuo Liu; Huan Zhou; Jingkai Liu; Zhichen Pu; Yuanheng Wang; Bo-Xuan Ge; Xin Tong; Fei Ye; Zhi-Chao Zhao; Wen-Biao Han; Zhoujian Cao; Yueran Zhao; Weiluo Ren; Qingshen Long; Yuxiao Liu; Anni Huang; Yidi Du; Yuanyuan Rong; Jiahao Peng

arXiv:2512.21373·cs.SE·December 29, 2025



Open Access

TL;DR

AInsteinBench is a comprehensive benchmark designed to evaluate large language models' ability to perform scientific coding tasks within real research software ecosystems, emphasizing end-to-end scientific development capabilities.

Contribution

The paper introduces a large-scale benchmark built from real scientific repositories. It targets scientific development tasks rather than generic coding or reasoning, and its tasks pass through rigorous filtering and expert review.

Findings

1. Models are tested in executable environments to assess real-world scientific coding skills.

2. Benchmark tasks cover diverse scientific domains including quantum chemistry and fluid dynamics.

3. Evaluation highlights the gap between surface-level code generation and core scientific research competencies.

Abstract

We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks that focus on conceptual knowledge, or software engineering benchmarks that emphasize generic feature implementation and issue resolution, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All benchmark tasks are carefully curated through multi-stage filtering and expert review to ensure scientific challenge,…

Taxonomy

Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Topic Modeling