PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu; Junyi Deng; Yiwei Deng; Haoran Dong; Jieyu Fu; Mao Li; Zeyu Li; Zhaolong Zhang; Huiwen Zheng; Leidong Bao; Anqi Lv; Zihan Mo; Yadi Niu; Yiyang Peng; Yu Tian; Yili Wang; Ziyu Wang; Zi-Yu Wang; Jiashen Wei; Liuheng Wu; Aoran Xue; Leyi Yang; Guanglu Yuan; Xiarui Zhan; Jingjun Zhang; Zifan Zheng; Pengfei Liu; Linrui Zhen; Kaiyang Li; Qichang Li; Ziheng Zhou; Guo-En Nian; Yunwei Xiao; Qing-Hong Cao; Linjie Dai; Xu Feng; Peng Gao; Ying Gu; Chang Liu; Jia Liu; Ming-xing Luo; Yan-Qing Ma; Liang-You Peng; Huichao Song; Shufeng Wang; Chenxu Wang; Tao Wang; Yi-Nan Wang; Chengyin Wu; Pengwei Zhao; and Hua Xing Zhu

arXiv:2603.27646·cs.CL·March 31, 2026

TL;DR

PRBench is a benchmark that tests whether AI agents can understand the methodology of a physics research paper, implement it, and reproduce its results, showing both what current agents can do and where they fall short.

Contribution

Introduces PRBench, a benchmark of 30 physics tasks, each grounded in a real published paper, for evaluating AI agents' end-to-end scientific reproduction abilities.

Findings

01. OpenAI Codex with GPT-5.3-Codex scores 34% overall.

02. Agents struggle with data accuracy and code correctness.

03. Systematic errors include incorrect formula implementation and data fabrication.

Abstract

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published…

Code & Models

Repositories

stephenqsstarthomas/PRBench-Eval-Handson (GitHub)
