CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Weida Wang; Dongchen Huang; Jiatong Li; Tengchao Yang; Ziyang Zheng; Di Zhang; Dong Han; Benteng Chen; Binzhao Luo; Zhiyu Liu; Kunling Liu; Zhiyuan Gao; Shiqi Geng; Wei Ma; Jiaming Su; Xin Li; Shuchen Pu; Yuhan Shui; Qianjia Cheng; Zhihao Dou; Dongfei Cui; Changyong He; Jin Zeng; Zeke Xie; Mao Su; Dongzhan Zhou; Yuqiang Li; Wanli Ouyang; Yunqi Cai; Xi Dai; Shufei Zhang; Lei Bai; Jinguang Cheng; Zhong Fang; Hongming Weng

arXiv:2508.18124·cs.LG·September 1, 2025

TL;DR

CMPhysBench is a comprehensive benchmark with over 520 graduate-level questions designed to evaluate large language models' proficiency in condensed matter physics, revealing significant performance gaps in current models.

Contribution

This paper introduces CMPhysBench, a novel benchmark with a new scoring metric, to assess LLMs in condensed matter physics, highlighting their current limitations.

Findings

01

Grok-4 achieves a SEED score of only 36 and an accuracy of only 28%

02

Current LLMs show significant gaps in condensed matter physics understanding

03

SEED score provides nuanced evaluation of model predictions

Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground-truth. Our results show that even the best…
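To make the idea of tree-based, non-binary partial credit concrete, the following is a minimal sketch and not the authors' SEED implementation. It assumes answers are written in Python expression syntax (real benchmark answers are more likely LaTeX), uses a simplified top-down ordered tree edit distance with unit costs, and normalizes by the larger tree size. The function names `to_tree`, `size`, `distance`, and `similarity_score` are hypothetical and chosen only for illustration.

```python
import ast


def to_tree(expr):
    """Parse a Python-syntax expression into a (label, children) tree."""
    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, [convert(node.left), convert(node.right)])
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, [convert(node.operand)])
        if isinstance(node, ast.Call):
            return (ast.unparse(node.func), [convert(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            return (node.id, [])
        if isinstance(node, ast.Constant):
            return (repr(node.value), [])
        raise ValueError(f"unsupported syntax: {type(node).__name__}")
    return convert(ast.parse(expr, mode="eval").body)


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def distance(a, b):
    """Simplified top-down ordered tree edit distance with unit costs.

    Roots are matched (cost 1 if labels differ); child sequences are
    aligned with a standard edit-distance table in which deleting or
    inserting a child costs the size of its whole subtree.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                     # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                     # insert subtree
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # match subtrees
            )
    return relabel + dp[-1][-1]


def similarity_score(prediction, ground_truth):
    """Map the edit distance to a 0-100 score (illustrative normalization)."""
    p, t = to_tree(prediction), to_tree(ground_truth)
    return 100.0 * max(0.0, 1.0 - distance(p, t) / max(size(p), size(t)))


# A single sign error keeps most of the credit instead of scoring zero.
print(similarity_score("a*x**2 + b", "a*x**2 - b"))
```

Under these assumptions, a prediction that differs from the reference by one operator loses only the share of credit corresponding to one node, whereas an exact-match metric would score it as entirely wrong. The sketch compares syntax trees only: mathematically equivalent forms such as `a + b` and `b + a` are treated as different, so a practical metric would also need to canonicalize or simplify expressions first.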

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. High Problem Difficulty: Focuses exclusively on graduate-level material, comprising more than 520 meticulously curated questions that require LLMs to generate complete, step-by-step solutions for complex calculation problems. This moves beyond the limitations of high school or undergraduate benchmarks, demanding advanced mathematical rigor and conceptual understanding.
2. Expert-Aligned Metric: The proposed Scalable Expression Edit Distance (SEED) score provides highly accurate, fine-grained

Weaknesses

1. It would be better if the authors discussed how the LLM issues identified in the analysis could be mitigated in future research. Currently, the analyses only show that LLMs make multiple types of errors, and it remains unclear how to improve LLMs to avoid them. Proposing potential solutions for the identified errors would further strengthen the paper's contribution.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper's main strength is tackling a new, hard domain: graduate-level condensed matter physics. Most benchmarks are easier, so this is a needed step up. The SEED metric is also a big plus; it's a smart way to give partial credit on complex math answers instead of just right/wrong. This metric seems useful for other science benchmarks too. The testing of 18 models is thorough, and the error analysis in Figure 6 gives a good breakdown of why models fail, with "Concept and Model Misuse" being th

Weaknesses

The main weakness I see is in the error analysis. The authors used GPT-4o to categorize all the model mistakes. While this is fast, it's not clear how accurate GPT-4o is at this task. It would be better if they had human experts check a sample of these to confirm the error breakdown. Also, the SEED score focuses on the final boxed answer. The prompt asks for step-by-step solutions, but it's not clear if the steps themselves are evaluated. A model could get the right answer with the wrong steps.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Scope and subject: Graduate-level benchmark based on standard graduate textbooks, requiring complex step-by-step solutions across diverse answer types. Specifically, CMPhysBench is not a multiple-choice (MC/QA) benchmark, which makes it much more difficult.
2. Diversity and coverage: Clear balance between categories, and clear explanation and validation of the source material from which the benchmark is derived. The authors also perform strong analysis on failure modes, which are possibly actionable and of interest to

Weaknesses

1. Relevance of SEED as a benchmark evaluation metric versus actual accuracy. The goal seems to be to reward partial correctness (which is understandable from an RL or intermediate reward feedback perspective). In practice, however, does SEED properly weight cases where LMs make minor incorrect reasoning steps, or does it only give partial credit when LMs fail to decode a correct final answer? Some more discussion on this would be helpful. Related, from model thinking trajectories, how wel

Code & Models

Datasets

weidawang/CMPhysBench
dataset· 104 dl

Taxonomy

Topics: Machine Learning in Materials Science · Quantum many-body systems · Artificial Intelligence in Healthcare and Education