CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng

TL;DR
CMPhysBench is a comprehensive benchmark with over 520 graduate-level questions designed to evaluate large language models' proficiency in condensed matter physics, revealing significant performance gaps in current models.
Contribution
This paper introduces CMPhysBench, a novel benchmark with a new scoring metric, to assess LLMs in condensed matter physics, highlighting their current limitations.
Findings
Grok-4 achieves a SEED score of only 36 and an accuracy of only 28%
Current LLMs show significant gaps in condensed matter physics understanding
SEED score provides nuanced evaluation of model predictions
Abstract
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best…
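To make the idea of tree-based partial credit concrete, here is a minimal sketch. It is not the authors' SEED implementation: the abstract does not give the formula, so the parser (`to_tree`), the simplified top-down edit distance (`distance`), and the normalization in `partial_credit` are all assumptions chosen for illustration. Expressions are written in Python syntax and parsed with the standard `ast` module rather than from LaTeX.

```python
import ast
from functools import lru_cache


def to_tree(expr: str):
    """Parse a Python-syntax expression into a nested (label, children) tuple."""
    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, (convert(node.left), convert(node.right)))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, (convert(node.operand),))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id, tuple(convert(arg) for arg in node.args))
        if isinstance(node, ast.Name):
            return (node.id, ())
        if isinstance(node, ast.Constant):
            return (repr(node.value), ())
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return convert(ast.parse(expr, mode="eval").body)


def size(tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def distance(a, b) -> int:
    """Simplified top-down ordered tree edit distance (an assumption, not SEED itself).

    Roots are matched (cost 1 if labels differ), then the child sequences are
    aligned; inserting or deleting a child costs the size of its whole subtree.
    """
    relabel = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    dp = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i in range(1, len(kids_a) + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, len(kids_b) + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, len(kids_a) + 1):
        for j in range(1, len(kids_b) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),                    # delete subtree
                dp[i][j - 1] + size(kids_b[j - 1]),                    # insert subtree
                dp[i - 1][j - 1] + distance(kids_a[i - 1], kids_b[j - 1]),  # match
            )
    return relabel + dp[-1][-1]


def partial_credit(prediction: str, truth: str) -> float:
    """Hypothetical score in [0, 1]: 1 minus edit distance over ground-truth size."""
    tree_pred, tree_truth = to_tree(prediction), to_tree(truth)
    return max(0.0, 1.0 - distance(tree_pred, tree_truth) / size(tree_truth))


if __name__ == "__main__":
    print(partial_credit("a*b + c", "a*b + c"))  # identical expressions
    print(partial_credit("a*b - c", "a*b + c"))  # one wrong operator
    print(partial_credit("x", "a*b + c"))        # unrelated answer
```

Under these assumptions, a prediction that differs from the ground truth by a single operator still earns most of the credit, whereas a binary exact-match check would score it zero. A faithful comparison would also need to treat mathematically equivalent forms (for example, reordered commutative terms) as equal, which this sketch does not attempt.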
Peer Reviews
Decision · ICLR 2026 Poster
1. High Problem Difficulty: Focuses exclusively on graduate-level material, comprising more than 520 meticulously curated questions that require LLMs to generate complete, step-by-step solutions for complex calculation problems. This moves beyond the limitations of high school or undergraduate benchmarks, demanding advanced mathematical rigor and conceptual understanding.
2. Expert-Aligned Metric: The proposed Scalable Expression Edit Distance (SEED) score provides highly accurate, fine-grained
1. It would be better if the authors discussed how the LLM issues identified in the analysis could be mitigated in future research. Currently, the analyses only show that LLMs make multiple types of errors; it remains unclear how to improve LLMs to avoid them. Proposing potential solutions for the identified errors would further strengthen the paper's contribution.
The paper's main strength is tackling a new, hard domain: graduate-level condensed matter physics. Most benchmarks are easier, so this is a needed step up. The SEED metric is also a big plus; it's a smart way to give partial credit on complex math answers instead of just right/wrong. This metric seems useful for other science benchmarks too. The testing of 18 models is thorough, and the error analysis in Figure 6 gives a good breakdown of why models fail, with "Concept and Model Misuse" being th
The main weakness I see is in the error analysis. The authors used GPT-4o to categorize all the model mistakes. While this is fast, it's not clear how accurate GPT-4o is at this task. It would be better if they had human experts check a sample of these to confirm the error breakdown. Also, the SEED score focuses on the final boxed answer. The prompt asks for step-by-step solutions, but it's not clear if the steps themselves are evaluated. A model could get the right answer with the wrong steps.
1. Scope and subject: A graduate-level benchmark based on standard graduate textbooks, requiring complex step-by-step solutions across diverse answer types. Specifically, CMPhysBench is not a multiple-choice (MC) or QA benchmark, which makes it much more difficult.
2. Diversity and coverage: Clear balance between categories, and clear explanation and validation of the source material from which the benchmark is derived. The authors also perform strong analysis on failure modes, which are possibly actionable and of interest to
1. Relevance of SEED as a benchmark evaluation metric versus actual accuracy. The goal seems to be to reward partial correctness (which is understandable from an RL or intermediate reward feedback perspective). In practice, however, does SEED properly weight cases where LMs make minor incorrect reasoning steps, or does it only give partial credit when LMs fail to decode a correct final answer? Some more discussion on this would be helpful. Relatedly, from model thinking trajectories, how wel
Taxonomy
Topics: Machine Learning in Materials Science · Quantum many-body systems · Artificial Intelligence in Healthcare and Education
