Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks
S. K. Rithvik

TL;DR
This study systematically evaluates large language models on quantum mechanics tasks, revealing performance stratified by capability tier, distinct task difficulty patterns, and task-dependent effects of tool augmentation, and providing a benchmark along with reproducibility insights.
Contribution
It introduces a quantum mechanics benchmark for LLMs, analyzes performance hierarchies, assesses tool augmentation effects, and characterizes reproducibility across models.
Findings
Flagship models achieve 81% average accuracy, ahead of mid-tier (77%) and fast (67%) models.
Derivations show the highest performance (92% average), while numerical computation is the most challenging (42%).
Tool augmentation on numerical tasks yields a modest overall improvement (+4.4pp) at 3x token cost, with effects that vary widely by task.
Abstract
We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast models (67%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92% average, 100% for flagship models), while numerical computation remains most challenging (42%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity…
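The tier gaps quoted in the abstract are percentage-point (pp) differences, not relative improvements. The short sketch below recomputes them from the stated averages. The variable names are illustrative and not taken from the paper, and reading 900 baseline assessments as repeated runs per model-task pair is an assumption, since the abstract gives only the totals.

```python
# Tier-level average accuracies (percent), as quoted in the abstract.
tier_accuracy = {"flagship": 81, "mid-tier": 77, "fast": 67}


def pp_gap(a: float, b: float) -> float:
    """Percentage-point gap between two accuracies given in percent."""
    return a - b


# Gap of the flagship tier over each other tier, in percentage points.
gaps = {
    tier: pp_gap(tier_accuracy["flagship"], acc)
    for tier, acc in tier_accuracy.items()
    if tier != "flagship"
}

# Relative gap (fraction of the lower tier's accuracy); this is a
# different quantity from the percentage-point gap above.
relative_gaps = {tier: gap / tier_accuracy[tier] for tier, gap in gaps.items()}

# 15 models x 20 tasks gives the number of model-task pairs.
model_task_pairs = 15 * 20

# Assumption: the 900 baseline assessments are spread evenly over pairs.
# The abstract does not say how the 900 assessments are distributed.
baseline_per_pair = 900 / model_task_pairs

print(gaps)
print(model_task_pairs, baseline_per_pair)
```

The percentage-point gaps match the 4pp and 14pp stated in the abstract, while the relative gaps are larger because they are normalized by the lower tier's accuracy.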
Taxonomy
Topics: Machine Learning in Materials Science · Quantum many-body systems · Quantum Computing Algorithms and Architecture
