Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks

S. K. Rithvik

arXiv:2602.19006·cs.AI·February 24, 2026

Open Access

TL;DR

This study systematically evaluates large language models on quantum mechanics tasks, revealing tier-based performance, task difficulty patterns, and the effects of tool augmentation, providing a comprehensive benchmark and reproducibility insights.

Contribution

It introduces a quantum mechanics benchmark for LLMs, analyzes performance hierarchies, assesses tool augmentation effects, and characterizes reproducibility across models.

Findings

01

Flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast (67%) models.

02

Derivations show the highest performance, while numerical tasks are the most challenging.

03

Tool augmentation yields heterogeneous effects, with some tasks improving significantly.

Abstract

We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast models (67%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92% average, 100% for flagship models), while numerical computation remains most challenging (42%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity…
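The abstract's headline arithmetic can be checked with a short sketch. It uses only the figures quoted above; the variable names are illustrative, and the assumption that every model attempts every task equally often is ours, not something the abstract states.

```python
# Average accuracy per capability tier, in percent, as quoted in the abstract.
tier_accuracy = {"flagship": 81, "mid-tier": 77, "fast": 67}


def gap_pp(a: float, b: float) -> float:
    """Percentage-point gap: the plain difference of two percentages."""
    return a - b


def gap_relative(a: float, b: float) -> float:
    """Relative gap of a over b, in percent (not the same as a pp gap)."""
    return 100.0 * (a - b) / b


flagship_vs_mid = gap_pp(tier_accuracy["flagship"], tier_accuracy["mid-tier"])
flagship_vs_fast = gap_pp(tier_accuracy["flagship"], tier_accuracy["fast"])
flagship_vs_fast_rel = gap_relative(tier_accuracy["flagship"], tier_accuracy["fast"])

# Assessment counts as quoted in the abstract.
n_models, n_tasks = 15, 20
n_baseline, n_tool = 900, 75

# Assumption (ours): assessments are spread evenly over models and tasks.
baseline_per_model_task = n_baseline / (n_models * n_tasks)
tool_per_model = n_tool / n_models

print(f"flagship vs mid-tier: {flagship_vs_mid} pp")
print(f"flagship vs fast: {flagship_vs_fast} pp ({flagship_vs_fast_rel:.1f}% relative)")
print(f"baseline assessments per model-task pair: {baseline_per_model_task}")
print(f"tool-augmented assessments per model: {tool_per_model}")
```

Under the even-split assumption, the stated counts work out to three baseline assessments per model-task pair and five tool-augmented assessments per model. The relative gap is shown only to contrast it with the percentage-point gap the abstract reports.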

Taxonomy

Topics: Machine Learning in Materials Science · Quantum many-body systems · Quantum Computing Algorithms and Architecture