Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics

Daniel J.H. Chung; Zhiqi Gao; Yurii Kvasiuk; Tianyi Li; Moritz Münchmeyer; Maja Rudolph; Frederic Sala; Sai Chaitanya Tadepalli

arXiv:2502.15815·cs.LG·February 25, 2025

TL;DR

This paper introduces TPBench, a new benchmark dataset of 57 theoretical physics problems designed to evaluate AI reasoning in high-energy physics and cosmology, highlighting current limitations and future potential.

Contribution

It presents a novel dataset of 57 physics problems ranging from undergraduate to research level, evaluates multiple open and closed language models on it, and discusses challenges and strategies for AI to assist in theoretical physics research.

Findings

01. Recent models show impressive progress but still struggle with research-level problems.

02. Most research-level problems remain unsolved by current AI models.

03. The paper addresses the challenges of auto-verifiability and grading, and discusses common failure modes.

Abstract

We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While current state-of-the-art models are still of limited use for researchers, our results show that AI-assisted theoretical physics research may become…

Code & Models

Datasets

ZhiqiGao/TPBench
dataset · 47 downloads

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)

Methods: Sparse Evolutionary Training · LLaMA