Towards a Large Physics Benchmark
Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, and Merijn Moody

TL;DR
This paper presents a comprehensive benchmark framework for evaluating large language models in fundamental physics, incorporating diverse question types and expert scoring to guide AI development in scientific research.
Contribution
It introduces a novel, community-driven physics benchmark with a multi-faceted scoring system and a living dataset to advance AI capabilities in physics understanding and problem solving.
Findings
Developed a diverse set of physics questions including conceptual, analytical, and open-ended types.
Implemented a scoring system based on correctness, difficulty, and surprise evaluated by experts.
Launched a living benchmark platform for continuous community contributions and updates.
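As a minimal sketch of how the three question types and the expert scoring above might be represented in code: all class names, field names, and the 0-to-1 score range are illustrative assumptions, not the paper's actual schema or scale.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    """The three question forms named in the summary."""
    MULTIPLE_CHOICE = "multiple_choice"  # conceptual understanding
    ANALYTICAL = "analytical"            # mathematical derivation
    OPEN_ENDED = "open_ended"            # complex problem solving


@dataclass(frozen=True)
class ExpertScore:
    """Expert-assigned scores; the [0, 1] range is a hypothetical choice."""
    correctness: float
    difficulty: float
    surprise: float

    def __post_init__(self):
        # Reject values outside the assumed range.
        for name in ("correctness", "difficulty", "surprise"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


@dataclass(frozen=True)
class BenchmarkQuestion:
    """One benchmark entry: question text, its form, and its expert score."""
    text: str
    kind: QuestionType
    score: ExpertScore
```

Keeping the three scores as separate fields, rather than collapsing them into one number, leaves any aggregation rule open, since the summary does not state one.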
Abstract
We introduce a benchmark framework developed by and for the scientific community to evaluate, monitor, and steer large language model development in fundamental physics. Building on philosophical concepts of scientific understanding and creativity, we develop a scoring system in which each question is scored by an expert for its correctness, difficulty, and surprise. The questions are of three forms: (i) multiple-choice questions for conceptual understanding, (ii) analytical problems requiring mathematical derivation, and (iii) open-ended tasks requiring complex problem solving. Our current dataset contains a diverse set of examples, including a machine learning challenge to classify high-energy physics events, such as the four top quark signal. To ensure continued relevance, we propose a living benchmark, where physicists contribute questions, for instance alongside new publications. We…
