Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment

Paul Tschisgale; Holger Maus; Fabian Kieser; Ben Kroehs; Stefan Petersen; Peter Wulff

arXiv:2505.09438·physics.ed-ph·July 2, 2025

TL;DR

This study compares GPT-4o and the reasoning-optimized o1-preview with participants of the German Physics Olympiad on a set of Olympiad problems, finds that both models outperform the human participants, and discusses the implications for educational assessment and integrity.

Contribution

The study provides a comparative analysis of the problem-solving abilities of GPT-4o and o1-preview on physics Olympiad problems, highlighting their strengths and limitations.

Findings

01

Both LLMs outperform human participants on Olympiad problems.

02

Prompting techniques have minimal impact on GPT-4o's performance.

03

The reasoning-optimized model nearly always outperforms both GPT-4o and humans.

Abstract

Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to…
