Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics

Ryan Mok, Faraaz Akhtar, Louis Clare, Christine Li, Jun Ida, Lewis Ross, and Mario Campanelli

arXiv:2411.13685·physics.ed-ph·December 1, 2025

TL;DR

This study evaluates the effectiveness of large language models in grading undergraduate physics assessments, highlighting their potential and limitations, and proposes a method to improve AI grading accuracy using mark schemes.

Contribution

The paper introduces an empirical procedure for assessing AI grading in physics, shows that providing mark schemes improves grading quality, and analyzes how performance differs by topic.

Findings

01 AI grading prone to errors and hallucinations

02 Providing mark schemes improves grading accuracy

03 Grading performance correlates with problem-solving ability

Abstract

Grading assessments is time-consuming and prone to human bias. Students may experience delays in receiving feedback, and that feedback may not be tailored to their expectations or needs. Harnessing AI in education can be effective for grading undergraduate physics problems, enhancing the efficiency of undergraduate-level physics learning and teaching, and helping students understand concepts with the help of a constantly available tutor. This report devises a simple empirical procedure to investigate and quantify how well large language model (LLM) based AI chatbots can grade solutions to undergraduate physics problems in Classical Mechanics, Electromagnetic Theory and Quantum Mechanics, comparing human grading against AI grading. The following LLMs were tested: Gemini 1.5 Pro, GPT-4, GPT-4o and Claude 3.5 Sonnet. The results show AI grading is prone to mathematical errors and hallucinations, which render…
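
As a rough illustration of what comparing human and AI grading could look like in practice, the sketch below computes two simple agreement measures between paired marks: the mean absolute difference and the Pearson correlation. This is not the authors' procedure; the function name `grading_agreement` and the mark values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def grading_agreement(human_marks, ai_marks):
    """Return (mean absolute difference, Pearson r) for paired marks.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(human_marks) != len(ai_marks) or len(human_marks) < 2:
        raise ValueError("need two equally long lists with at least 2 marks")

    # Average size of the disagreement, in marks.
    mad = mean(abs(h - a) for h, a in zip(human_marks, ai_marks))

    # Pearson correlation, computed from deviations about each mean.
    mean_h, mean_a = mean(human_marks), mean(ai_marks)
    cov = sum((h - mean_h) * (a - mean_a)
              for h, a in zip(human_marks, ai_marks))
    var_h = sum((h - mean_h) ** 2 for h in human_marks)
    var_a = sum((a - mean_a) ** 2 for a in ai_marks)
    if var_h == 0 or var_a == 0:
        raise ValueError("correlation is undefined when all marks are equal")
    r = cov / (var_h * var_a) ** 0.5
    return mad, r


# Made-up marks for five solutions, for illustration only.
human = [5, 3, 4, 2, 5]
ai = [4, 3, 5, 2, 4]
mad, r = grading_agreement(human, ai)
print(f"mean absolute difference: {mad:.2f}, Pearson r: {r:.2f}")
```

A low mean absolute difference together with a high correlation would suggest the AI marks track the human marks closely. A high correlation with a large mean difference would instead point to a systematic offset, such as consistently lenient grading.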

Code & Models

Repositories

faraazakhtar185/LLM_Grader_Analysis
Official

Taxonomy

Topics: Online Learning and Analytics · Advanced Data Processing Techniques · Topic Modeling