Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

Ming Ding; Rasmus Kyng; Federico Solda; Weixuan Yuan

arXiv:2505.13664 · cs.CY · May 21, 2025

Open Access

TL;DR

This study evaluates the ability of GPT-4o and o1-preview to solve problems from an undergraduate algorithms course under blind grading, revealing strengths and weaknesses in their reasoning and in their performance relative to students.

Contribution

It assesses the problem-solving capabilities of GPT-4o and o1-preview in a real educational setting with blind grading, highlighting specific model limitations.

Findings

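01 GPT-4o consistently struggles and fails to reach the passing threshold.

02 o1-preview surpasses the passing score and, in certain exercises, exceeds the student median.

03 Both models show issues with unjustified claims and misleading arguments.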

Abstract

As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymous GPT-generated solutions to take-home exams were graded by teaching assistants unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies…

Taxonomy

Topics: Higher Education Learning Practices · Numerical Methods and Algorithms