LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Grace Byun; Swati Rajwal; Jinho D. Choi

arXiv:2511.10819 · cs.CL · November 19, 2025

TL;DR

This paper evaluates GPT-4o's ability to grade short-answer quizzes and project reports in an undergraduate Computational Linguistics course. Its scores correlate strongly with human grading, though they vary more on technical, open-ended responses, which bears on the practical use of LLMs in educational assessment.

Contribution

The paper provides empirical evidence from a real classroom on how closely GPT-4o's grades match those of the course teaching assistants, highlighting both its potential and its limitations relative to human evaluators.

Findings

01. GPT-4o's scores correlate with human scores at up to 0.98.

02. GPT-4o and human graders assign exactly the same score in 55% of quiz cases.

03. On project reports, GPT-4o shows strong overall alignment with human grading, with some variability on technical, open-ended responses.

Abstract

Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We…
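
The two agreement measures named in the abstract, correlation and exact score agreement, can be illustrated with a minimal sketch. It assumes Pearson correlation, which the abstract does not specify, and the function names and toy scores are hypothetical illustrations, not data or code from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: the abstract does not say which correlation
    coefficient was used; Pearson is shown only as one common choice.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of responses where both graders gave the same score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical toy scores for six responses (not from the paper).
human_scores = [3, 2, 5, 4, 1, 4]
llm_scores = [3, 2, 4, 4, 2, 5]

print(f"correlation:     {pearson(human_scores, llm_scores):.2f}")
print(f"exact agreement: {exact_agreement(human_scores, llm_scores):.0%}")
```

The two measures answer different questions: correlation captures whether the LLM ranks responses the way the human grader does, while exact agreement counts only identical scores, so a grader can correlate strongly with humans and still match exactly in a smaller share of cases.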

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification