LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Grace Byun, Swati Rajwal, Jinho D. Choi

TL;DR
This paper evaluates GPT-4o's ability to grade short-answer quizzes and project reports in an undergraduate course. It finds high correlation with human grading, with some variability on technical, open-ended responses.
Contribution
The paper provides empirical evidence on GPT-4o's effectiveness as a grader in a real undergraduate course, highlighting its potential and its limitations relative to human evaluators.
Findings
GPT-4o achieves up to 0.98 correlation with human scores
Exact score agreement in 55% of quiz cases
Strong overall alignment with human grading on project reports, with some variability on technical, open-ended responses
Abstract
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We…
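The two agreement measures named in the abstract, correlation and exact score agreement, can be illustrated with a minimal sketch. It assumes Pearson correlation, since the abstract does not say which coefficient was used, and the scores below are made-up placeholders rather than data from the paper. The function names `pearson` and `exact_agreement` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: the paper's correlation type is not stated in the abstract;
    Pearson is used here only for illustration.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items where both graders gave exactly the same score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical scores for six student answers (not from the paper).
human_scores = [3, 2, 5, 4, 1, 4]
llm_scores = [3, 2, 4, 4, 2, 5]

r = pearson(human_scores, llm_scores)
agreement = exact_agreement(human_scores, llm_scores)
print(f"correlation: {r:.2f}, exact agreement: {agreement:.0%}")
```

The two measures answer different questions: correlation captures whether the LLM ranks answers similarly to the human grader, while exact agreement counts only identical scores, so a grader can score high on the first and noticeably lower on the second.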
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification
