Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading
Gerd Kortemeyer

TL;DR
This paper evaluates GPT-4's effectiveness in automated short answer grading, comparing it to specialized models and analyzing its performance on standard datasets without additional training.
Contribution
It provides a comprehensive assessment of GPT-4's capabilities in ASAG tasks, highlighting its strengths and limitations relative to specialized models.
Findings
GPT-4 performs comparably to hand-engineered models.
GPT-4 underperforms compared to specialized pre-trained LLMs.
Withholding reference answers affects grading performance.
Abstract
Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses in spite of limited availability of human graders. Over the years, carefully trained models have achieved increasingly higher levels of performance. More recently, pre-trained Large Language Models (LLMs) emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that overall, the performance of the pre-trained…
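The two grading conditions described in the abstract (with and without a reference answer) can be sketched as follows. This is a minimal illustration, not the paper's actual prompts or evaluation code: the prompt wording, the function names `build_prompt` and `macro_f1`, and the choice of macro-averaged F1 as the metric are all assumptions. The label sets follow the usual 2-way and 3-way conventions for SciEntsBank and Beetle. No model is called; predictions are assumed to come from whatever LLM interface is used.

```python
from typing import Optional, Sequence

# Assumed label sets for the 2-way and 3-way tasks.
LABELS_2WAY = ("correct", "incorrect")
LABELS_3WAY = ("correct", "contradictory", "incorrect")


def build_prompt(
    question: str,
    student_answer: str,
    labels: Sequence[str],
    reference_answer: Optional[str] = None,
) -> str:
    """Build a grading prompt (hypothetical wording).

    Passing reference_answer=None models the condition in which the
    reference answer is withheld from the grader.
    """
    lines = [
        "Grade the student answer to the question below.",
        f"Reply with exactly one of: {', '.join(labels)}.",
        f"Question: {question}",
    ]
    if reference_answer is not None:
        lines.append(f"Reference answer: {reference_answer}")
    lines.append(f"Student answer: {student_answer}")
    return "\n".join(lines)


def macro_f1(
    gold: Sequence[str], predicted: Sequence[str], labels: Sequence[str]
) -> float:
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # A label that never occurs in gold or predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Scoring the same items once with and once without `reference_answer` gives a direct comparison between the two conditions.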
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Layer Normalization
