Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset
Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane

TL;DR
This study evaluates GPT-4's ability to grade short-answer reading comprehension questions using a new dataset from Ghana, showing near-human performance and highlighting potential for scalable formative assessment.
Contribution
Introduces a novel dataset for short-answer grading in a low-resource context and empirically demonstrates GPT-4's high accuracy in grading student responses.
Findings
GPT-4 achieved QWK 0.92 and F1 0.89, near human performance.
The dataset is from Ghana, expanding evaluation beyond high-income countries.
Minimal prompt engineering sufficed for high-quality grading.
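For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how quadratic weighted kappa (QWK) and F1 can be computed from paired human and model grades. It is not the authors' evaluation code. The function names, the integer label encoding, and the choice of binary F1 are assumptions made for illustration; the summary above does not state how labels were encoded or which F1 averaging was used.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """QWK between two equal-length lists of integer labels in range(num_classes).

    Illustrative helper (hypothetical name), assuming ordinal integer grades.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Disagreement penalty grows with the squared distance between grades.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters used one identical label throughout: trivially perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def binary_f1(gold, predicted, positive=1):
    """F1 for the `positive` class (binary F1 is an assumption for this sketch)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data, invented for illustration only (not from the paper's dataset).
human_grades = [0, 1, 2, 2, 1, 0]
model_grades = [0, 1, 2, 1, 1, 0]
qwk = quadratic_weighted_kappa(human_grades, model_grades, num_classes=3)
f1 = binary_f1([1, 0, 1, 1], [1, 1, 1, 0])
```

QWK penalizes a disagreement by the squared distance between the two grades, so a model that is off by one grade level is penalized less than one that is off by two. A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.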
Abstract
Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment as they provide more specific insights into what students do and don't know. However, grading open-ended questions can be time-consuming, leading teachers to resort to simpler question formats or conduct fewer formative assessments. While there has been a longstanding interest in automated short-answer grading (ASAG), previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short-answer questions more feasible. This paper investigates the potential for the newest version of LLMs to be used in ASAG, specifically in the grading of short-answer questions for formative assessments, in two ways. First, it introduces a novel…
Taxonomy
Topics: Computational Physics and Python Applications · Topic Modeling · Text Readability and Simplification
Methods: Sparse Evolutionary Training · Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Dense Connections · Position-Wise Feed-Forward Layer · Cosine Annealing · Linear Layer
