Are Large Language Models Good Essay Graders?
Anindita Kundu, Denilson Barbosa

TL;DR
This study assesses the effectiveness of Large Language Models such as ChatGPT and Llama in automated essay scoring, finding that they tend to under-score essays and align poorly with human evaluations, but show potential as grading assistants.
Contribution
It provides a comprehensive evaluation of LLMs for essay grading, comparing their scores with human ratings and analyzing various scoring features and models.
Findings
LLMs generally assign lower scores than humans.
Scores from LLMs do not correlate well with human scores.
Llama 3 performs better than earlier models.
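The first two findings can be illustrated with a small sketch. It assumes paired scores for the same essays and uses the mean score gap and the Pearson correlation as agreement measures. Those two measures are an illustrative choice here, not necessarily the metrics the authors report, and the toy scores and the function names `mean_score_gap` and `pearson` are made up for the example.

```python
from statistics import mean


def mean_score_gap(llm_scores, human_scores):
    """Average of (LLM score - human score); negative means the LLM under-scores."""
    return mean(l - h for l, h in zip(llm_scores, human_scores))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data (NOT from the paper): one score per essay from each rater.
human_scores = [4, 3, 5, 2, 4, 3]
llm_scores = [2, 3, 3, 2, 3, 1]

gap = mean_score_gap(llm_scores, human_scores)
r = pearson(llm_scores, human_scores)

print(f"mean gap (LLM - human): {gap:.2f}")
print(f"Pearson correlation:    {r:.2f}")
```

A negative gap corresponds to the under-scoring described above, and a correlation well below 1 corresponds to weak alignment with the human raters.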
Abstract
We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: LLaMA
