Are Large Language Models Good Essay Graders?

Anindita Kundu; Denilson Barbosa

arXiv:2409.13120 · cs.CL · September 23, 2024 · 2 cites

Open Access

TL;DR

This study assesses the effectiveness of Large Language Models such as ChatGPT and Llama in automated essay scoring. It finds that they tend to under-score essays and align poorly with human evaluations, but show potential as grading assistants.

Contribution

The paper evaluates ChatGPT and Llama on essay grading, comparing their numeric scores with human ratings on the ASAP dataset across zero-shot and few-shot settings and different prompting approaches.

Findings

01. LLMs generally assign lower scores than humans.

02. Scores from LLMs do not correlate well with human scores.

03. Llama 3 performs better than earlier models.

Abstract

We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by…
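
The comparison the abstract describes, checking whether LLM grades run lower than human grades and whether the two sets of grades correlate, can be illustrated with a minimal sketch. The scores below are made up for illustration and are not from the paper or the ASAP dataset; the helper names `mean_score_gap` and `pearson` are hypothetical; and the use of a mean signed difference and Pearson correlation is an assumption, since this excerpt does not say which agreement metrics the authors report.

```python
from math import sqrt
from statistics import mean


def mean_score_gap(llm_scores, human_scores):
    """Average of (LLM score - human score); negative means the LLM under-scores."""
    return mean(l - h for l, h in zip(llm_scores, human_scores))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six essays (illustrative only, not real data).
human = [4, 3, 5, 2, 4, 3]
llm = [3, 3, 3, 2, 2, 3]

gap = mean_score_gap(llm, human)
r = pearson(llm, human)
print(f"mean gap (LLM - human): {gap:.2f}")
print(f"Pearson correlation:    {r:.2f}")
```

In this toy data the gap is negative (the LLM scores lower on average) and the correlation is weak, which mirrors the direction of the findings summarized above without reproducing any of the paper's numbers.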

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: LLaMA