LLMs Do Not Grade Essays Like Humans
Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa

TL;DR
This paper evaluates how well essay scores assigned by large language models align with human grades, finding limited agreement, systematic differences in scoring patterns, and scores that are generally consistent with the models' own feedback.
Contribution
It analyzes how GPT and Llama models score essays out of the box, without task-specific training, and compares their behavior with human grading, showing both their potential and their limitations for automated essay scoring.
Findings
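Compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays.
They assign lower scores than human raters to longer essays that contain minor grammatical or spelling errors.
LLM scores are generally consistent with the feedback the models generate: essays receiving more praise tend to receive higher scores.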
Abstract
Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism…
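As an illustration of what "agreement between LLM and human scores" can mean in practice, here is a minimal sketch of quadratic weighted kappa, a statistic commonly used in essay scoring work. The abstract as shown here does not say which agreement measure the authors use, so the choice of statistic, the function name, and the score range are all assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Illustrative helper (hypothetical, not taken from the paper).
    Returns 1.0 for perfect agreement, about 0.0 for chance-level
    agreement, and negative values for systematic disagreement.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n = len(human)
    span = (max_score - min_score) ** 2   # largest possible squared gap
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters were independent.
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Called as `quadratic_weighted_kappa(human_scores, llm_scores, 1, 6)` on two equal-length lists of integer grades, it returns a single agreement value. Computing it separately for short and long essays would be one way to see whether agreement varies with essay characteristics, as the abstract reports.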
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Intelligent Tutoring Systems and Adaptive Learning
