LLMs Do Not Grade Essays Like Humans

Jerin George Mathew; Sumayya Taher; Anindita Kundu; Denilson Barbosa

arXiv:2603.23714·cs.AI·March 26, 2026

Open Access

TL;DR

This paper evaluates how closely essay scores produced by large language models match human grades. It finds that agreement is relatively weak and varies with essay characteristics, while the scores the models assign are generally consistent with the feedback they generate.

Contribution

The paper analyzes the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training, and compares it with human grading to show both the potential and the limitations of LLMs for automated essay scoring.

Findings

01

Compared with human raters, LLMs tend to assign higher scores to short or underdeveloped essays.

02

Compared with human raters, LLMs tend to assign lower scores to longer essays that contain minor grammatical or spelling errors.

03

LLM scores are generally consistent with the feedback the models generate: essays that receive more praise tend to receive higher scores.

Abstract

Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism…
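
To make "agreement between LLM and human scores" concrete, here is a minimal sketch of one way to quantify it. The abstract as shown here does not say which statistic the authors use; quadratic weighted kappa is a common choice in automated essay scoring and is used below only as an illustration. The function name, its parameters, and the example scores are hypothetical and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Illustrative only: the metric choice is an assumption, not the paper's
    stated method. 1.0 means perfect agreement, 0.0 means chance-level
    agreement, and negative values mean worse than chance.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("the score scale needs at least two distinct values")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for human scores
    model_hist = Counter(model)            # marginal counts for model scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: larger score gaps count more heavily.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0:
        # Both raters gave one identical score to every essay; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Made-up scores on a 1-6 scale, purely for demonstration.
    human_scores = [2, 3, 4, 4, 5, 6]
    model_scores = [3, 3, 4, 3, 4, 5]
    print(quadratic_weighted_kappa(human_scores, model_scores, 1, 6))
```

Quadratic weights penalize a two-point disagreement four times as much as a one-point disagreement, which suits ordinal essay scales where near misses matter less than large gaps.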

Taxonomy

Topics: Topic Modeling · Authorship Attribution and Profiling · Intelligent Tutoring Systems and Adaptive Learning