Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

Andrea Gaggioli; Giuseppe Casaburi; Leonardo Ercolani; Francesco Collova'; Pietro Torre; Fabrizio Davide

arXiv:2508.02442·cs.CY·August 5, 2025

Open Access

TL;DR

This study evaluates the reliability and validity of five advanced LLMs for automated student essay assessment, revealing significant inconsistencies and limited agreement with human judgments, thus emphasizing the need for human oversight.

Contribution

It provides a comprehensive empirical analysis of multiple LLMs' performance in real-world essay scoring, highlighting current limitations and variability in automated assessment accuracy.

Findings

01. Low human-LLM agreement across models
02. Weak intra-model reliability across replications
03. Moderate convergence for some criteria, poor for others

Abstract

This study investigates the reliability and validity of five advanced Large Language Models (LLMs): Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement, measured with Quadratic Weighted Kappa, was consistently low and non-significant, and within-model reliability across replications was similarly weak (median Kendall's W < 0.30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate…
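The two statistics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `min_score`/`max_score` parameters, and the toy scores are all assumptions made for illustration, since the listing does not state the rubric's score range or the software used. Quadratic Weighted Kappa compares two raters on an ordinal scale and penalizes disagreements by squared distance; Kendall's W measures how consistently several scoring runs rank the same essays, here with the standard tie correction because rubric scores repeat often.

```python
from collections import Counter


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Assumes at least two score categories and that the raters are not both
    constant on the same category (otherwise the denominator is zero).
    """
    k = max_score - min_score + 1
    n = len(a)
    # Observed joint counts of (rater A score, rater B score).
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n  # expected count under independence
    return 1.0 - num / den


def average_ranks(scores):
    """1-based ranks, with tied scores sharing the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def kendalls_w(runs):
    """Kendall's W with tie correction; `runs` holds one score list per run."""
    m, n = len(runs), len(runs[0])
    ranked = [average_ranks(r) for r in runs]
    totals = [sum(r[i] for r in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie term: sum of (t^3 - t) over every group of tied scores in every run.
    ties = sum(t ** 3 - t for r in runs for t in Counter(r).values())
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


# Hypothetical scores for five essays on an assumed 1-5 scale.
human = [3, 4, 2, 5, 3]
llm_runs = [[4, 4, 3, 4, 3], [3, 5, 4, 4, 4], [4, 3, 3, 5, 4]]
print(quadratic_weighted_kappa(human, llm_runs[0], 1, 5))
print(kendalls_w(llm_runs))
```

In this sketch each replication plays the role of a "rater" for Kendall's W, which matches the abstract's description of three prompt replications per model. How the study aggregated W across criteria and models is not recoverable from the truncated abstract.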

Taxonomy

Topics: Topic Modeling · Mental Health via Writing · Intelligent Tutoring Systems and Adaptive Learning