Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education
Andrea Gaggioli, Giuseppe Casaburi, Leonardo Ercolani, Francesco Collova', Pietro Torre, Fabrizio Davide

TL;DR
This study evaluates the reliability and validity of five advanced LLMs for automated student essay assessment, revealing significant inconsistencies and limited agreement with human judgments, thus emphasizing the need for human oversight.
Contribution
It provides a comprehensive empirical analysis of multiple LLMs' performance in real-world essay scoring, highlighting current limitations and variability in automated assessment accuracy.
Findings
Low human-LLM agreement across models
Weak intra-model reliability across replications
Moderate convergence for some criteria, poor for others
Abstract
This study investigates the reliability and validity of five advanced Large Language Models (LLMs), namely Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement, measured with Quadratic Weighted Kappa, was consistently low and non-significant, and within-model reliability across replications was similarly weak (median Kendall's W < 0.30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate…
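The two agreement statistics named in the abstract can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names, the 1-to-5 rating scale, and the toy scores are all assumptions made for illustration, and the Kendall's W shown here omits the tie correction that a full analysis of rubric scores would likely need.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Cohen's kappa with quadratic weights for two lists of integer ratings."""
    n = max_rating - min_rating + 1
    total = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - numerator / denominator


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(score_sets):
    """Kendall's W for m score lists over the same n items (no tie correction)."""
    m = len(score_sets)
    n = len(score_sets[0])
    ranked = [average_ranks(scores) for scores in score_sets]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy data, invented for illustration only (not taken from the study).
human = [3, 4, 2, 5, 1]
llm_runs = [
    [4, 4, 3, 5, 2],  # hypothetical replication 1
    [3, 5, 3, 4, 2],  # hypothetical replication 2
    [4, 3, 2, 5, 3],  # hypothetical replication 3
]
qwk = quadratic_weighted_kappa(human, llm_runs[0], 1, 5)
w = kendalls_w(llm_runs)
print(f"QWK (human vs. run 1): {qwk:.3f}")
print(f"Kendall's W across runs: {w:.3f}")
```

In this sketch, QWK compares one human score list with one model run, while Kendall's W summarizes how consistently the three runs rank the same essays. Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.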
Taxonomy
Topics: Topic Modeling · Mental Health via Writing · Intelligent Tutoring Systems and Adaptive Learning
