Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Hongli Li; Che Han Chen; Kevin Fan; Chiho Young-Johnson; Soyoung Lim; Yali Feng

arXiv:2512.14561 · cs.CL · December 17, 2025

Open Access

TL;DR

This research synthesis evaluates the reliability of large language models in automatic essay scoring by analyzing 65 studies, revealing moderate to good agreement with human raters but highlighting variability and reporting inconsistencies.

Contribution

The paper systematically reviews and synthesizes recent empirical studies of LLMs in essay scoring, giving an overview of how closely LLM-assigned scores agree with those of human raters.

Findings

01 LLMs show moderate to good agreement with human raters.

02 Agreement indices mostly range between 0.30 and 0.80.

03 Agreement levels vary substantially across studies, and reporting practices are inconsistent.

Abstract

Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
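For readers unfamiliar with the agreement indices named in the abstract, the following is a minimal sketch of how Quadratic Weighted Kappa and the Pearson correlation can be computed for one set of paired scores. It is illustrative only and is not taken from the paper: the function names, the integer score scale, and the example scores are assumptions, and Spearman's rho is left out for brevity.

```python
import math
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """QWK for two lists of integer scores on the scale min_score..max_score.

    Hypothetical helper for illustration; not the paper's code.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score categories")

    n = len(human)
    k = max_score - min_score + 1          # number of score categories
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_h = Counter(human)                # marginal counts for human scores
    hist_m = Counter(model)                # marginal counts for model scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion under independence of the two raters.
            weighted_expected += weight * hist_h[i] * hist_m[j] / n ** 2

    if weighted_expected == 0:
        return float("nan")  # undefined when both raters use one identical score
    return 1.0 - weighted_observed / weighted_expected


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores on an assumed 1-4 scale, purely for demonstration.
human_scores = [1, 2, 3, 4, 2, 3]
llm_scores = [1, 2, 4, 4, 3, 3]
print(quadratic_weighted_kappa(human_scores, llm_scores, 1, 4))
print(pearson_r(human_scores, llm_scores))
```

Both indices equal 1 for identical score lists, but they measure different things: QWK penalizes larger score discrepancies more heavily and corrects for chance agreement, while Pearson correlation reflects only linear association. This is one reason values reported under different indices are not directly comparable.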

Taxonomy

Topics: Topic Modeling · Psychometric Methodologies and Testing · Reliability and Agreement in Measurement