LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska

TL;DR
This paper evaluates how well instruction-tuned LLMs align with human essay scoring, revealing biases and the impact of prompt design, and suggests bias correction strategies for educational assessment.
Contribution
It systematically assesses LLM scoring agreement, bias, and prompt effects across multiple datasets, highlighting the importance of bias correction over fine-tuning.
Findings
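Strong open-weight LLMs achieve moderate to high agreement with human raters on holistic scores (Quadratic Weighted Kappa, QWK, about 0.6).
Models show a stable negative bias on Lower-Order Concern (LOC) traits such as Grammar and Conventions, scoring them more harshly than human raters.
Concise keyword prompts outperform longer rubric prompts in analytic scoring.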
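The two quantities behind these findings, agreement (QWK) and directional bias, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: it assumes integer scores on a known scale, and the function names `quadratic_weighted_kappa` and `mean_signed_bias` are hypothetical. Directional bias is taken here as the mean of (model score minus human score), so a negative value means the model scores more harshly.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """QWK between two integer score lists on the scale [min_score, max_score]."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the scale needs at least two score levels")
    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * human_hist[i] * model_hist[j] / n**2
    if weighted_expected == 0:
        # Both raters gave one identical constant score; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


def mean_signed_bias(human, model):
    """Mean of (model - human); negative values mean the model is harsher."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(m - h for h, m in zip(human, model)) / len(human)
```

QWK alone does not show the direction of disagreement, which is why a signed bias is reported alongside it: two models with the same QWK can err in opposite directions.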
Abstract
Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise…
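To make the idea of a stable bias estimate concrete, the sketch below bootstraps a confidence interval for the mean signed bias and then applies a constant-offset correction. Both choices are assumptions made for illustration: the visible part of the abstract does not specify how stability is measured or which correction is used, and the names `bootstrap_bias_interval` and `correct_scores` are hypothetical. The correction also assumes an integer score scale.

```python
import random
from statistics import mean


def bootstrap_bias_interval(human, model, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) for the mean of (model - human)."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [m - h for h, m in zip(human, model)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Resample essays with replacement and recompute the mean bias each time.
    estimates = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


def correct_scores(model_scores, bias, min_score, max_score):
    """Subtract an estimated bias, round, and clamp to the score scale.

    Assumption: a single constant offset per trait, estimated on held-out
    human-scored essays. Python's round() rounds exact halves to even.
    """
    return [min(max_score, max(min_score, round(s - bias))) for s in model_scores]
```

A narrow interval that excludes zero would suggest the bias is systematic rather than noise, which is the situation in which a simple offset correction is worth trying before any fine-tuning.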
