LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

Filip J. Kucia; Anirban Chakraborty; Anna Wróblewska

arXiv:2604.00259·cs.CL·April 2, 2026


TL;DR

This paper evaluates how closely instruction-tuned LLMs align with human essay scoring, documents their directional biases and the effect of prompt design, and suggests bias-correction strategies for educational assessment.

Contribution

It systematically assesses LLM scoring agreement, bias, and prompt effects across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS), highlighting the importance of bias correction over fine-tuning.

Findings

01

Strong open-weight LLMs achieve moderate to high agreement with humans on holistic scores (Quadratic Weighted Kappa, QWK, of about 0.6).

02

Models show a large, stable negative bias on Lower-Order Concern (LOC) traits such as Grammar and Conventions, scoring them more harshly than human raters.

03

Concise keyword prompts outperform longer rubric prompts in analytic scoring.

Abstract

Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise…
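The two headline metrics in the abstract, agreement (Quadratic Weighted Kappa) and directional bias, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the integer score scale, and the use of mean signed error (model minus human) as the bias measure are assumptions made for illustration, and the scores in the usage note are invented toy values rather than data from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Assumes both lists use the same integer scale [min_score, max_score].
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    num_levels = max_score - min_score + 1
    if num_levels < 2:
        raise ValueError("the score scale needs at least two levels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for human scores
    model_hist = Counter(model)            # marginal counts for model scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: larger disagreements cost disproportionately more.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected disagreement if the two raters were independent.
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0:
        # Degenerate case: both raters gave one identical score to every essay.
        # Treating this as perfect agreement is a convention chosen here.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def mean_signed_error(human, model):
    """Directional bias as mean(model - human); negative means the model is harsher."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(m - h for h, m in zip(human, model)) / len(human)
```

As a usage example on toy scores, `quadratic_weighted_kappa([3, 4, 2, 5, 3], [2, 4, 1, 4, 3], 1, 5)` measures agreement on a 1 to 5 scale, while `mean_signed_error` on the same lists gives a negative value because the hypothetical model scores lower than the human on three of the five essays. Reporting the two together matters because a model can rank essays much as humans do while still being shifted consistently in one direction.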
