Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory
Masaki Uto

TL;DR
This paper derives dataset-specific ceilings on quadratic weighted kappa (QWK) from classical test theory, giving a reference point for how much accuracy automated essay scoring (AES) models can achieve on benchmarks whose labels contain human scoring error.
Contribution
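It proposes two QWK ceilings based on the reliability concept in classical test theory. Both can be estimated from standard two-rater benchmarks without additional annotation, and they serve as theoretical and practical targets for evaluating AES models.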
Findings
The theoretical ceiling is the maximum QWK that an ideal model, one that perfectly predicts latent true scores, can achieve against noisy labels.
The human-like ceiling is the QWK attainable by a model whose scoring error matches that of a human rater.
Experiments show that current AES models fall below the theoretical ceiling, indicating room for improvement.
Abstract
Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that…
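To make the role of label noise concrete, the following is a minimal illustrative sketch, not the paper's estimator. It computes QWK from its standard definition and then simulates a classical-test-theory setting (observed score = true score + independent rater error). The score range, the normal distributions, the noise level, and the names `quadratic_weighted_kappa` and `simulate_ceiling_gap` are all assumptions chosen for illustration.

```python
import numpy as np


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two integer score vectors on [min_score, max_score]."""
    a = np.asarray(a, dtype=int) - min_score
    b = np.asarray(b, dtype=int) - min_score
    k = max_score - min_score + 1

    # Observed confusion matrix of score pairs.
    observed = np.zeros((k, k))
    np.add.at(observed, (a, b), 1)

    # Expected counts if the two score vectors were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic disagreement weights, 0 on the diagonal and 1 at the extremes.
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


def simulate_ceiling_gap(n=20000, noise_sd=0.8, min_score=1, max_score=6, seed=0):
    """Simulate noisy human labels around latent true scores (all values assumed)."""
    rng = np.random.default_rng(seed)
    true_scores = rng.normal(3.5, 1.0, size=n)  # hypothetical latent true scores

    def to_rubric(latent):
        # Round to the nearest rubric point and clip to the valid range.
        return np.clip(np.rint(latent), min_score, max_score).astype(int)

    rater1 = to_rubric(true_scores + rng.normal(0.0, noise_sd, size=n))
    rater2 = to_rubric(true_scores + rng.normal(0.0, noise_sd, size=n))
    ideal_model = to_rubric(true_scores)  # a model with no scoring error

    return {
        "ideal_vs_rater": quadratic_weighted_kappa(ideal_model, rater1, min_score, max_score),
        "rater_vs_rater": quadratic_weighted_kappa(rater1, rater2, min_score, max_score),
    }
```

Calling `simulate_ceiling_gap()` returns two QWK values. In this toy setting, even the error-free model scores below 1.0 against a noisy rater, and it agrees with that rater more closely than a second noisy rater does, which mirrors the distinction the abstract draws between a theoretical ceiling and a human-level target.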
