Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Michael Hardy

arXiv:2603.04820 · cs.CL · March 27, 2026

TL;DR

This meta-analysis finds that current LLMs struggle with short-answer scoring, including on some items that human scorers find easiest, and it identifies technological and bias-related shortcomings that affect their performance.

Contribution

The study provides a comprehensive meta-analytic assessment of LLM short-answer scoring, identifying key technological limitations and biases affecting performance.

Findings

01

Some scoring tasks that are easiest for human scorers are the hardest for LLMs.

02

On average, decoder-only architectures trail encoder models by 0.37 QWK in agreement with human scorers.

03

Tokenizer vocabulary size shows diminishing returns, indicating undertraining issues.

Abstract

Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed effects metaregression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. In particular, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether through poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37, a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful…
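For readers unfamiliar with the effect size named in the abstract, the following is a minimal sketch of how Quadratic Weighted Kappa is conventionally computed between two sets of integer scores. It is not taken from the paper: the function name `quadratic_weighted_kappa`, its parameters, and the toy scores are illustrative assumptions, and the paper's own pipeline may differ.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Standard QWK for integer scores in 0..num_categories-1.

    Illustrative helper (hypothetical name), not the paper's code.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("score lists must be non-empty and of equal length")
    k = num_categories
    if k < 2:
        raise ValueError("need at least two score categories")

    # Observed confusion matrix: observed[i][j] counts items scored i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: grows with the squared score distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Expected count under independence of the two raters.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - weighted_observed / weighted_expected


# Toy scores (made up for illustration): human versus model on a 0-2 rubric.
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 2, 1, 0]
print(quadratic_weighted_kappa(human, model, num_categories=3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. On this scale, the 0.37 gap reported between decoder-only and encoder models is large.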

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques