Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

Longwei Cong; Sonja Hahn; Sebastian Gombert; Leon Camus; Hendrik Drachsler; Ulf Kroehne

arXiv:2605.00238 · cs.CL · May 14, 2026

TL;DR

This paper introduces an item response theory framework to evaluate LLM-based automated short answer grading, providing detailed insights into response difficulty and model performance variations beyond traditional aggregate metrics.

Contribution

The study applies IRT to analyze LLM grading performance, revealing how accuracy varies with response difficulty and identifying factors influencing grading errors.

Findings

01

Models with similar overall performance differ in how sharply their grading accuracy declines as response difficulty increases.

02

Errors on difficult responses often involve the partially_correct_incomplete label.

03

Higher response difficulty correlates with weaker semantic alignment and greater semantic isolation.

Abstract

Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response…
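To make the formulation concrete, here is a minimal sketch of modelling grading correctness as a function of grader ability and response difficulty. The abstract does not state which IRT model or estimator the authors use, so this assumes the simplest one-parameter (Rasch) form, P(correct) = sigmoid(ability − difficulty), fitted by gradient ascent with a small L2 penalty. The function names, the penalty, and the simulated data are all illustrative assumptions, not the paper's method or results.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(grader j labels response i correctly) = sigmoid(theta_j - b_i)."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def fit_rasch(X, n_iter=500, lr=0.5, lam=0.01):
    """Estimate grader abilities and response difficulties.

    X is a binary matrix of shape (n_graders, n_responses); X[j, i] = 1
    when grader j's label matches the reference label for response i.
    Uses gradient ascent on the Bernoulli log-likelihood. The L2 penalty
    `lam` is an assumption of this sketch: it keeps estimates finite for
    responses that every grader gets right (or wrong).
    """
    n_graders, n_responses = X.shape
    theta = np.zeros(n_graders)   # latent grader ability
    b = np.zeros(n_responses)     # latent response grading difficulty
    for _ in range(n_iter):
        resid = X - rasch_prob(theta, b)          # observed minus predicted
        theta += lr * (resid.mean(axis=1) - lam * theta)
        b += lr * (-resid.mean(axis=0) - lam * b)
    return theta, b

# Simulated stand-in data (hypothetical, not taken from the paper):
# 17 graders, 400 responses, correctness drawn from the Rasch model.
rng = np.random.default_rng(0)
true_theta = np.linspace(-1.0, 2.0, 17)
true_b = rng.normal(0.0, 1.0, size=400)
X = (rng.random((17, 400)) < rasch_prob(true_theta, true_b)).astype(float)

est_theta, est_b = fit_rasch(X)
theta_corr = np.corrcoef(true_theta, est_theta)[0, 1]
b_corr = np.corrcoef(true_b, est_b)[0, 1]
print(f"ability recovery r = {theta_corr:.2f}, difficulty recovery r = {b_corr:.2f}")
```

Sorting responses by `est_b` and plotting each grader's accuracy along that axis gives the kind of response-level view that aggregate metrics such as macro-F1 hide. Under a pure Rasch model all graders share the same curve shape, so capturing differences in how sharply accuracy declines would need a richer model than this sketch.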
