Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Abigail Victoria Gurin Schleifer; Moriah Ariely; Beata Beigman Klebanov; Asaf Salman; Giora Alexandron

arXiv:2605.07647·cs.CL·May 11, 2026

TL;DR

This study compares how few-shot large language models and a fine-tuned encoder score student short answers. It finds that mid-range (partially correct) responses are scored less reliably than clear-cut ones, and that the degree of task-specific adaptation affects how large this drop is.

Contribution

The paper compares the scoring agreement of three few-shot LLMs, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, highlighting how task-specific adaptation affects scoring consistency.

Findings

01

Human-human agreement is the highest and remains stable across response quality levels.

02

AI models perform well on clear-cut responses but struggle with mid-range responses.

03

Task-specific adaptation reduces mid-range scoring degradation, especially in fine-tuned models.

Abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable…
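To make the idea of quality-conditioned agreement concrete, the sketch below groups responses by their ground-truth score and reports how often a scorer matches that score within each group. This is only an illustration under stated assumptions: the abstract shown here does not specify the paper's agreement metric, so plain exact-match agreement is used as a stand-in, and the function name `agreement_by_quality`, the 0-2 score scale, and the toy data are all hypothetical.

```python
from collections import defaultdict


def agreement_by_quality(gold, predicted):
    """Return the exact-match agreement rate for each ground-truth score level.

    Assumption: exact match is a stand-in metric, not necessarily the
    one used in the paper.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    hits = defaultdict(int)    # matches per ground-truth level
    totals = defaultdict(int)  # responses per ground-truth level
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Toy data invented for illustration (not from the paper); hypothetical 0-2 scale.
gold = [0, 0, 1, 1, 1, 1, 2, 2]
model = [0, 0, 1, 0, 2, 1, 2, 2]

rates = agreement_by_quality(gold, model)
for level, rate in rates.items():
    print(f"gold score {level}: agreement {rate:.2f}")
```

In this toy data the scorer matches every response at the extremes but only half of the mid-range ones, which is the kind of pattern a quality-conditioned breakdown exposes and an overall agreement figure would hide.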
