SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Peichao Lai, Kexuan Zhang, Yi Lin, Linyihan Zhang, Feiyang Ye, Jinhao Yan, Yanwei Xu, Conghui He, Yilei Wang, Wentao Zhang, Bin Cui

TL;DR
SAS-Bench is a new benchmark designed to evaluate large language models' ability to score short answers with fine-grained, step-by-step assessments, addressing limitations of existing coarse scoring methods.
Contribution
We introduce SAS-Bench, a comprehensive dataset and evaluation framework for detailed, explainable scoring of short answers using large language models.
Findings
Few-shot prompting improves scoring accuracy.
Large language models face challenges in science question scoring.
The benchmark enables detailed analysis of model reasoning and errors.
Abstract
Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also…
Peer Reviews
Decision · Submitted to ICLR 2026
* SAS-Bench aims to provide fine-grained analysis and explanations behind LLM-based SAS scoring, as well as actionable feedback. These problems are highly relevant to the EduNLP community.
* The approach of splitting answers into reasoning steps and evaluating each step for correctness and error analysis is an intuitive way to obtain fine-grained analysis.
* The authors release SAS-Bench publicly, containing 1,030 questions from a real-world exam (China’s National College Entrance Examination (Gaokao
* Since the primary contribution is a dataset, the authors need to justify that the synthetic student responses are well aligned with responses from real-world test takers. Reference responses are first generated by only 3 students and therefore lack diversity. LLMs are then prompted to diversify them and introduce errors to generate the final set of positive and negative responses. How well do LLMs perform at this synthetic task? Each LLM-based synthetic step should be evaluated for performanc
S1: The detailed, step-by-step score annotation and the error-type annotation are quite useful and a good addition to existing datasets.
S2: The introduced scores for overall, step-by-step, and error evaluation are sensible, and the evaluation is quite extensive in terms of the number of LLMs.
W1: The student answers seem to be mostly generated by LLMs. Some answers were generated by only six students. The distribution is not clear. How many are from students? How many of the eight generated answers per question are disregarded?
W2: The student answers are not real answers collected from students who actually took the test. This means that the dataset most likely does not contain many of the patterns found in real student responses, such as empty and half-completed answers.
W3:
- The paper is clearly written and well-structured.
- The figures are clean and easily understandable.
- The evaluation is thorough; a large number of reasoning and non-reasoning methods are tested along varied metrics that capture consistency between LLMs and experts (CCS) as well as LLM error consistency (ECS).
- The benchmark and dataset are made up of multiple domains, with many examples: over 1,000 questions as well as over 4,000 expert annotations.
- The motivation is unclear to me: in the introduction, works like Zuang et al., 2024, Deshpande et al., 2024, and Raina et al., 2024 are cited as showing the shortcomings of LLMs-as-judges. As far as I can tell, none of these works deal with SAS, which raises the question: why are these works used as motivation for creating an SAS benchmark? Additionally, the gap presented in Appendix J also seems somewhat insignificant. In general, it seems like LLMs already perform relatively we
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification
