ASAG2024: A Combined Benchmark for Short Answer Grading
Gérôme Meyer, Philip Breuer, Jonathan Fürst

TL;DR
The paper introduces ASAG2024, a comprehensive benchmark combining multiple datasets for short answer grading, to evaluate and compare automated grading systems across various subjects and scales.
Contribution
It presents the first unified benchmark for short answer grading, enabling systematic evaluation of SAG methods across diverse datasets and scales.
Findings
LLM-based SAG approaches outperform previous methods
Current automated systems still lag behind human grading performance
The benchmark facilitates future research in generalizable SAG solutions
Abstract
Open-ended questions test a more thorough understanding than closed-ended questions and are often a preferred assessment method. However, open-ended questions are tedious to grade and subject to personal bias. Therefore, there have been efforts to speed up the grading process through automation. Short Answer Grading (SAG) systems aim to automatically score students' answers. Despite growth in SAG methods and capabilities, there exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. Thus, it is hard to assess the capabilities of current automated grading methods in terms of their generalizability. In this preliminary work, we introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems. It combines seven commonly used short-answer grading datasets in a common structure and grading scale.…
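As a rough illustration of what bringing datasets onto "a common structure and grading scale" could involve, the sketch below min-max rescales grades from differently scaled datasets to [0, 1] and stores them in one record type. The record fields, the function names, and the choice of min-max rescaling are assumptions made for illustration; the abstract does not specify how the benchmark performs this step.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """Hypothetical unified record; field names are illustrative, not from the paper."""
    dataset: str
    question: str
    reference_answer: str
    student_answer: str
    normalized_grade: float  # always in [0, 1]


def normalize_grade(grade: float, min_grade: float, max_grade: float) -> float:
    """Min-max rescale a grade from [min_grade, max_grade] to [0, 1]."""
    if max_grade <= min_grade:
        raise ValueError("max_grade must be greater than min_grade")
    if not min_grade <= grade <= max_grade:
        raise ValueError("grade lies outside the dataset's scale")
    return (grade - min_grade) / (max_grade - min_grade)


def to_unified(dataset: str, question: str, reference_answer: str,
               student_answer: str, grade: float,
               scale: tuple[float, float]) -> GradedAnswer:
    """Build a unified record from one raw row and its dataset's (min, max) scale."""
    low, high = scale
    return GradedAnswer(
        dataset=dataset,
        question=question,
        reference_answer=reference_answer,
        student_answer=student_answer,
        normalized_grade=normalize_grade(grade, low, high),
    )


# Two made-up rows from datasets with different grading scales (0-5 and 0-2).
row_a = to_unified("dataset_a", "What is a stack?", "A LIFO structure.",
                   "Last in, first out.", grade=4, scale=(0, 5))
row_b = to_unified("dataset_b", "Define osmosis.", "Diffusion of water.",
                   "Water moving across a membrane.", grade=1, scale=(0, 2))
```

With every grade on the same [0, 1] scale, a single error metric can be computed across all source datasets instead of one per scale.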
